[
  {
    "path": ".gitignore",
    "content": "stock_data/\nmodels/\n"
  },
  {
    "path": "README.md",
    "content": "# Unsupervised Stock Market Features Construction using Generative Adversarial Networks(GAN)\nDeep Learning constructs feature using only raw data. The leaned representation of the data outperforms expert features for many modalities including Radio Frequency ([Convolutional Radio Modulation Recognition Networks](https://arxiv.org/pdf/1602.04105.pdf)), computer vision ([Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations](https://www.cs.princeton.edu/~rajeshr/papers/icml09-ConvolutionalDeepBeliefNetworks.pdf)) and audio classification ([Unsupervised feature learning for audio classification using convolutional deep belief networks](http://www.robotics.stanford.edu/~ang/papers/nips09-AudioConvolutionalDBN.pdf)). In the case of Convolutional Neural Networks (CNN), the data representation is leaned in a supervised fashion with respect to a task such as classification. For a typical CNN to generalize to unseen data in requires very large quantities of data. The amount of data available is often not sufficient to train a CNN. GANs allow features to be learned unsupervised. This reduces that potential for features being overfitted to the training data and in turn means that a classification algorithm trained on the features will generalize on a smaller amount of data. In fact, GANs promote generalization beyond the training data, as will be seen.  \n# GAN \nFor a full review of a GAN: [Generative Adversarial Nets](https://arxiv.org/pdf/1406.2661.pdf) \n![alt text](https://github.com/nmharmon8/StockMarketGAN/blob/master/figures/gan.png)\nThe Generator is trained to generate data that looks like historical price data of the target stocks over a gaussian distribution. The Discriminator is trained to tell the difference between the data from the Generator and the real data. The error from the Discriminator is used to train the Generator to defeat the Discriminator. 
The competition between the Generator and the Discriminator forces the Discriminator to distinguish random variability from real variability.    \n# Approach \n**Data**\n\nHistorical stock prices are likely not very predictive of a stock's future price, but they are free data. Technical indicators are calculated from the historical prices. Not being a trader, I don't know the validity of technical indicators, but if a sufficient number of investors use technical indicators such that they move the market, then historical price data should suffice to predict the direction of the market correctly more than 50% of the time.\n\n**Training**\n\nThe GAN is trained on 96 stocks from the Nasdaq. Each stock is normalized using a 20-day rolling window: (data - mean)/(max - min). The last 356 trading days (about 1.4 years) are held out as a test set. Time series of 20-day periods are constructed and used as input to the GAN. Once the GAN has finished training, the activations of the last convolutional layer are used as the new representation of the data. XGBoost is trained on these features to classify whether the stock will go up or down over some period of time. \n\n**Testing**\n\nThe data that was held out in the training phase is run through the Discriminator portion of the GAN and the activations of the last convolutional layer are extracted. The extracted features are then classified using XGBoost.\n\n**Results** \n\nThe confusion matrices show the results of the model's classifications. A perfect confusion matrix would only have predictions on the main diagonal; each number off the main diagonal is a misclassification.  \n\n\n**Predictions of Up or Down movement over 10 Days**\n\nThe predictions over a 10-day period are quite good. 
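For reference, the 20-day rolling normalization applied to each stock before feature extraction can be sketched as below. This is a simplified pandas example on synthetic data; the repo's version additionally shifts the window because its data files are stored newest-first:\n\n```python\nimport numpy as np\nimport pandas as pd\n\nnp.random.seed(0)\n# Synthetic OHLCV frame standing in for one file from ./stock_data\ndf = pd.DataFrame(np.random.rand(60, 5),\n                  columns=['Open', 'High', 'Low', 'Close', 'Volume'])\n\nwindow = 20\nrolling = df.rolling(window)\n# (data - mean) / (max - min) over the trailing 20-day window\nnormalized = ((df - rolling.mean()) / (rolling.max() - rolling.min())).dropna()\n```\n\nBecause |x - mean| can never exceed max - min within a window, every normalized value lies in [-1, 1]. 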
\n\n![alt text](https://github.com/nmharmon8/StockMarketGAN/blob/master/figures/XGB_GAN_Confusion_Matrix_Up_Or_Down_Over_10_Days_normalize.png)\n\n**Predictions of Up or Down movement over 1 Day**\n\nPredicting over a shorter time interval seems to be harder. The results lose significant accuracy when trying to predict the next-day movement of a stock. \n\n![alt text](https://github.com/nmharmon8/StockMarketGAN/blob/master/figures/XGB_GAN_Confusion_Matrix_Up_Or_Down_Over_1_Days_normalize.png)\n\n**Predictions of a 10% Gain Over 10 Days**\n\nJust knowing that a stock will go up or down is of limited use. Many stocks will go up on a given day, but an investor will want to buy only the stocks that will go up the most, maximizing returns. This time the XGBoost model was trained to predict stocks that would go up by 10% or more over the following 10 days. \n\n![alt text](https://github.com/nmharmon8/StockMarketGAN/blob/master/figures/XGB_GAN_Confusion_Matrix_Up_Or_Down_Over_10_Days_10_percent_normalize.png)\n"
  },
  {
    "path": "cnn.py",
    "content": "import tensorflow as tf\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport os\n\nSEED = 42\ntf.set_random_seed(SEED)\nclass CNN():\n\n    def __init__(self, num_features, num_historical_days, is_train=True):\n      \n        self.X = tf.placeholder(tf.float32, shape=[None, num_historical_days, num_features])\n        X = tf.reshape(self.X, [-1, num_historical_days, 1, num_features])\n        self.Y = tf.placeholder(tf.int32, shape=[None, 2])\n        self.keep_prob = tf.placeholder(tf.float32, shape=[])\n\n        with tf.variable_scope(\"cnn\"):\n            #[filter_height, filter_width, in_channels, out_channels]\n            k1 = tf.Variable(tf.truncated_normal([3, 1, num_features, 16],\n                stddev=0.1,seed=SEED, dtype=tf.float32))\n            b1 = tf.Variable(tf.zeros([16], dtype=tf.float32))\n\n            conv = tf.nn.conv2d(X,k1,strides=[1, 1, 1, 1],padding='SAME')\n            relu = tf.nn.relu(tf.nn.bias_add(conv, b1))\n            if is_train:\n                relu = tf.nn.dropout(relu, keep_prob = self.keep_prob)\n            print(relu)\n\n\n            k2 = tf.Variable(tf.truncated_normal([3, 1, 16, 32],\n                stddev=0.1,seed=SEED, dtype=tf.float32))\n            b2 = tf.Variable(tf.zeros([32], dtype=tf.float32))\n            conv = tf.nn.conv2d(relu, k2,strides=[1, 1, 1, 1],padding='SAME')\n            relu = tf.nn.relu(tf.nn.bias_add(conv, b2))\n            if is_train:\n                relu = tf.nn.dropout(relu, keep_prob = self.keep_prob)\n            print(relu)\n\n\n            k3 = tf.Variable(tf.truncated_normal([3, 1, 32, 64],\n                stddev=0.1,seed=SEED, dtype=tf.float32))\n            b3 = tf.Variable(tf.zeros([64], dtype=tf.float32))\n            conv = tf.nn.conv2d(relu, k3, strides=[1, 1, 1, 1], padding='VALID')\n            relu = tf.nn.relu(tf.nn.bias_add(conv, b3))\n            if is_train:\n                relu = tf.nn.dropout(relu, keep_prob=self.keep_prob)\n            
print(relu)\n\n\n            flattened_convolution_size = int(relu.shape[1]) * int(relu.shape[2]) * int(relu.shape[3])\n            print(flattened_convolution_size)\n            flattened_convolution = features = tf.reshape(relu, [-1, flattened_convolution_size])\n\n            if is_train:\n                flattened_convolution = tf.nn.dropout(flattened_convolution, keep_prob=self.keep_prob)\n\n            #Use the computed size rather than a hard-coded 18*1*64 so other\n            #values of num_historical_days still work\n            W1 = tf.Variable(tf.truncated_normal([flattened_convolution_size, 32]))\n            b4 = tf.Variable(tf.truncated_normal([32]))\n            h1 = tf.nn.relu(tf.matmul(flattened_convolution, W1) + b4)\n\n\n            W2 = tf.Variable(tf.truncated_normal([32, 2]))\n            logits = tf.matmul(h1, W2)\n\n            self.accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(self.Y, 1), tf.argmax(logits, 1)), tf.float32))\n            self.confusion_matrix = tf.confusion_matrix(tf.argmax(self.Y, 1), tf.argmax(logits, 1))\n            tf.summary.scalar('accuracy', self.accuracy)\n            theta_D = [k1, b1, k2, b2, k3, b3, W1, b4, W2]\n\n        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=self.Y, logits=logits))\n        tf.summary.scalar('loss', self.loss)\n\n        self.optimizer = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(self.loss)\n        self.summary = tf.summary.merge_all()"
  },
  {
    "path": "companylist.csv",
    "content": "Symbol, Name, lastsale, netchange,pctchange, share_volume, Nasdaq100_points,\r\nAAPL, Apple Inc, 44.51, -0.08,-0.18, 8648571, -.1,\r\nATVI, Activision Blizzard Inc, 44.51, -0.08,-0.18, 8648571, -.1,\r\nADBE, Adobe Systems Incorporated, 99.1, -0.52,-0.52, 2689240, -.2,\r\nAKAM, Akamai Technologies Inc., 51.02, -0.36,-0.7, 1700915, -.1,\r\nALXN, Alexion Pharmaceuticals Inc., 132.01, 2.31,1.78, 1600959, .4,\r\nGOOG, Alphabet Inc., 767.545, -4.215,-0.55, 1468065, -1.0,\r\nGOOGL, Alphabet Inc., 797.58, -3.65,-0.46, 1467020, .0,\r\nAMZN, Amazon.com Inc., 777.64, 7.95,1.03, 3643647, 3.2,\r\nAAL, American Airlines Group Inc., 35.64, -0.77,-2.11, 5510588, .0,\r\nAMGN, Amgen Inc., 173, 0.36,0.21, 2871628, .2,\r\nADI, Analog Devices Inc., 62.16, -0.49,-0.78, 1896545, -.1,\r\nAMAT, Applied Materials Inc., 30.11, -0.04,-0.13, 10809621, .0,\r\nADSK, Autodesk Inc., 67.63, 0.24,0.36, 2122174, .1,\r\nADP, Automatic Data Processing Inc., 86.9, -0.58,-0.66, 1863520, -.2,\r\nBIDU, Baidu Inc., 185.17, -1.33,-0.71, 1300251, -.3,\r\nBBBY, Bed Bath &amp; Beyond Inc., 43.13, 0.03,0.07, 1967613, .0,\r\nBIIB, Biogen Inc., 304.68, 1.86,0.61, 1369746, .4,\r\nBMRN, BioMarin Pharmaceutical Inc., 97.05, 1.67,1.75, 1337946, .0,\r\nAVGO, Broadcom Limited, 171.77, -0.95,-0.55, 3172354, -.2,\r\nCA, CA Inc., 32.07, -0.4,-1.23, 2289365, -.2,\r\nCELG, Celgene Corporation, 108.78, 1.43,1.33, 4275630, .5,\r\nCERN, Cerner Corporation, 62.1, -0.67,-1.07, 2099221, -.2,\r\nCHTR, Charter Communications Inc., 265.13, -1.14,-0.43, 1437862, -.1,\r\nCHKP, Check Point Software Technologies Ltd., 75.06, -0.45,-0.6, 971010, -.1,\r\nCSCO, Cisco Systems Inc., 30.92, -0.39,-1.25, 26516368, -1.7,\r\nCTXS, Citrix Systems Inc., 83.19, -0.67,-0.8, 1042641, -.1,\r\nCTSH, Cognizant Technology Solutions Corporation, 53.44, -0.66,-1.22, 5936067, -.4,\r\nCMCSA, Comcast Corporation, 65.995, -0.245,-0.37, 10512643, -.5,\r\nCOST, Costco Wholesale Corporation, 152.5559, -0.1141,-0.07, 2091886, .0,\r\nCTRP, Ctrip.com 
International Ltd., 43.73, -0.02,-0.05, 2875035, .0,\r\nXRAY, DENTSPLY SIRONA Inc., 59.81, -0.22,-0.37, 960442, .0,\r\nDISCA, Discovery Communications Inc., 24.36, -0.05,-0.2, 4304349, .0,\r\nDISCK, Discovery Communications Inc., 23.61, -0.09,-0.38, 1109834, .0,\r\nDISH, DISH Network Corporation, 52.15, 0.1,0.19, 2690270, .0,\r\nDLTR, Dollar Tree Inc., 81.195, -0.795,-0.97, 2301440, -.1,\r\nEBAY, eBay Inc., 31.815, -0.165,-0.52, 7019272, -.2,\r\nEA, Electronic Arts Inc., 82.96, -0.41,-0.49, 2609316, .0,\r\nEXPE, Expedia Inc., 107.95, -3.95,-3.53, 5108596, -.4,\r\nESRX, Express Scripts Holding Company, 70.205, -0.235,-0.33, 3617077, -.2,\r\nFB, Facebook Inc., 128.93, 0.58,0.45, 13411515, 1.0,\r\nFAST, Fastenal Company, 40.095, -0.715,-1.75, 2308194, -.2,\r\nFISV, Fiserv Inc., 99.2, -0.73,-0.73, 1138941, -.2,\r\nGILD, Gilead Sciences Inc., 78.93, 0.09,0.11, 7487560, .1,\r\nHSIC, Henry Schein Inc., 164.26, 1.49,0.92, 563392, .1,\r\nILMN, Illumina Inc., 172.93, -2.01,-1.15, 811231, -.2,\r\nINCY, Incyte Corporation, 82.45, 2.09,2.6, 987770, .0,\r\nINTU, Intuit Inc., 108.97, -0.79,-0.72, 1512370, -.2,\r\nISRG, Intuitive Surgical Inc., 686.68, 2.49,0.36, 222109, .1,\r\nJD, JD.com Inc., 26.2475, 0.1175,0.45, 8541023, .0,\r\nLRCX, Lam Research Corporation, 93.2, -0.16,-0.17, 1420773, .0,\r\nLBTYA, Liberty Global plc, 32.59, -0.055,-0.17, 1683861, .0,\r\nLBTYK, Liberty Global plc, 31.7, -0.15,-0.47, 1532985, .0,\r\nLVNTA, Liberty Interactive Corporation, 38.6, -1.35,-3.38, 1624743, .0,\r\nQVCA, Liberty Interactive Corporation, 18.46, -0.51,-2.69, 3919115, .0,\r\nLLTC, Linear Technology Corporation, 58.47, -0.08,-0.14, 1751049, .0,\r\nMAR, Marriott International, 68.7, -0.59,-0.85, 1989898, -.2,\r\nMXIM, Maxim Integrated Products Inc., 38.86, -0.13,-0.33, 3465172, .0,\r\nMCHP, Microchip Technology Incorporated, 60.42, -0.02,-0.03, 1652175, .0,\r\nMU, Micron Technology Inc., 17.49, 0.04,0.23, 31528528, .0,\r\nMSFT, Microsoft Corporation, 57.285, 0.095,0.17, 31426513, 
.7,\r\nMDLZ, Mondelez International Inc., 43, -0.02,-0.05, 7084275, .0,\r\nMNST, Monster Beverage Corporation, 146.41, 0.06,0.04, 1041770, .0,\r\nMYL, Mylan N.V., 41.78, 0.29,0.7, 4696846, .1,\r\nNTAP, NetApp Inc., 34.85, -0.2,-0.57, 2003381, -.1,\r\nNTES, NetEase Inc., 238.07, 0.32,0.13, 943343, .0,\r\nNFLX, Netflix Inc., 99.29, 1.95,2, 6765848, .1,\r\nNCLH, Norwegian Cruise Line Holdings Ltd., 35.87, -0.37,-1.02, 2171219, .0,\r\nNVDA, NVIDIA Corporation, 62.96, 0.27,0.43, 10420796, .1,\r\nNXPI, NXP Semiconductors N.V., 83.83, -1.71,-2, 2553456, -.4,\r\nORLY, O'Reilly Automotive Inc., 272.83, -3.21,-1.16, 743599, -.3,\r\nPCAR, PACCAR Inc., 56.655, -0.275,-0.48, 2202565, -.1,\r\nPAYX, Paychex Inc., 58.375, -0.635,-1.08, 3061211, -.2,\r\nPYPL, PayPal Holdings Inc., 40.72, -0.11,-0.27, 7734598, .0,\r\nQCOM, QUALCOMM Incorporated, 63.04, 0.5,0.8, 8801907, .7,\r\nREGN, Regeneron Pharmaceuticals Inc., 408.86, 6.11,1.52, 901266, .5,\r\nROST, Ross Stores Inc., 61.94, -0.01,-0.02, 1661730, .0,\r\nSBAC, SBA Communications Corporation, 108.68, -0.22,-0.2, 616118, .0,\r\nSTX, Seagate Technology PLC, 36.45, 0,0, 5600814, .0,\r\nSIRI, Sirius XM Holdings Inc., 4.125, -0.04,-0.96, 40279977, -.2,\r\nSWKS, Skyworks Solutions Inc., 76.22, -0.8,-1.04, 4471051, .0,\r\nSBUX, Starbucks Corporation, 53.78, -0.33,-0.61, 7818288, -.2,\r\nSRCL, Stericycle Inc., 81.21, -0.55,-0.67, 1222820, .0,\r\nSYMC, Symantec Corporation, 25.195, 0.325,1.31, 10174561, .2,\r\nTMUS, T-Mobile US Inc., 46.49, -0.69,-1.46, 3487789, .0,\r\nTSLA, Tesla Motors Inc., 205.415, 4.995,2.49, 2402292, .5,\r\nKHC, The Kraft Heinz Company, 89.014, -0.156,-0.17, 3172798, .0,\r\nPCLN, The Priceline Group Inc., 1460.55, 3.33,0.23, 609486, .2,\r\nTSCO, Tractor Supply Company, 68.29, -0.77,-1.11, 1210939, -.1,\r\nTRIP, TripAdvisor Inc., 61.2, -0.62,-1, 1832039, -.1,\r\nFOX, Twenty-First Century Fox Inc., 24.3, -0.02,-0.08, 4980247, .0,\r\nFOXA, Twenty-First Century Fox Inc., 23.915, 0.075,0.31, 20513508, .1,\r\nULTA, Ulta 
Salon Cosmetics & Fragrance Inc., 234, 1.01,0.43, 723369, .0,\r\nVRSK, Verisk Analytics Inc., 80.85, -0.64,-0.79, 710627, -.1,\r\nVRTX, Vertex Pharmaceuticals Incorporated, 92.88, -0.22,-0.24, 1609405, .0,\r\nVIAB, Viacom Inc., 37.05, -0.22,-0.59, 3122390, -.1,\r\nVOD, Vodafone Group Plc, 29.04, -0.53,-1.79, 8078232, -.2,\r\nWBA, Walgreens Boots Alliance Inc., 81.39, -0.01,-0.01, 5454174, .0,\r\nWDC, Western Digital Corporation, 54.82, 1.54,2.89, 7008776, .3,\r\nWFM, Whole Foods Market Inc., 28.36, -0.16,-0.56, 3654014, -.1,\r\nXLNX, Xilinx Inc., 53.485, -0.085,-0.16, 1467151, .0,\r\nYHOO, Yahoo! Inc., 43.78, -0.21,-0.48, 12061751, -.2,\r\n"
  },
  {
    "path": "gan.py",
    "content": "import tensorflow as tf\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport os\n\nSEED = 42\ntf.set_random_seed(SEED)\n\nclass GAN():\n\n    def sample_Z(self, batch_size, n):\n        return np.random.uniform(-1., 1., size=(batch_size, n))\n\n    def __init__(self, num_features, num_historical_days, generator_input_size=200, is_train=True):\n        def get_batch_norm_with_global_normalization_vars(size):\n            v = tf.Variable(tf.ones([size]), dtype=tf.float32)\n            m = tf.Variable(tf.ones([size]), dtype=tf.float32)\n            beta = tf.Variable(tf.ones([size]), dtype=tf.float32)\n            gamma = tf.Variable(tf.ones([size]), dtype=tf.float32)\n            return v, m, beta, gamma\n\n        self.X = tf.placeholder(tf.float32, shape=[None, num_historical_days, num_features])\n        X = tf.reshape(self.X, [-1, num_historical_days, 1, num_features])\n        self.Z = tf.placeholder(tf.float32, shape=[None, generator_input_size])\n\n        generator_output_size = num_features*num_historical_days\n        with tf.variable_scope(\"generator\"):\n            W1 = tf.Variable(tf.truncated_normal([generator_input_size, generator_output_size*10]))\n            b1 = tf.Variable(tf.truncated_normal([generator_output_size*10]))\n\n            h1 = tf.nn.sigmoid(tf.matmul(self.Z, W1) + b1)\n\n            # v1, m1, beta1, gamma1 = get_batch_norm_with_global_normalization_vars(generator_output_size*10)\n            # h1 = tf.nn.batch_norm_with_global_normalization(h1, v1, m1,\n            #         beta1, gamma1, variance_epsilon=0.000001, scale_after_normalization=False)\n\n            W2 = tf.Variable(tf.truncated_normal([generator_output_size*10, generator_output_size*5]))\n            b2 = tf.Variable(tf.truncated_normal([generator_output_size*5]))\n\n            h2 = tf.nn.sigmoid(tf.matmul(h1, W2) + b2)\n\n            # v2, m2, beta2, gamma2 = get_batch_norm_with_global_normalization_vars(generator_output_size*5)\n            
# h2 = tf.nn.batch_norm_with_global_normalization(h2, v2, m2,\n            #         beta2, gamma2, variance_epsilon=0.000001, scale_after_normalization=False)\n\n\n            W3 = tf.Variable(tf.truncated_normal([generator_output_size*5, generator_output_size]))\n            b3 = tf.Variable(tf.truncated_normal([generator_output_size]))\n\n            g_log_prob = tf.matmul(h2, W3) + b3\n            g_log_prob = tf.reshape(g_log_prob, [-1, num_historical_days, 1, num_features])\n            self.gen_data = tf.reshape(g_log_prob, [-1, num_historical_days, num_features])\n            #g_log_prob = g_log_prob / tf.reshape(tf.reduce_max(g_log_prob, axis=1), [-1, 1, num_features, 1])\n            #g_prob = tf.nn.sigmoid(g_log_prob)\n\n            theta_G = [W1, b1, W2, b2, W3, b3]\n\n\n\n        with tf.variable_scope(\"discriminator\"):\n            #[filter_height, filter_width, in_channels, out_channels]\n            k1 = tf.Variable(tf.truncated_normal([3, 1, num_features, 32],\n                stddev=0.1,seed=SEED, dtype=tf.float32))\n            b1 = tf.Variable(tf.zeros([32], dtype=tf.float32))\n\n            v1, m1, beta1, gamma1 = get_batch_norm_with_global_normalization_vars(32)\n\n            k2 = tf.Variable(tf.truncated_normal([3, 1, 32, 64],\n                stddev=0.1,seed=SEED, dtype=tf.float32))\n            b2 = tf.Variable(tf.zeros([64], dtype=tf.float32))\n\n            v2, m2, beta2, gamma2 = get_batch_norm_with_global_normalization_vars(64)\n\n            k3 = tf.Variable(tf.truncated_normal([3, 1, 64, 128],\n                stddev=0.1,seed=SEED, dtype=tf.float32))\n            b3 = tf.Variable(tf.zeros([128], dtype=tf.float32))\n\n            v3, m3, beta3, gamma3 = get_batch_norm_with_global_normalization_vars(128)\n\n            W1 = tf.Variable(tf.truncated_normal([18*1*128, 128]))\n            b4 = tf.Variable(tf.truncated_normal([128]))\n\n            v4, m4, beta4, gamma4 = get_batch_norm_with_global_normalization_vars(128)\n\n            
W2 = tf.Variable(tf.truncated_normal([128, 1]))\n\n            theta_D = [k1, b1, k2, b2, k3, b3, W1, b4, W2]\n\n        def discriminator(X):\n            conv = tf.nn.conv2d(X,k1,strides=[1, 1, 1, 1],padding='SAME')\n            relu = tf.nn.relu(tf.nn.bias_add(conv, b1))\n            pool = relu\n            # pool = tf.nn.avg_pool(relu, ksize=[1, 2, 1, 1], strides=[1, 2, 1, 1], padding='SAME')\n            if is_train:\n                pool = tf.nn.dropout(pool, keep_prob = 0.8)\n            # pool = tf.nn.batch_norm_with_global_normalization(pool, v1, m1,\n            #         beta1, gamma1, variance_epsilon=0.000001, scale_after_normalization=False)\n            print(pool)\n\n            conv = tf.nn.conv2d(pool, k2,strides=[1, 1, 1, 1],padding='SAME')\n            relu = tf.nn.relu(tf.nn.bias_add(conv, b2))\n            pool = relu\n            #pool = tf.nn.avg_pool(relu, ksize=[1, 2, 1, 1], strides=[1, 2, 1, 1], padding='SAME')\n            if is_train:\n                pool = tf.nn.dropout(pool, keep_prob = 0.8)\n            # pool = tf.nn.batch_norm_with_global_normalization(pool, v2, m2,\n            #         beta2, gamma2, variance_epsilon=0.000001, scale_after_normalization=False)\n            print(pool)\n\n            conv = tf.nn.conv2d(pool, k3, strides=[1, 1, 1, 1], padding='VALID')\n            relu = tf.nn.relu(tf.nn.bias_add(conv, b3))\n            if is_train:\n                relu = tf.nn.dropout(relu, keep_prob=0.8)\n            # relu = tf.nn.batch_norm_with_global_normalization(relu, v3, m3,\n            #         beta3, gamma3, variance_epsilon=0.000001, scale_after_normalization=False)\n            print(relu)\n\n\n            flattened_convolution_size = int(relu.shape[1]) * int(relu.shape[2]) * int(relu.shape[3])\n            print(flattened_convolution_size)\n            flattened_convolution = features = tf.reshape(relu, [-1, flattened_convolution_size])\n\n            if is_train:\n                flattened_convolution =  
tf.nn.dropout(flattened_convolution, keep_prob=0.8)\n\n            h1 = tf.nn.relu(tf.matmul(flattened_convolution, W1) + b4)\n\n            # h1 = tf.nn.batch_norm_with_global_normalization(h1, v4, m4,\n            #         beta4, gamma4, variance_epsilon=0.000001, scale_after_normalization=False)\n\n            D_logit = tf.matmul(h1, W2)\n            D_prob = tf.nn.sigmoid(D_logit)\n            return D_prob, D_logit, features\n\n        D_real, D_logit_real, self.features = discriminator(X)\n        D_fake, D_logit_fake, _ = discriminator(g_log_prob)\n\n\n        D_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=D_logit_real, labels=tf.ones_like(D_logit_real)))\n        D_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=D_logit_fake, labels=tf.zeros_like(D_logit_fake)))\n        self.D_l2_loss = (0.0001 * tf.add_n([tf.nn.l2_loss(t) for t in theta_D]) / len(theta_D))\n        self.D_loss = D_loss_real + D_loss_fake + self.D_l2_loss\n        self.G_l2_loss = (0.00001 * tf.add_n([tf.nn.l2_loss(t) for t in theta_G]) / len(theta_G))\n        self.G_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=D_logit_fake, labels=tf.ones_like(D_logit_fake))) + self.G_l2_loss\n\n\n        self.D_solver = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(self.D_loss, var_list=theta_D)\n        self.G_solver = tf.train.AdamOptimizer(learning_rate=0.000055).minimize(self.G_loss, var_list=theta_G)\n"
  },
  {
    "path": "get_predictions.py",
    "content": "from get_stock_data import download_all\n\n#Download Stocks\ndownload_all()\n\n\n\nimport os\nimport pandas as pd\nfrom gan import GAN\nimport random\nimport tensorflow as tf\nimport xgboost as xgb\nfrom sklearn.externals import joblib\n\n\nos.environ[\"CUDA_VISIBLE_DEVICES\"]=\"\"\n\nclass Predict:\n\n    def __init__(self, num_historical_days=20, days=10, pct_change=0, gan_model='./deployed_models/gan', cnn_modle='./deployed_models/cnn', xgb_model='./deployed_models/xgb'):\n        self.data = []\n        self.num_historical_days = num_historical_days\n        self.gan_model = gan_model\n        self.cnn_modle = cnn_modle\n        self.xgb_model = xgb_model\n        # assert os.path.exists(gan_model)\n        # assert os.path.exists(cnn_modle)\n        # assert os.path.exists(xgb_model)\n\n        files = [os.path.join('./stock_data', f) for f in os.listdir('./stock_data')]\n        for file in files:\n            print(file)\n            df = pd.read_csv(file, index_col='Date', parse_dates=True)\n            df = df[['Open','High','Low','Close','Volume']]\n            df = ((df -\n            df.rolling(num_historical_days).mean().shift(-num_historical_days))\n            /(df.rolling(num_historical_days).max().shift(-num_historical_days)\n            -df.rolling(num_historical_days).min().shift(-num_historical_days)))\n            df = df.dropna()\n            self.data.append((file.split('/')[-1], df.index[0], df[200:200+num_historical_days].values))\n\n\n    def gan_predict(self):\n    \ttf.reset_default_graph()\n        gan = GAN(num_features=5, num_historical_days=self.num_historical_days,\n                        generator_input_size=200, is_train=False)\n        with tf.Session() as sess:\n            sess.run(tf.global_variables_initializer())\n            saver = tf.train.Saver()\n            saver.restore(sess, self.gan_model)\n            clf = joblib.load(self.xgb_model)\n            for sym, date, data in self.data:\n\t            
features = sess.run(gan.features, feed_dict={gan.X:[data]})\n\t            features = xgb.DMatrix(features)\n\t            print('{} {} {}'.format(str(date).split(' ')[0], sym, clf.predict(features)[0][1] > 0.5))\n\t            \n\n\nif __name__ == '__main__':\n\tp = Predict()\n\tp.gan_predict()\n"
  },
  {
    "path": "get_stock_data.py",
    "content": "import os\nimport datetime\nimport urllib2\nfrom dateutil.parser import parse\nimport threading\n\nassert 'QUANDL_KEY' in os.environ\nquandl_api_key = os.environ['QUANDL_KEY']\n\nclass nasdaq():\n\tdef __init__(self):\n\t\tself.output = './stock_data'\n\t\tself.company_list = './companylist.csv'\n\n\tdef build_url(self, symbol):\n\t\turl = 'https://www.quandl.com/api/v3/datasets/WIKI/{}.csv?api_key={}'.format(symbol, quandl_api_key)\n\t\treturn url\n\n\tdef symbols(self):\n\t\tsymbols = []\n\t\twith open(self.company_list, 'r') as f:\n\t\t\tnext(f)\n\t\t\tfor line in f:\n\t\t\t\tsymbols.append(line.split(',')[0].strip())\n\t\treturn symbols\n\ndef download(i, symbol, url, output):\n\tprint('Downloading {} {}'.format(symbol, i))\n\ttry:\n\t\tresponse = urllib2.urlopen(url)\n\t\tquotes = response.read()\n\t\tlines = quotes.strip().split('\\n')\n\t\twith open(os.path.join(output, symbol), 'w') as f:\n\t\t\tfor i, line in enumerate(lines):\n\t\t\t\tf.write(line + '\\n')\n\texcept Exception as e:\n\t\tprint('Failed to download {}'.format(symbol))\n\t\tprint(e)\n\ndef download_all():\n\tif not os.path.exists('./stock_data'):\n\t    os.makedirs('./stock_data')\n\n\tnas = nasdaq()\n\tfor i, symbol in enumerate(nas.symbols()):\n\t\turl = nas.build_url(symbol)\n\t\tdownload(i, symbol, url, nas.output)\n\nif __name__ == '__main__':\n\tdownload_all()"
  },
  {
    "path": "plot_confusion_matrix.py",
    "content": "import itertools\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.metrics import confusion_matrix\n\n\ndef plot_confusion_matrix(cm, classes,\n                          normalize=False,\n                          title='Confusion matrix',\n                          cmap=plt.cm.Blues):\n    \"\"\"\n    This function prints and plots the confusion matrix.\n    Normalization can be applied by setting `normalize=True`.\n    \"\"\"\n    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n    plt.title(title)\n    plt.colorbar()\n    tick_marks = np.arange(len(classes))\n    plt.xticks(tick_marks, classes, rotation=45)\n    plt.yticks(tick_marks, classes)\n\n    if normalize:\n        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n        print(\"Normalized confusion matrix\")\n    else:\n        print('Confusion matrix, without normalization')\n\n    print(cm)\n\n    thresh = cm.max() / 2.\n    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n        plt.text(j, i, cm[i, j],\n                 horizontalalignment=\"center\",\n                 color=\"black\")\n\n    plt.tight_layout()\n    plt.ylabel('True label')\n    plt.xlabel('Predicted label')\n    plt.show()\n\n"
  },
  {
    "path": "train_cnn.py",
    "content": "import os\nimport pandas as pd\nfrom cnn import CNN\nimport random\nimport tensorflow as tf\nimport xgboost as xgb\nfrom sklearn.externals import joblib\nfrom sklearn.metrics import confusion_matrix\nfrom plot_confusion_matrix import plot_confusion_matrix\n\nrandom.seed(42)\n\nclass TrainCNN:\n\n    def __init__(self, num_historical_days, days=10, pct_change=0):\n        self.data = []\n        self.labels = []\n        self.test_data = []\n        self.test_labels = []\n        self.cnn = CNN(num_features=5, num_historical_days=num_historical_days, is_train=False)\n        files = [os.path.join('./stock_data', f) for f in os.listdir('./stock_data')]\n        for file in files:\n            print(file)\n            df = pd.read_csv(file, index_col='Date', parse_dates=True)\n            df = df[['Open','High','Low','Close','Volume']]\n            labels = df.Close.pct_change(days).map(lambda x: [int(x > pct_change/100.0), int(x <= pct_change/100.0)])\n            df = ((df -\n            df.rolling(num_historical_days).mean().shift(-num_historical_days))\n            /(df.rolling(num_historical_days).max().shift(-num_historical_days)\n            -df.rolling(num_historical_days).min().shift(-num_historical_days)))\n            df['labels'] = labels\n            df = df.dropna()\n            test_df = df[:365]\n            df = df[400:]\n            data = df[['Open', 'High', 'Low', 'Close', 'Volume']].values\n            labels = df['labels'].values\n            for i in range(num_historical_days, len(df), num_historical_days):\n                self.data.append(data[i-num_historical_days:i])\n                self.labels.append(labels[i-1])\n            data = test_df[['Open', 'High', 'Low', 'Close', 'Volume']].values\n            labels = test_df['labels'].values\n            for i in range(num_historical_days, len(test_df), 1):\n                self.test_data.append(data[i-num_historical_days:i])\n                
self.test_labels.append(labels[i-1])\n\n\n\n    def random_batch(self, batch_size=128):\n        batch = []\n        labels = []\n        #zip returns an iterator in Python 3, so materialize it for random.choice\n        data = list(zip(self.data, self.labels))\n        i = 0\n        while True:\n            i += 1\n            #Alternate between the two classes so every batch is balanced\n            while True:\n                d = random.choice(data)\n                if d[1][0] == int(i % 2):\n                    break\n            batch.append(d[0])\n            labels.append(d[1])\n            if len(batch) == batch_size:\n                yield batch, labels\n                batch = []\n                labels = []\n\n    def train(self, print_steps=100, display_steps=100, save_steps=1000, batch_size=128, keep_prob=0.6):\n        if not os.path.exists('./cnn_models'):\n            os.makedirs('./cnn_models')\n        if not os.path.exists('./logs'):\n            os.makedirs('./logs')\n        if os.path.exists('./logs/train'):\n            for file in [os.path.join('./logs/train/', f) for f in os.listdir('./logs/train/')]:\n                os.remove(file)\n        if os.path.exists('./logs/test'):\n            for file in [os.path.join('./logs/test/', f) for f in os.listdir('./logs/test')]:\n                os.remove(file)\n\n        sess = tf.Session()\n        loss = 0\n        accuracy = 0\n        saver = tf.train.Saver()\n        train_writer = tf.summary.FileWriter('./logs/train')\n        test_writer = tf.summary.FileWriter('./logs/test')\n        sess.run(tf.global_variables_initializer())\n        #Resume from the latest checkpoint if one exists\n        if os.path.exists('./cnn_models/checkpoint'):\n            with open('./cnn_models/checkpoint', 'r') as f:\n                model_name = next(f).split('\"')[1]\n            saver.restore(sess, \"./cnn_models/{}\".format(model_name))\n        for i, [X, y] in enumerate(self.random_batch(batch_size)):\n            _, loss_curr, accuracy_curr = sess.run([self.cnn.optimizer, self.cnn.loss, self.cnn.accuracy], feed_dict=\n                    {self.cnn.X:X, self.cnn.Y:y, self.cnn.keep_prob:keep_prob})\n            loss += loss_curr\n            accuracy += accuracy_curr\n            if (i+1) % print_steps == 0:\n                print('Step={} loss={}, accuracy={}'.format(i, loss/print_steps, accuracy/print_steps))\n                loss = 0\n                accuracy = 0\n                test_loss, test_accuracy, confusion_matrix = sess.run([self.cnn.loss, self.cnn.accuracy, self.cnn.confusion_matrix], feed_dict={self.cnn.X:self.test_data, self.cnn.Y:self.test_labels, self.cnn.keep_prob:1})\n                print(\"Test loss = {}, Test accuracy = {}\".format(test_loss, test_accuracy))\n                print(confusion_matrix)\n            if (i+1) % save_steps == 0:\n                saver.save(sess, './cnn_models/cnn.ckpt', i)\n\n            if (i+1) % display_steps == 0:\n                summary = sess.run(self.cnn.summary, feed_dict=\n                    {self.cnn.X:X, self.cnn.Y:y, self.cnn.keep_prob:keep_prob})\n                train_writer.add_summary(summary, i)\n                summary = sess.run(self.cnn.summary, feed_dict={\n                    self.cnn.X:self.test_data, self.cnn.Y:self.test_labels, self.cnn.keep_prob:1})\n                test_writer.add_summary(summary, i)\n\n\nif __name__ == '__main__':\n    cnn = TrainCNN(num_historical_days=20, days=10, pct_change=10)\n    cnn.train()\n"
  },
  {
    "path": "train_gan.py",
"content": "import os\nimport pandas as pd\nfrom gan import GAN\nimport random\nimport tensorflow as tf\n\nrandom.seed(42)\n\n\nclass TrainGan:\n\n    def __init__(self, num_historical_days, batch_size=128):\n        self.batch_size = batch_size\n        self.data = []\n        files = [os.path.join('./stock_data', f) for f in os.listdir('./stock_data')]\n        for file in files:\n            print(file)\n            #Read in file -- note that parse_dates will be needed later\n            df = pd.read_csv(file, index_col='Date', parse_dates=True)\n            df = df[['Open','High','Low','Close','Volume']]\n            # #Create new index with missing days\n            # idx = pd.date_range(df.index[-1], df.index[0])\n            # #Reindex and fill the missing day with the value from the day before\n            # df = df.reindex(idx, method='bfill').sort_index(ascending=False)\n            #Normalize using a rolling window of size num_historical_days\n            df = ((df -\n            df.rolling(num_historical_days).mean().shift(-num_historical_days))\n            /(df.rolling(num_historical_days).max().shift(-num_historical_days)\n            -df.rolling(num_historical_days).min().shift(-num_historical_days)))\n            #Drop the last days that the rolling window leaves without data\n            df = df.dropna()\n            #Hold out the last year of trading for testing\n            #Padding to keep labels from bleeding\n            df = df[400:]\n            #This may not create good samples if num_historical_days is a\n            #multiple of 7\n            for i in range(num_historical_days, len(df), num_historical_days):\n                self.data.append(df.values[i-num_historical_days:i])\n\n        self.gan = GAN(num_features=5, num_historical_days=num_historical_days,\n                        generator_input_size=200)\n\n    def random_batch(self, batch_size=128):\n        batch = []\n        while True:\n            batch.append(random.choice(self.data))\n            if len(batch) == batch_size:\n                yield batch\n                batch = []\n\n    def train(self, print_steps=100, display_data=100, save_steps=1000):\n        if not os.path.exists('./models'):\n            os.makedirs('./models')\n        sess = tf.Session()\n        G_loss = 0\n        D_loss = 0\n        G_l2_loss = 0\n        D_l2_loss = 0\n        sess.run(tf.global_variables_initializer())\n        saver = tf.train.Saver()\n        #Resume from the latest checkpoint if one exists\n        if os.path.exists('./models/checkpoint'):\n            with open('./models/checkpoint', 'r') as f:\n                model_name = next(f).split('\"')[1]\n            saver.restore(sess, \"./models/{}\".format(model_name))\n        for i, X in enumerate(self.random_batch(self.batch_size)):\n            #Train the Discriminator and the Generator on every step\n            _, D_loss_curr, D_l2_loss_curr = sess.run([self.gan.D_solver, self.gan.D_loss, self.gan.D_l2_loss], feed_dict=\n                    {self.gan.X:X, self.gan.Z:self.gan.sample_Z(self.batch_size, 200)})\n            D_loss += D_loss_curr\n            D_l2_loss += D_l2_loss_curr\n            _, G_loss_curr, G_l2_loss_curr = sess.run([self.gan.G_solver, self.gan.G_loss, self.gan.G_l2_loss],\n                    feed_dict={self.gan.Z:self.gan.sample_Z(self.batch_size, 200)})\n            G_loss += G_loss_curr\n            G_l2_loss += G_l2_loss_curr\n            if (i+1) % print_steps == 0:\n                print('Step={} D_loss={}, G_loss={}'.format(i, D_loss/print_steps - D_l2_loss/print_steps, G_loss/print_steps - G_l2_loss/print_steps))\n                #print('D_l2_loss = {} G_l2_loss={}'.format(D_l2_loss/print_steps, G_l2_loss/print_steps))\n                G_loss = 0\n                D_loss = 0\n                G_l2_loss = 0\n                D_l2_loss = 0\n            if (i+1) % save_steps == 0:\n                saver.save(sess, './models/gan.ckpt', i)\n            # if (i+1) % display_data == 0:\n            #     print('Generated Data')\n            #     print(sess.run(self.gan.gen_data, feed_dict={self.gan.Z:self.gan.sample_Z(1, 200)}))\n            #     print('Real Data')\n            #     print(X[0])\n\n\nif __name__ == '__main__':\n    gan = TrainGan(20, 128)\n    gan.train()\n"
  },
  {
    "path": "train_xgb_boost.py",
"content": "import os\nimport pandas as pd\nfrom gan import GAN\nimport random\nimport tensorflow as tf\nimport xgboost as xgb\n#sklearn.externals.joblib was removed in newer scikit-learn versions\nimport joblib\nfrom sklearn.metrics import confusion_matrix\nfrom plot_confusion_matrix import plot_confusion_matrix\n\nos.environ[\"CUDA_VISIBLE_DEVICES\"]=\"\"\n\nclass TrainXGBBoost:\n\n    def __init__(self, num_historical_days, days=10, pct_change=0):\n        self.data = []\n        self.labels = []\n        self.test_data = []\n        self.test_labels = []\n        assert os.path.exists('./models/checkpoint'), 'A trained GAN checkpoint is required'\n        gan = GAN(num_features=5, num_historical_days=num_historical_days,\n                        generator_input_size=200, is_train=False)\n        with tf.Session() as sess:\n            sess.run(tf.global_variables_initializer())\n            saver = tf.train.Saver()\n            with open('./models/checkpoint', 'r') as f:\n                model_name = next(f).split('\"')[1]\n            saver.restore(sess, \"./models/{}\".format(model_name))\n            files = [os.path.join('./stock_data', f) for f in os.listdir('./stock_data')]\n            for file in files:\n                print(file)\n                #Read in file -- note that parse_dates will be needed later\n                df = pd.read_csv(file, index_col='Date', parse_dates=True)\n                df = df[['Open','High','Low','Close','Volume']]\n                # #Create new index with missing days\n                # idx = pd.date_range(df.index[-1], df.index[0])\n                # #Reindex and fill the missing day with the value from the day before\n                # df = df.reindex(idx, method='bfill').sort_index(ascending=False)\n                #Label each day by whether the close changes more than pct_change percent over a days-long horizon\n                labels = df.Close.pct_change(days).map(lambda x: int(x > pct_change/100.0))\n                #Normalize using a rolling window of size num_historical_days\n                df = ((df -\n                df.rolling(num_historical_days).mean().shift(-num_historical_days))\n                /(df.rolling(num_historical_days).max().shift(-num_historical_days)\n                -df.rolling(num_historical_days).min().shift(-num_historical_days)))\n                df['labels'] = labels\n                #Drop the last days that the rolling window leaves without data\n                df = df.dropna()\n                #Hold out the last year of trading for testing\n                test_df = df[:365]\n                #Padding to keep labels from bleeding\n                df = df[400:]\n                #This may not create good samples if num_historical_days is a\n                #multiple of 7\n                data = df[['Open', 'High', 'Low', 'Close', 'Volume']].values\n                labels = df['labels'].values\n                for i in range(num_historical_days, len(df), num_historical_days):\n                    features = sess.run(gan.features, feed_dict={gan.X:[data[i-num_historical_days:i]]})\n                    self.data.append(features[0])\n                    self.labels.append(labels[i-1])\n                data = test_df[['Open', 'High', 'Low', 'Close', 'Volume']].values\n                labels = test_df['labels'].values\n                for i in range(num_historical_days, len(test_df), 1):\n                    features = sess.run(gan.features, feed_dict={gan.X:[data[i-num_historical_days:i]]})\n                    self.test_data.append(features[0])\n                    self.test_labels.append(labels[i-1])\n\n    def train(self):\n        params = {}\n        params['objective'] = 'multi:softprob'\n        params['eta'] = 0.01\n        params['num_class'] = 2\n        params['max_depth'] = 20\n        params['subsample'] = 0.05\n        params['colsample_bytree'] = 0.05\n        params['eval_metric'] = 'mlogloss'\n        #params['scale_pos_weight'] = 10\n        #params['silent'] = True\n        #params['gpu_id'] = 0\n        #params['max_bin'] = 16\n        #params['tree_method'] = 'gpu_hist'\n\n        train = xgb.DMatrix(self.data, self.labels)\n        test = xgb.DMatrix(self.test_data, self.test_labels)\n\n        watchlist = [(train, 'train'), (test, 'test')]\n        clf = xgb.train(params, train, 1000, evals=watchlist, early_stopping_rounds=100)\n        joblib.dump(clf, 'models/clf.pkl')\n        #map is lazy in Python 3, so materialize the thresholded predictions into a list\n        cm = confusion_matrix(self.test_labels, [int(x[1] > .5) for x in clf.predict(test)])\n        print(cm)\n        plot_confusion_matrix(cm, ['Down', 'Up'], normalize=True, title=\"Confusion Matrix\")\n\n\nif __name__ == '__main__':\n    boost_model = TrainXGBBoost(num_historical_days=20, days=10, pct_change=10)\n    boost_model.train()\n"
  }
]