[
  {
    "path": ".gitignore",
    "content": ".swp\n"
  },
  {
    "path": "LICENSE",
    "content": "Copyright (c) 2016, mp2893\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above copyright notice, this\n  list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above copyright notice,\n  this list of conditions and the following disclaimer in the documentation\n  and/or other materials provided with the distribution.\n\n* Neither the name of Med2Vec nor the names of its\n  contributors may be used to endorse or promote products derived from\n  this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\nFOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\nCAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\nOR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\nOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n"
  },
  {
    "path": "README.md",
    "content": "Med2Vec\n=========================================\n\nMed2Vec is a multi-layer representation learning tool for learning code representations and visit representations from EHR datasets.\n\n[![Med2Vec Coordinate-wise Interpretation Demo](http://www.cc.gatech.edu/~echoi48/images/med2vec_interpret.png)](https://youtu.be/UR_f2rmMJkk?t=2m34s \"Med2Vec Coordinate-wise Interpretation Demo - Click to Watch!\")\nMed2Vec embeddings not only help improve predictive performance of healthcare applications, but also enable the interpretation of the learned code representations in a coodinate-wise manner. You can see that these six coordinates (chosen by their strong correlation with patient severity level) of the code representation space demonstrate medically coherent groups of symptoms (diagnoses, medications, and procedures). \n\n#### Relevant Publications\n\nMed2Vec implements an algorithm introduced in the following [paper](http://www.kdd.org/kdd2016/subtopic/view/multi-layer-representation-learning-for-medical-concepts):\n\n    Multi-layer Representation Learning for Medical Concepts\n\tEdward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, \n\tMichael Thompson, James Bost, Javier Tejedor-Sojo, Jimeng Sun\n\tKDD 2016, pp.1495-1504\n\n#### Running Med2Vec\n\n**STEP 1: Installation**  \n\n1. Install [python](https://www.python.org/), [Theano](http://deeplearning.net/software/theano/index.html). We use Python 2.7, Theano 0.7. Theano can be easily installed in Ubuntu as suggested [here](http://deeplearning.net/software/theano/install_ubuntu.html#install-ubuntu)\n\n2. If you plan to use GPU computation, install [CUDA](https://developer.nvidia.com/cuda-downloads)\n\n3. Download/clone the Med2Vec code  \n\n**STEP 2: Fast way to test Med2Vec with MIMIC-III**\n\nThis step describes how to run, with minimum number of steps, Med2Vec using MIMIC-III. \n\n0. 
You will first need to request access to [MIMIC-III](https://mimic.physionet.org/gettingstarted/access/), a publicly available electronic health record dataset collected from ICU patients over 11 years. \n\n1. You can use \"process_mimic.py\" to process the MIMIC-III dataset and generate a training dataset suitable for Med2Vec.\nPlace the script in the same location as the MIMIC-III CSV files, and run it. \nThe execution command is `python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file>`.\nInstructions are described inside the script. \n\n2. Run Med2Vec using the \".seqs\" file generated by process_mimic.py, using the following command.\n`python med2vec.py <seqs file> 4894 <output path>`\nwhere 4894 is the number of unique ICD9 diagnosis codes in the dataset.\nAs described in the paper, however, it is a good idea to use the grouped codes for training the Softmax component of Med2Vec. Therefore we recommend using the following command instead.\n`python med2vec.py <seqs file> 4894 <output path> --label_file <3digitICD9.seqs file> --n_output_codes 942`\nwhere 942 is the number of unique 3-digit ICD9 diagnosis codes in the dataset.\nYou can also use \".3digitICD9.seqs\" to begin with, if you are interested in learning the representations of 3-digit ICD9 codes only, using the following command.\n`python med2vec.py <3digitICD9.seqs file> 942 <output path>`\n\n3. As suggested in STEP 4, you might want to adjust the hyper-parameters. \nWe recommend decreasing `--batch_size` to around 100, since the default value of 1,000 is too large given the small number of patients in the MIMIC-III dataset. \nThere are only 7,500 patients who made more than a single visit, and most of them have only two visits.\n\n**STEP 3: Preparing training data**  \n\n1. Med2Vec training data need to be a Python-pickled list of lists of medical codes (e.g. diagnosis codes, medication codes, or procedure codes). \nFirst, each medical code needs to be converted to an integer. 
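This conversion can be done with a plain dictionary that assigns the next unused integer to every new code (a minimal sketch; the code strings below are made up, and the snippet is written to run under both Python 2 and 3):

```python
types = {}  # maps a code string to a unique integer

def to_int(code):
    # assign the next unused integer to a previously unseen code
    if code not in types:
        types[code] = len(types)
    return types[code]

# two hypothetical visits of one patient
visit1 = [to_int(c) for c in ['D_250.0', 'D_401.9', 'D_272.4']]  # [0, 1, 2]
visit2 = [to_int(c) for c in ['D_401.9', 'D_428.0']]             # [1, 3]
```

Note that the shared code 'D_401.9' maps to the same integer in both visits; this is the same scheme process_mimic.py uses to build its \".types\" dictionary.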
Then a single visit can be represented as a list of integers. \nFor example, [5,8,15] means the patient was assigned codes 5, 8, and 15 at a certain visit. \nIf a patient made two visits [1,2,3] and [4,5,6,7], these can be converted to a list of lists [[1,2,3], [4,5,6,7]]. \nIf there are multiple patients, each patient must be delimited by the list [-1]. \nFor example, [[1,2,3], [4,5,6,7], [-1], [2,4], [8,3,1], [3]] means there are two patients, where the first patient made two visits and the second patient made three visits. \nThis list of lists needs to be pickled using cPickle. We will refer to this file as the \"visit file\".\n\n2. The total number of unique medical codes is required to run Med2Vec. \nFor example, if the dataset uses 14,000 diagnosis codes and 11,000 procedure codes, the total number is 25,000. \nNote that using a huge number of codes could lead to memory problems, depending on your RAM/VRAM (thanks for the tip, [tRosenflanz](https://github.com/tRosenflanz)).\n\n3. For faster training, you can provide an additional dataset, which is simply the same dataset as in step 1, but with grouped medical codes. \nFor example, ICD9 diagnosis codes can be grouped into 283 categories by using [CCS](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp) groupers. \nYou will still be able to learn the code representations for the original un-grouped codes. \nThe grouped dataset is used only to speed up training. (Refer to section 4.4 of the paper.) \nThe grouped dataset should be prepared in the same way as the dataset in step 1. We will refer to this grouped dataset as the \"label file\".\n\n4. As in step 2, you will need to remember the total number of unique grouped codes if you plan to use this grouped dataset.\n\n5. If you wish to use patient demographic information (e.g. age, weight, gender), you need to create a demographics vector for each visit the patient made. 
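One way to build such per-visit vectors is sketched below (the ages, the 6-category ethnicity encoding, and the output file name are made up; Med2Vec only requires the final pickled matrix):

```python
import numpy as np
try:
    import cPickle as pickle  # Python 2, as used by the rest of the repo
except ImportError:
    import pickle  # Python 3 fallback

def demo_vector(age, ethnicity, n_ethnicities=6):
    # one real-valued age followed by a one-hot ethnicity encoding
    vec = [0.] * (1 + n_ethnicities)
    vec[0] = float(age)
    vec[1 + ethnicity] = 1.
    return vec

# two visits for one patient, one visit for another (made-up values)
rows = [demo_vector(45.0, 4), demo_vector(46.0, 4)]
rows.append([0.] * 7)              # all-zero delimiter row between patients
rows.append(demo_vector(63.0, 1))

demos = np.array(rows, dtype='float32')  # shape: (visits + delimiters, 7)
pickle.dump(demos, open('output.demo', 'wb'), -1)
```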
\nFor example, if you are using age (real-valued) and ethnicity (categorical, assume 6 categories), you can create a vector such as [45.0, 0, 0, 0, 0, 1, 0]. \nSimilar to the [-1] vector in step 1, each patient is delimited with an all-zero vector. \nTherefore the demographic information will be a pickled matrix whose column size is the size of the demographics vector and whose row size is the total number of visits of all patients plus the delimiters. \nWe will refer to this file as the \"demo file\".\n\n6. Similar to step 2, you will need to remember the size of the demographics vector if you plan to use the demo file. \nIn the example of step 5, the size of the demographics vector is 7.\n\n**STEP 4: Running Med2Vec**  \n\n1. The minimum input you need to run Med2Vec is the visit file, the number of unique medical codes, and the output path:\n`python med2vec.py <path/to/visit_file> <the number of unique medical codes> <path/to/output>`  \n\n2. Specifying the `--verbose` option will print the training progress after every 10 mini-batches.\n\n3. Additional options can be specified, such as the size of the code representation, the size of the visit representation, and the number of epochs. Detailed information can be accessed via `python med2vec.py --help`.\n\n**STEP 5: Looking at your results**  \n\nMed2Vec produces a model file after each epoch. The model file is generated by [numpy.savez_compressed](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.savez_compressed.html).\n\nThe 2D scatterplot of the learned code representations would look similar to [this](http://mp2893.com/scatterplot/nnsg_h200e49_category10.html).\n(This is the scatterplot of code representations trained with Non-negative Skip-gram, which is essentially Med2Vec minus the visit-level training.)\n"
  },
  {
    "path": "med2vec.py",
    "content": "#################################################################\n# Code written by Edward Choi (mp2893@gatech.edu)\n# For bug report, please contact author using the email address\n#################################################################\n\nimport sys, random\nimport numpy as np\nimport cPickle as pickle\nfrom collections import OrderedDict\nimport argparse\n\nimport theano\nimport theano.tensor as T\nfrom theano import config\n\ndef numpy_floatX(data):\n\treturn np.asarray(data, dtype=config.floatX)\n\ndef unzip(zipped):\n\tnew_params = OrderedDict()\n\tfor k, v in zipped.iteritems():\n\t\tnew_params[k] = v.get_value()\n\treturn new_params\n\ndef init_params(options):\n\tparams = OrderedDict()\n\n\tnumXcodes = options['numXcodes']\n\tnumYcodes = options['numYcodes']\n\tembDimSize= options['embDimSize']\n\tdemoSize = options['demoSize']\n\thiddenDimSize = options['hiddenDimSize']\n\n\tparams['W_emb'] = np.random.uniform(-0.01, 0.01, (numXcodes, embDimSize)).astype(config.floatX) #emb matrix needs an extra dimension for the time\n\tparams['b_emb'] = np.zeros(embDimSize).astype(config.floatX)\n\tparams['W_hidden'] = np.random.uniform(-0.01, 0.01, (embDimSize+demoSize, hiddenDimSize)).astype(config.floatX) #emb matrix needs an extra dimension for the time\n\tparams['b_hidden'] = np.zeros(hiddenDimSize).astype(config.floatX)\n\tif numYcodes > 0:\n\t\tparams['W_output'] = np.random.uniform(-0.01, 0.01, (hiddenDimSize, numYcodes)).astype(config.floatX) #emb matrix needs an extra dimension for the time\n\t\tparams['b_output'] = np.zeros(numYcodes).astype(config.floatX)\n\telse:\n\t\tparams['W_output'] = np.random.uniform(-0.01, 0.01, (hiddenDimSize, numXcodes)).astype(config.floatX) #emb matrix needs an extra dimension for the time\n\t\tparams['b_output'] = np.zeros(numXcodes).astype(config.floatX)\n\n\treturn params\n\ndef load_params(options):\n\tparams = np.load(options['modelFile'])\n\treturn params\n\ndef init_tparams(params):\n\ttparams = 
OrderedDict()\n\tfor k, v in params.iteritems():\n\t\ttparams[k] = theano.shared(v, name=k)\n\treturn tparams\n\ndef build_model(tparams, options):\n\tx = T.matrix('x', dtype=config.floatX)\n\td = T.matrix('d', dtype=config.floatX)\n\ty = T.matrix('y', dtype=config.floatX)\n\tmask = T.vector('mask', dtype=config.floatX)\n\n\tlogEps = options['logEps']\n\n\temb = T.maximum(T.dot(x, tparams['W_emb']) + tparams['b_emb'],0)\n\tif options['demoSize'] > 0: emb = T.concatenate((emb, d), axis=1)\n\tvisit = T.maximum(T.dot(emb, tparams['W_hidden']) + tparams['b_hidden'],0)\n\tresults = T.nnet.softmax(T.dot(visit, tparams['W_output']) + tparams['b_output'])\n\t\n\tmask1 = (mask[:-1] * mask[1:])[:,None]\n\tmask2 = (mask[:-2] * mask[1:-1] * mask[2:])[:,None]\n\tmask3 = (mask[:-3] * mask[1:-2] * mask[2:-1] * mask[3:])[:,None]\n\tmask4 = (mask[:-4] * mask[1:-3] * mask[2:-2] * mask[3:-1] * mask[4:])[:,None]\n\tmask5 = (mask[:-5] * mask[1:-4] * mask[2:-3] * mask[3:-2] * mask[4:-1] * mask[5:])[:,None]\n\n\tt = None\n\tif options['numYcodes'] > 0: t = y\n\telse: t = x\n\n\tforward_results =  results[:-1] * mask1\n\tforward_cross_entropy = -(t[1:] * T.log(forward_results + logEps) + (1. - t[1:]) * T.log(1. - forward_results + logEps))\n\n\tforward_results2 =  results[:-2] * mask2\n\tforward_cross_entropy2 = -(t[2:] * T.log(forward_results2 + logEps) + (1. - t[2:]) * T.log(1. - forward_results2 + logEps))\n\n\tforward_results3 =  results[:-3] * mask3\n\tforward_cross_entropy3 = -(t[3:] * T.log(forward_results3 + logEps) + (1. - t[3:]) * T.log(1. - forward_results3 + logEps))\n\n\tforward_results4 =  results[:-4] * mask4\n\tforward_cross_entropy4 = -(t[4:] * T.log(forward_results4 + logEps) + (1. - t[4:]) * T.log(1. - forward_results4 + logEps))\n\n\tforward_results5 =  results[:-5] * mask5\n\tforward_cross_entropy5 = -(t[5:] * T.log(forward_results5 + logEps) + (1. - t[5:]) * T.log(1. 
- forward_results5 + logEps))\n\n\tbackward_results =  results[1:] * mask1\n\tbackward_cross_entropy = -(t[:-1] * T.log(backward_results + logEps) + (1. - t[:-1]) * T.log(1. - backward_results + logEps))\n\n\tbackward_results2 =  results[2:] * mask2\n\tbackward_cross_entropy2 = -(t[:-2] * T.log(backward_results2 + logEps) + (1. - t[:-2]) * T.log(1. - backward_results2 + logEps))\n\n\tbackward_results3 =  results[3:] * mask3\n\tbackward_cross_entropy3 = -(t[:-3] * T.log(backward_results3 + logEps) + (1. - t[:-3]) * T.log(1. - backward_results3 + logEps))\n\n\tbackward_results4 =  results[4:] * mask4\n\tbackward_cross_entropy4 = -(t[:-4] * T.log(backward_results4 + logEps) + (1. - t[:-4]) * T.log(1. - backward_results4 + logEps))\n\n\tbackward_results5 =  results[5:] * mask5\n\tbackward_cross_entropy5 = -(t[:-5] * T.log(backward_results5 + logEps) + (1. - t[:-5]) * T.log(1. - backward_results5 + logEps))\n\n\tvisit_cost1 = (forward_cross_entropy.sum(axis=1).sum(axis=0) + backward_cross_entropy.sum(axis=1).sum(axis=0)) / (mask1.sum() + logEps)\n\tvisit_cost2 = (forward_cross_entropy2.sum(axis=1).sum(axis=0) + backward_cross_entropy2.sum(axis=1).sum(axis=0)) / (mask2.sum() + logEps)\n\tvisit_cost3 = (forward_cross_entropy3.sum(axis=1).sum(axis=0) + backward_cross_entropy3.sum(axis=1).sum(axis=0)) / (mask3.sum() + logEps)\n\tvisit_cost4 = (forward_cross_entropy4.sum(axis=1).sum(axis=0) + backward_cross_entropy4.sum(axis=1).sum(axis=0)) / (mask4.sum() + logEps)\n\tvisit_cost5 = (forward_cross_entropy5.sum(axis=1).sum(axis=0) + backward_cross_entropy5.sum(axis=1).sum(axis=0)) / (mask5.sum() + logEps)\n\n\twindowSize = options['windowSize']\n\tvisit_cost = visit_cost1\n\tif windowSize == 2:\n\t\tvisit_cost = visit_cost1 + visit_cost2\n\telif windowSize == 3:\n\t\tvisit_cost = visit_cost1 + visit_cost2 + visit_cost3\n\telif windowSize == 4:\n\t\tvisit_cost = visit_cost1 + visit_cost2 + visit_cost3 + visit_cost4\n\telif windowSize == 5:\n\t\tvisit_cost = visit_cost1 + 
visit_cost2 + visit_cost3 + visit_cost4 + visit_cost5\n\n\tiVector = T.vector('iVector', dtype='int32')\n\tjVector = T.vector('jVector', dtype='int32')\n\tpreVec = T.maximum(tparams['W_emb'],0)\n\tnorms = (T.exp(T.dot(preVec, preVec.T))).sum(axis=1)\n\temb_cost = -T.log((T.exp((preVec[iVector] * preVec[jVector]).sum(axis=1)) / norms[iVector]) + logEps)\n\n\ttotal_cost = visit_cost + T.mean(emb_cost) + options['L2_reg'] * (tparams['W_emb'] ** 2).sum()\n\n\tif options['demoSize'] > 0 and options['numYcodes'] > 0: return x, d, y, mask, iVector, jVector, total_cost\n\telif options['demoSize'] == 0 and options['numYcodes'] > 0: return x, y, mask, iVector, jVector, total_cost\n\telif options['demoSize'] > 0 and options['numYcodes'] == 0: return x, d, mask, iVector, jVector, total_cost\n\telse: return x, mask, iVector, jVector, total_cost\n\ndef adadelta(tparams, grads, x, mask, iVector, jVector, cost, options, d=None, y=None):\n\tzipped_grads = [theano.shared(p.get_value() * numpy_floatX(0.), name='%s_grad' % k) for k, p in tparams.iteritems()]\n\trunning_up2 = [theano.shared(p.get_value() * numpy_floatX(0.), name='%s_rup2' % k) for k, p in tparams.iteritems()]\n\trunning_grads2 = [theano.shared(p.get_value() * numpy_floatX(0.), name='%s_rgrad2' % k) for k, p in tparams.iteritems()]\n\n\tzgup = [(zg, g) for zg, g in zip(zipped_grads, grads)]\n\trg2up = [(rg2, 0.95 * rg2 + 0.05 * (g ** 2)) for rg2, g in zip(running_grads2, grads)]\n\n\tif options['demoSize'] > 0 and options['numYcodes'] > 0:\n\t\tf_grad_shared = theano.function([x, d, y, mask, iVector, jVector], cost, updates=zgup + rg2up, name='adadelta_f_grad_shared')\n\telif options['demoSize'] == 0 and options['numYcodes'] > 0:\n\t\tf_grad_shared = theano.function([x, y, mask, iVector, jVector], cost, updates=zgup + rg2up, name='adadelta_f_grad_shared')\n\telif options['demoSize'] > 0 and options['numYcodes'] == 0:\n\t\tf_grad_shared = theano.function([x, d, mask, iVector, jVector], cost, updates=zgup + rg2up, 
name='adadelta_f_grad_shared')\n\telse:\n\t\tf_grad_shared = theano.function([x, mask, iVector, jVector], cost, updates=zgup + rg2up, name='adadelta_f_grad_shared')\n\n\tupdir = [-T.sqrt(ru2 + 1e-6) / T.sqrt(rg2 + 1e-6) * zg for zg, ru2, rg2 in zip(zipped_grads, running_up2, running_grads2)]\n\tru2up = [(ru2, 0.95 * ru2 + 0.05 * (ud ** 2)) for ru2, ud in zip(running_up2, updir)]\n\tparam_up = [(p, p + ud) for p, ud in zip(tparams.values(), updir)]\n\n\tf_update = theano.function([], [], updates=ru2up + param_up, on_unused_input='ignore', name='adadelta_f_update')\n\n\treturn f_grad_shared, f_update\n\ndef load_data(xFile, dFile, yFile):\n\tseqX = np.array(pickle.load(open(xFile, 'rb')))\n\tseqD = []\n\tif len(dFile) > 0: seqD = np.asarray(pickle.load(open(dFile, 'rb')), dtype=config.floatX)\n\tseqY = []\n\tif len(yFile) > 0: seqY = np.array(pickle.load(open(yFile, 'rb')))\n\treturn seqX, seqD, seqY\n\ndef pickTwo(codes, iVector, jVector):\n\tfor first in codes:\n\t\tfor second in codes:\n\t\t\tif first == second: continue\n\t\t\tiVector.append(first)\n\t\t\tjVector.append(second)\n\t\ndef padMatrix(seqs, labels, options):\n\tn_samples = len(seqs)\n\tiVector = []\n\tjVector = []\n\tnumXcodes = options['numXcodes']\n\tnumYcodes = options['numYcodes']\n\n\tif numYcodes > 0:\n\t\tx = np.zeros((n_samples, numXcodes)).astype(config.floatX)\n\t\ty = np.zeros((n_samples, numYcodes)).astype(config.floatX)\n\t\tmask = np.zeros((n_samples,)).astype(config.floatX)\n\t\tfor idx, (seq, label) in enumerate(zip(seqs, labels)):\n\t\t\tif not seq[0] == -1:\n\t\t\t\tx[idx][seq] = 1.\n\t\t\t\ty[idx][label] = 1.\n\t\t\t\tpickTwo(seq, iVector, jVector)\n\t\t\t\tmask[idx] = 1.\n\t\treturn x, y, mask, iVector, jVector\n\telse:\n\t\tx = np.zeros((n_samples, numXcodes)).astype(config.floatX)\n\t\tmask = np.zeros((n_samples,)).astype(config.floatX)\n\t\tfor idx, seq in enumerate(seqs):\n\t\t\tif not seq[0] == -1:\n\t\t\t\tx[idx][seq] = 1.\n\t\t\t\tpickTwo(seq, iVector, 
jVector)\n\t\t\t\tmask[idx] = 1.\n\t\treturn x, mask, iVector, jVector\n\ndef train_med2vec(seqFile='seqFile.txt', \n\t\t\t\tdemoFile='demoFile.txt',\n\t\t\t\tlabelFile='labelFile.txt',\n\t\t\t\toutFile='outFile.txt',\n\t\t\t\tmodelFile='modelFile.txt',\n\t\t\t\tL2_reg=0.001,\n\t\t\t\tnumXcodes=20000, \n\t\t\t\tnumYcodes=20000, \n\t\t\t\tembDimSize=1000,\n\t\t\t\thiddenDimSize=2000,\n\t\t\t\tbatchSize=100,\n\t\t\t\tdemoSize=2,\n\t\t\t\tlogEps=1e-8,\n\t\t\t\twindowSize=1,\n\t\t\t\tverbose=False,\n\t\t\t\tmaxEpochs=1000):\n\n\toptions = locals().copy()\n\tprint 'initializing parameters'\n\tparams = init_params(options)\n\t#params = load_params(options)\n\ttparams = init_tparams(params)\n\n\tprint 'building models'\n\tf_grad_shared = None\n\tf_update = None\n\tif demoSize > 0 and numYcodes > 0:\n\t\tx, d, y, mask, iVector, jVector, cost = build_model(tparams, options)\n\t\tgrads = T.grad(cost, wrt=tparams.values())\n\t\tf_grad_shared, f_update = adadelta(tparams, grads, x, mask, iVector, jVector, cost, options, d=d, y=y)\n\telif demoSize == 0 and numYcodes > 0:\n\t\tx, y, mask, iVector, jVector, cost = build_model(tparams, options)\n\t\tgrads = T.grad(cost, wrt=tparams.values())\n\t\tf_grad_shared, f_update = adadelta(tparams, grads, x, mask, iVector, jVector, cost, options, y=y)\n\telif demoSize > 0 and numYcodes == 0:\n\t\tx, d, mask, iVector, jVector, cost = build_model(tparams, options)\n\t\tgrads = T.grad(cost, wrt=tparams.values())\n\t\tf_grad_shared, f_update = adadelta(tparams, grads, x, mask, iVector, jVector, cost, options, d=d)\n\telse:\n\t\tx, mask, iVector, jVector, cost = build_model(tparams, options)\n\t\tgrads = T.grad(cost, wrt=tparams.values())\n\t\tf_grad_shared, f_update = adadelta(tparams, grads, x, mask, iVector, jVector, cost, options)\n\n\tprint 'loading data'\n\tseqs, demos, labels = load_data(seqFile, demoFile, labelFile)\n\tn_batches = int(np.ceil(float(len(seqs)) / float(batchSize)))\n\n\tprint 'training start'\n\tfor epoch in 
xrange(maxEpochs):\n\t\titeration = 0\n\t\tcostVector = []\n\t\tfor index in random.sample(range(n_batches), n_batches):\n\t\t\tbatchX = seqs[batchSize*index:batchSize*(index+1)]\n\t\t\tbatchY = []\n\t\t\tbatchD = []\n\t\t\tif demoSize > 0 and numYcodes > 0:\n\t\t\t\tbatchY = labels[batchSize*index:batchSize*(index+1)]\n\t\t\t\tx, y, mask, iVector, jVector = padMatrix(batchX, batchY, options)\n\t\t\t\tbatchD = demos[batchSize*index:batchSize*(index+1)]\n\t\t\t\tcost = f_grad_shared(x, batchD, y, mask, iVector, jVector)\n\t\t\telif demoSize == 0 and numYcodes > 0:\n\t\t\t\tbatchY = labels[batchSize*index:batchSize*(index+1)]\n\t\t\t\tx, y, mask, iVector, jVector = padMatrix(batchX, batchY, options)\n\t\t\t\tcost = f_grad_shared(x, y, mask, iVector, jVector)\n\t\t\telif demoSize > 0 and numYcodes == 0:\n\t\t\t\tx, mask, iVector, jVector = padMatrix(batchX, batchY, options)\n\t\t\t\tbatchD = demos[batchSize*index:batchSize*(index+1)]\n\t\t\t\tcost = f_grad_shared(x, batchD, mask, iVector, jVector)\n\t\t\telse:\n\t\t\t\tx, mask, iVector, jVector = padMatrix(batchX, batchY, options)\n\t\t\t\tcost = f_grad_shared(x, mask, iVector, jVector)\n\t\t\tcostVector.append(cost)\n\t\t\tf_update()\n\t\t\tif (iteration % 10 == 0) and verbose: print 'epoch:%d, iteration:%d/%d, cost:%f' % (epoch, iteration, n_batches, cost)\n\t\t\titeration += 1\n\t\tprint 'epoch:%d, mean_cost:%f' % (epoch, np.mean(costVector))\n\t\ttempParams = unzip(tparams)\n\t\tnp.savez_compressed(outFile + '.' + str(epoch), **tempParams)\n\ndef parse_arguments(parser):\n\tparser.add_argument('seq_file', type=str, metavar='<visit_file>', help='The path to the Pickled file containing visit information of patients')\n\tparser.add_argument('n_input_codes', type=int, metavar='<n_input_codes>', help='The number of unique input medical codes')\n\tparser.add_argument('out_file', type=str, metavar='<out_file>', help='The path to the output models. 
The models will be saved after every epoch')\n\tparser.add_argument('--label_file', type=str, default='', help='The path to the Pickled file containing grouped visit information of patients. If you are not using a grouped output, do not use this option')\n\tparser.add_argument('--n_output_codes', type=int, default=0, help='The number of unique output medical codes (the number of unique grouped codes). If you are not using a grouped output, do not use this option')\n\tparser.add_argument('--demo_file', type=str, default='', help='The path to the Pickled file containing demographic information of patients. If you are not using patient demographic information, do not use this option')\n\tparser.add_argument('--demo_size', type=int, default=0, help='The size of the demographic information vector. If you are not using patient demographic information, do not use this option')\n\tparser.add_argument('--cr_size', type=int, default=200, help='The size of the code representation (default value: 200)')\n\tparser.add_argument('--vr_size', type=int, default=200, help='The size of the visit representation (default value: 200)')\n\tparser.add_argument('--batch_size', type=int, default=1000, help='The size of a single mini-batch (default value: 1000)')\n\tparser.add_argument('--n_epoch', type=int, default=10, help='The number of training epochs (default value: 10)')\n\tparser.add_argument('--L2_reg', type=float, default=0.001, help='L2 regularization for the code representation matrix W_c (default value: 0.001)')\n\tparser.add_argument('--window_size', type=int, default=1, choices=[1,2,3,4,5], help='The size of the visit context window (range: 1,2,3,4,5), (default value: 1)')\n\tparser.add_argument('--log_eps', type=float, default=1e-8, help='A small value to prevent log(0) (default value: 1e-8)')\n\tparser.add_argument('--verbose', action='store_true', help='Print output after every 10 mini-batches')\n\targs = parser.parse_args()\n\treturn args\n\nif __name__ == 
'__main__':\n\tparser = argparse.ArgumentParser()\n\targs = parse_arguments(parser)\n\n\ttrain_med2vec(seqFile=args.seq_file, demoFile=args.demo_file, labelFile=args.label_file, outFile=args.out_file, numXcodes=args.n_input_codes, numYcodes=args.n_output_codes, embDimSize=args.cr_size, hiddenDimSize=args.vr_size, batchSize=args.batch_size, maxEpochs=args.n_epoch, L2_reg=args.L2_reg, demoSize=args.demo_size, windowSize=args.window_size, logEps=args.log_eps, verbose=args.verbose)\n"
  },
  {
    "path": "process_mimic.py",
    "content": "# This script processes MIMIC-III dataset and builds longitudinal diagnosis records for patients with at least two visits.\n# The output data are cPickled, and suitable for training Doctor AI or RETAIN\n# Written by Edward Choi (mp2893@gatech.edu)\n# Usage: Put this script to the foler where MIMIC-III CSV files are located. Then execute the below command.\n# python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file> \n\n# Output files\n# <output file>.seqs: Dataset that follows the format described in the README.md.\n# <output file>.types: Python dictionary that maps string diagnosis codes to integer diagnosis codes.\n# <output file>.3digitICD9.seqs: Dataset that follows the format described in the README.md. This uses only the first 3 digits of the ICD9 diagnosis code.\n# <output file>.3digitICD9.types: Python dictionary that maps 3-digit string diagnosis codes to integer diagnosis codes.\n\nimport sys\nimport cPickle as pickle\nfrom datetime import datetime\n\ndef convert_to_icd9(dxStr):\n\tif dxStr.startswith('E'):\n\t\tif len(dxStr) > 4: return dxStr[:4] + '.' + dxStr[4:]\n\t\telse: return dxStr\n\telse:\n\t\tif len(dxStr) > 3: return dxStr[:3] + '.' 
+ dxStr[3:]\n\telse: return dxStr\n\t\ndef convert_to_3digit_icd9(dxStr):\n\tif dxStr.startswith('E'):\n\t\tif len(dxStr) > 4: return dxStr[:4]\n\t\telse: return dxStr\n\telse:\n\t\tif len(dxStr) > 3: return dxStr[:3]\n\t\telse: return dxStr\n\nif __name__ == '__main__':\n\tadmissionFile = sys.argv[1]\n\tdiagnosisFile = sys.argv[2]\n\toutFile = sys.argv[3]\n\n\tprint 'Building pid-admission mapping, admission-date mapping'\n\tpidAdmMap = {}\n\tadmDateMap = {}\n\tinfd = open(admissionFile, 'r')\n\tinfd.readline()\n\tfor line in infd:\n\t\ttokens = line.strip().split(',')\n\t\tpid = int(tokens[1])\n\t\tadmId = int(tokens[2])\n\t\tadmTime = datetime.strptime(tokens[3], '%Y-%m-%d %H:%M:%S')\n\t\tadmDateMap[admId] = admTime\n\t\tif pid in pidAdmMap: pidAdmMap[pid].append(admId)\n\t\telse: pidAdmMap[pid] = [admId]\n\tinfd.close()\n\n\tprint 'Building admission-dxList mapping'\n\tadmDxMap = {}\n\tadmDxMap_3digit = {}\n\tinfd = open(diagnosisFile, 'r')\n\tinfd.readline()\n\tfor line in infd:\n\t\ttokens = line.strip().split(',')\n\t\tadmId = int(tokens[2])\n\t\tdxStr = 'D_' + convert_to_icd9(tokens[4][1:-1]) # tokens[4][1:-1] strips the double quotes around the ICD9 code\n\t\tdxStr_3digit = 'D_' + convert_to_3digit_icd9(tokens[4][1:-1])\n\n\t\tif admId in admDxMap: \n\t\t\tadmDxMap[admId].append(dxStr)\n\t\telse: \n\t\t\tadmDxMap[admId] = [dxStr]\n\n\t\tif admId in admDxMap_3digit: \n\t\t\tadmDxMap_3digit[admId].append(dxStr_3digit)\n\t\telse: \n\t\t\tadmDxMap_3digit[admId] = [dxStr_3digit]\n\tinfd.close()\n\n\tprint 'Building pid-sortedVisits mapping'\n\tpidSeqMap = {}\n\tpidSeqMap_3digit = {}\n\tfor pid, admIdList in pidAdmMap.iteritems():\n\t\tif len(admIdList) < 2: continue\n\n\t\tsortedList = sorted([(admDateMap[admId], admDxMap[admId]) for admId in admIdList])\n\t\tpidSeqMap[pid] = sortedList\n\n\t\tsortedList_3digit = sorted([(admDateMap[admId], admDxMap_3digit[admId]) for admId in admIdList])\n\t\tpidSeqMap_3digit[pid] = 
sortedList_3digit\n\t\n\tprint 'Building pids, dates, strSeqs'\n\tpids = []\n\tdates = []\n\tseqs = []\n\tfor pid, visits in pidSeqMap.iteritems():\n\t\tpids.append(pid)\n\t\tseq = []\n\t\tdate = []\n\t\tfor visit in visits:\n\t\t\tdate.append(visit[0])\n\t\t\tseq.append(visit[1])\n\t\tdates.append(date)\n\t\tseqs.append(seq)\n\t\n\tprint 'Building pids, dates, strSeqs for 3digit ICD9 code'\n\tseqs_3digit = []\n\tfor pid, visits in pidSeqMap_3digit.iteritems():\n\t\tseq = []\n\t\tfor visit in visits:\n\t\t\tseq.append(visit[1])\n\t\tseqs_3digit.append(seq)\n\t\n\tprint 'Converting strSeqs to intSeqs, and making types'\n\ttypes = {}\n\tnewSeqs = []\n\tfor patient in seqs:\n\t\tnewPatient = []\n\t\tfor visit in patient:\n\t\t\tnewVisit = []\n\t\t\tfor code in visit:\n\t\t\t\tif code in types:\n\t\t\t\t\tnewVisit.append(types[code])\n\t\t\t\telse:\n\t\t\t\t\ttypes[code] = len(types)\n\t\t\t\t\tnewVisit.append(types[code])\n\t\t\tnewPatient.append(newVisit)\n\t\tnewSeqs.append(newPatient)\n\t\n\tprint 'Converting strSeqs to intSeqs, and making types for 3digit ICD9 code'\n\ttypes_3digit = {}\n\tnewSeqs_3digit = []\n\tfor patient in seqs_3digit:\n\t\tnewPatient = []\n\t\tfor visit in patient:\n\t\t\tnewVisit = []\n\t\t\tfor code in set(visit):\n\t\t\t\tif code in types_3digit:\n\t\t\t\t\tnewVisit.append(types_3digit[code])\n\t\t\t\telse:\n\t\t\t\t\ttypes_3digit[code] = len(types_3digit)\n\t\t\t\t\tnewVisit.append(types_3digit[code])\n\t\t\tnewPatient.append(newVisit)\n\t\tnewSeqs_3digit.append(newPatient)\n\t\n\tprint 'Re-formatting to Med2Vec dataset'\n\tseqs = []\n\tfor patient in newSeqs:\n\t\tseqs.extend(patient)\n\t\tseqs.append([-1])\n\tseqs = seqs[:-1]\n\n\tseqs_3digit = []\n\tfor patient in newSeqs_3digit:\n\t\tseqs_3digit.extend(patient)\n\t\tseqs_3digit.append([-1])\n\tseqs_3digit = seqs_3digit[:-1]\n\n\tpickle.dump(seqs, open(outFile+'.seqs', 'wb'), -1)\n\tpickle.dump(types, open(outFile+'.types', 'wb'), -1)\n\tpickle.dump(seqs_3digit, 
open(outFile+'.3digitICD9.seqs', 'wb'), -1)\n\tpickle.dump(types_3digit, open(outFile+'.3digitICD9.types', 'wb'), -1)\n"
  }
]