Repository: mp2893/retain
Branch: master
Commit: 9fd39c46e44b
Files: 5
Total size: 54.2 KB
Directory structure:
gitextract_9kcmn7wq/
├── LICENSE
├── README.md
├── process_mimic.py
├── retain.py
└── test_retain.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
Copyright (c) 2016, mp2893
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of RETAIN nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: README.md
================================================
RETAIN
=========================================
RETAIN is an interpretable predictive model for healthcare applications. Given patient records, it can make predictions while explaining how each medical code (diagnosis codes, medication codes, or procedure codes) at each visit contributes to the prediction. This interpretability comes from its use of a neural attention mechanism.
[RETAIN Interpretation Demo](https://youtu.be/co3lTOSgFlA?t=1m46s "RETAIN Interpretation Demo - Click to Watch!")
Using RETAIN, you can calculate how positively/negatively each medical code (diagnosis, medication, or procedure code) at different visits contributes to the final score. In this case, we are predicting whether the given patient will be diagnosed with Heart Failure (HF). You can see that the codes that are highly related to HF make positive contributions. RETAIN also learns to pay more attention to new information than to old information. You can see that Cardiac Dysrhythmia (CD) makes a bigger contribution because it occurs in the more recent visit.
#### Relevant Publications
RETAIN implements an algorithm introduced in the following [paper](http://papers.nips.cc/paper/6321-retain-an-interpretable-predictive-model-for-healthcare-using-reverse-time-attention-mechanism):
RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism
Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, Jimeng Sun,
NIPS 2016, pp.3504-3512
#### Notice
The RETAIN paper formulates the model as being able to make a prediction at each timestep (e.g. trying to predict what diagnoses the patient will receive at each visit), and treats sequence classification (e.g. given a patient record, will the patient be diagnosed with heart failure in the future?) as a special case, since sequence classification makes a prediction at the last timestep only.
This code, however, implements only the sequence classification task. For example, you can use this code to predict whether a given patient is a heart failure patient or not, or whether the patient will be readmitted in the future. A more general version of RETAIN will be released in the future.
#### Running RETAIN
**STEP 1: Installation**
1. Install [python](https://www.python.org/) and [Theano](http://deeplearning.net/software/theano/index.html). We use Python 2.7 and Theano 0.8. Theano can be easily installed on Ubuntu as suggested [here](http://deeplearning.net/software/theano/install_ubuntu.html#install-ubuntu)
2. If you plan to use GPU computation, install [CUDA](https://developer.nvidia.com/cuda-downloads)
3. Download/clone the RETAIN code
**STEP 2: Fast way to test RETAIN with MIMIC-III**
This step describes how to train RETAIN, with a minimal number of steps, using MIMIC-III to predict patient mortality from visit records.
0. You will first need to request access to [MIMIC-III](https://mimic.physionet.org/gettingstarted/access/), a publicly available electronic health record dataset collected from ICU patients over 11 years.
1. You can use "process_mimic.py" to process MIMIC-III dataset and generate a suitable training dataset for RETAIN.
Place the script in the same location as the MIMIC-III CSV files, and run the script.
The execution command is `python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv PATIENTS.csv <output file>`.
2. Run RETAIN using the ".seqs" and ".morts" file generated by process_mimic.py.
The ".seqs" file contains the sequence of visits for each patient. Each visit consists of multiple diagnosis codes.
However, we recommend using the ".3digitICD9.seqs" file instead, as the results will be much more interpretable.
(Or you could use the [Single-level Clinical Classification Software for ICD9](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp#examples) to reduce the number of codes to a few hundred, which will improve the performance even more.)
The ".morts" file contains the sequence of mortality labels for each patient.
The command is `python retain.py <3digitICD9.seqs file> 942 <morts file> <output path> --simple_load --n_epochs 100 --keep_prob_context 0.8 --keep_prob_emb 0.5`.
`942` is the number of unique 3-digit ICD9 codes used in the dataset.
3. To test the model for interpretation, please refer to Step 6. I personally found that _perinatal jaundice (ICD9 774)_ has high correlation with mortality.
4. The model reaches AUC above 0.8 with the above command, but the interpretations are not super clear.
You could tune the hyper-parameters, but I doubt things will dramatically improve.
After all, only 7,500 patients made more than a single hospital visit, and most of them made only two visits.
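The vocabulary-size argument (`942` above) can be recomputed from the generated files rather than hard-coded. A minimal sketch (Python 3 `pickle` shown here; the repo itself targets Python 2, where `cPickle` is used instead), assuming the ".3digitICD9.types" file produced by process_mimic.py:

```python
import pickle

def count_unique_codes(types_path):
    """Return the vocabulary size expected by retain.py.

    A ".types" file is a dict mapping string codes to integer ids,
    so its length equals the number of unique medical codes.
    """
    with open(types_path, 'rb') as f:
        types = pickle.load(f)
    return len(types)

# In-memory stand-in for a ".3digitICD9.types" dict:
fake_types = {'D_428': 0, 'D_774': 1, 'D_427': 2}
print(len(fake_types))  # vocabulary size to pass on the command line
```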
**STEP 3: How to prepare your own dataset**
1. RETAIN's training dataset needs to be a Python cPickled list of list of list. The outermost list corresponds to patients, the intermediate to the visit sequence each patient made, and the innermost to the medical codes (e.g. diagnosis codes, medication codes, procedure codes, etc.) that occurred within each visit.
First, medical codes need to be converted to integers. Then a single visit can be seen as a list of integers, and a patient as a list of visits.
For example, [5,8,15] means the patient was assigned codes 5, 8, and 15 at a certain visit.
If a patient made two visits [1,2,3] and [4,5,6,7], it can be converted to a list of list [[1,2,3], [4,5,6,7]].
Multiple patients can be represented as [[[1,2,3], [4,5,6,7]], [[2,4], [8,3,1], [3]]], which means there are two patients where the first patient made two visits and the second patient made three visits.
This list of list of list needs to be pickled using cPickle. We will refer to this file as the "visit file".
2. The total number of unique medical codes is required to run RETAIN.
For example, if the dataset is using 14,000 diagnosis codes and 11,000 procedure codes, the total number is 25,000.
3. The label dataset (let us call this "label file") needs to be a Python cPickled list. Each element corresponds to the true label of each patient. For example, 1 can be the case patient and 0 can be the control patient. If there are two patients where only the first patient is a case, then we should have [1,0].
4. The "visit file" and "label file" need to have 3 sets respectively: training set, validation set, and test set.
The file extensions must be ".train", ".valid", and ".test" respectively.
For example, if you want to use a file named "my_visit_sequences" as the "visit file", then RETAIN will try to load "my_visit_sequences.train", "my_visit_sequences.valid", and "my_visit_sequences.test".
This is also true for the "label file".
5. You can use the time information regarding the visits as an additional source of information. Let us call this "time file".
Note that the time information could be anything: duration between consecutive visits, cumulative number of days since the first visit, etc.
"time file" needs to be prepared as a Python cPickled list of list. The outermost list corresponds to patients, and the innermost to the time information of each visit.
For example, given a "visit file" [[[1,2,3], [4,5,6,7]], [[2,4], [8,3,1], [3]]], its corresponding "time file" could look like [[0, 15], [0, 45, 23]], if we are using the duration between the consecutive visits. (of course the numbers are fake, and I've set the duration for the first visit to zero.)
Use the `--time_file <path to time file>` option to use a "time file".
Remember that the ".train", ".valid", ".test" rule also applies to the "time file".
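The file layout described above can be sketched end-to-end. A minimal example (Python 3 `pickle` here; under Python 2 the repo uses `cPickle`), with made-up toy patients and hypothetical file names such as "my_visit_sequences":

```python
import pickle

# Two patients: the first made two visits, the second made three.
visits = [[[1, 2, 3], [4, 5, 6, 7]], [[2, 4], [8, 3, 1], [3]]]
labels = [1, 0]                  # first patient is a case, second a control
times  = [[0, 15], [0, 45, 23]]  # days between consecutive visits

# RETAIN loads "<name>.train", "<name>.valid", "<name>.test"; here we
# dump the same toy data to all three splits purely for illustration.
for split in ('train', 'valid', 'test'):
    with open('my_visit_sequences.' + split, 'wb') as f:
        pickle.dump(visits, f, -1)
    with open('my_labels.' + split, 'wb') as f:
        pickle.dump(labels, f, -1)
    with open('my_times.' + split, 'wb') as f:
        pickle.dump(times, f, -1)

# The "# codes" argument is the number of unique integers in the visit file.
n_codes = len({c for patient in visits for visit in patient for c in visit})
print(n_codes)  # 8
```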
**Additional: Using your own medical code representations**
RETAIN internally learns the vector representation of medical codes while training. These vectors are initialized with random values of course.
You can, however, also use your own medical code representations, if you have one. (They can be trained by using Skip-gram like algorithms. Refer to [Med2Vec](http://www.kdd.org/kdd2016/subtopic/view/multi-layer-representation-learning-for-medical-concepts) or [this](http://arxiv.org/abs/1602.03686) for further details.)
If you want to provide the medical code representations, the file has to be a list of lists (basically a matrix) of N rows and M columns, where N is the number of unique codes in your "visit file" and M is the size of the code representations.
Specify the path to your code representation file using `--embed_file <path to embedding file>`.
Additionally, even if you use your own medical code representations, you can re-train (a.k.a. fine-tune) them as you train RETAIN.
Use the `--embed_finetune` option to do this. If you are not providing your own medical code representations, RETAIN will use randomly initialized ones, which obviously require this fine-tuning process. Since fine-tuning is the default behavior, you do not need to worry about this.
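A code-representation file for `--embed_file` is just a pickled N x M matrix. A minimal sketch (the filename is made up, and random values stand in for vectors you would actually train with, e.g., Med2Vec):

```python
import pickle
import random

n_codes, emb_dim = 942, 128   # N unique codes, M-dimensional vectors
random.seed(0)

# List of N rows, each a list of M floats. Real values would come
# from a pretrained embedding model rather than random.uniform.
embeddings = [[random.uniform(-0.1, 0.1) for _ in range(emb_dim)]
              for _ in range(n_codes)]

with open('my_code_embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f, -1)
```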
**STEP 4: Running RETAIN**
1. The minimum input you need to run RETAIN is the "visit file", the number of unique medical codes in the "visit file",
the "label file", and the output path. The output path is where the learned weights and the log will be saved.
`python retain.py <visit file> <# codes in the visit file> <label file> <output path>`
2. Specifying the `--verbose` option will print the training progress after every 10 mini-batches.
3. You can specify the size of the embedding W_emb, the size of the hidden layer of the GRU that generates alpha, and the size of the hidden layer of the GRU that generates beta.
The respective commands are `--embed_size <integer>`, `--alpha_hidden_dim_size <integer>`, and `--beta_hidden_dim_size <integer>`.
For example `--alpha_hidden_dim_size 128` will tell RETAIN to use a GRU with 128-dimensional hidden layer for generating alpha.
4. Dropout is applied in two places: 1) to the input embedding, and 2) to the context vector c_i. The respective keep probabilities can be adjusted using `--keep_prob_embed <value between 0.0 and 1.0>` and `--keep_prob_context <value between 0.0 and 1.0>`. Dropout values affect the performance, so it is recommended to tune them for your data.
5. L2 regularizations can be applied to W_emb, w_alpha, W_beta, and w_output.
6. Additional options, such as the batch size and the number of epochs, can be specified. Detailed information can be accessed via `python retain.py --help`
7. My personal recommendation: use mild regularization (0.0001 ~ 0.001) on all four weights, and use moderate dropout on the context vector only. But this entirely depends on your data, so you should always tune the hyperparameters for yourself.
**STEP 5: Getting your results**
RETAIN checks the AUC of the validation set after each epoch, and if it is higher than all previous values, it will save the current model. The model file is generated by [numpy.savez_compressed](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.savez_compressed.html).
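Since the best model is written with `numpy.savez_compressed`, it can be inspected with `numpy.load`. A minimal sketch ('toy_model.npz' is a made-up filename; the toy parameter names mirror those used in retain.py):

```python
import numpy as np

# Save a toy "model" the same way retain.py does.
np.savez_compressed('toy_model.npz',
                    W_emb=np.zeros((10, 4)),
                    w_output=np.zeros((4, 1)))

model = np.load('toy_model.npz')  # dict-like access by parameter name
print(sorted(model.files))        # ['W_emb', 'w_output']
print(model['W_emb'].shape)       # (10, 4)
```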
**Step 6: Testing your model**
1. Using the file "test_retain.py", you can calculate the contributions of each medical code at each visit. First you need to have a trained model that was saved by numpy.savez_compressed. Note that you need to know the configuration with which you trained RETAIN (e.g. use of `--time_file`, use of `--use_log_time`.)
2. Again, you need the "visit file" and "label file" prepared in the same way. This time, however, you do not need to follow the ".train", ".valid", ".test" rule. The testing script will try to load the file name as given.
3. You also need the mapping information between the actual string medical codes and their integer codes.
(e.g. "Hypertension" is mapped to 24)
This file (let's call it the "mapping file") needs to be a Python cPickled dictionary where the keys are the string medical codes and the values are the corresponding integers.
(e.g. The mapping file generated by process_mimic.py is the ".types" file)
This file is required to print the contributions of each medical code in a user-friendly format.
4. For the additional options such as `--time_file` or `--use_log_time`, you should use exactly the same configuration with which you trained the model. For more detailed information, use "--help" option.
5. The minimum input to run the testing script is the "model file", "visit file", "label file", "mapping file", and "output file". "output file" is where the contributions will be stored.
`python test_retain.py <model file> <visit file> <label file> <mapping file> <output file>`
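The "mapping file" is just a pickled dict from string codes to integer ids. A minimal sketch of building one by hand, for the case where you did not use process_mimic.py (the codes and the filename 'my_mapping.types' are made up):

```python
import pickle

# String medical codes -> the integer ids used in the "visit file".
mapping = {'Hypertension': 24, 'D_428': 0, 'D_427': 1}

with open('my_mapping.types', 'wb') as f:
    pickle.dump(mapping, f, -1)

# The testing script inverts this mapping to print contributions
# in a human-readable format.
reverse = {v: k for k, v in mapping.items()}
print(reverse[24])  # Hypertension
```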
================================================
FILE: process_mimic.py
================================================
# This script processes MIMIC-III dataset and builds longitudinal diagnosis records for patients with at least two visits.
# The output data are cPickled, and suitable for training Doctor AI or RETAIN
# Written by Edward Choi (mp2893@gatech.edu)
# Usage: Put this script in the folder where the MIMIC-III CSV files are located. Then execute the command below.
# python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv PATIENTS.csv <output file>
# Output files
# <output file>.pids: List of unique Patient IDs. Used for intermediate processing
# <output file>.morts: List of binary values indicating the mortality of each patient
# <output file>.dates: List of List of Python datetime objects. The outer List is for each patient. The inner List is for each visit made by each patient
# <output file>.seqs: List of List of List of integer diagnosis codes. The outer List is for each patient. The middle List contains visits made by each patient. The inner List contains the integer diagnosis codes that occurred in each visit
# <output file>.types: Python dictionary that maps string diagnosis codes to integer diagnosis codes.
import sys
import cPickle as pickle
from datetime import datetime
def convert_to_icd9(dxStr):
    if dxStr.startswith('E'):
        if len(dxStr) > 4: return dxStr[:4] + '.' + dxStr[4:]
        else: return dxStr
    else:
        if len(dxStr) > 3: return dxStr[:3] + '.' + dxStr[3:]
        else: return dxStr

def convert_to_3digit_icd9(dxStr):
    if dxStr.startswith('E'):
        if len(dxStr) > 4: return dxStr[:4]
        else: return dxStr
    else:
        if len(dxStr) > 3: return dxStr[:3]
        else: return dxStr
if __name__ == '__main__':
    admissionFile = sys.argv[1]
    diagnosisFile = sys.argv[2]
    patientsFile = sys.argv[3]
    outFile = sys.argv[4]

    print 'Collecting mortality information'
    pidDodMap = {}
    infd = open(patientsFile, 'r')
    infd.readline()
    for line in infd:
        tokens = line.strip().split(',')
        pid = int(tokens[1])
        dod_hosp = tokens[5]
        if len(dod_hosp) > 0:
            pidDodMap[pid] = 1
        else:
            pidDodMap[pid] = 0
    infd.close()

    print 'Building pid-admission mapping, admission-date mapping'
    pidAdmMap = {}
    admDateMap = {}
    infd = open(admissionFile, 'r')
    infd.readline()
    for line in infd:
        tokens = line.strip().split(',')
        pid = int(tokens[1])
        admId = int(tokens[2])
        admTime = datetime.strptime(tokens[3], '%Y-%m-%d %H:%M:%S')
        admDateMap[admId] = admTime
        if pid in pidAdmMap: pidAdmMap[pid].append(admId)
        else: pidAdmMap[pid] = [admId]
    infd.close()

    print 'Building admission-dxList mapping'
    admDxMap = {}
    admDxMap_3digit = {}
    infd = open(diagnosisFile, 'r')
    infd.readline()
    for line in infd:
        tokens = line.strip().split(',')
        admId = int(tokens[2])
        dxStr = 'D_' + convert_to_icd9(tokens[4][1:-1]) # Full-digit ICD9 codes; the 3-digit version is built in parallel below.
        dxStr_3digit = 'D_' + convert_to_3digit_icd9(tokens[4][1:-1])
        if admId in admDxMap:
            admDxMap[admId].append(dxStr)
        else:
            admDxMap[admId] = [dxStr]
        if admId in admDxMap_3digit:
            admDxMap_3digit[admId].append(dxStr_3digit)
        else:
            admDxMap_3digit[admId] = [dxStr_3digit]
    infd.close()

    print 'Building pid-sortedVisits mapping'
    pidSeqMap = {}
    pidSeqMap_3digit = {}
    for pid, admIdList in pidAdmMap.iteritems():
        if len(admIdList) < 2: continue
        sortedList = sorted([(admDateMap[admId], admDxMap[admId]) for admId in admIdList])
        pidSeqMap[pid] = sortedList
        sortedList_3digit = sorted([(admDateMap[admId], admDxMap_3digit[admId]) for admId in admIdList])
        pidSeqMap_3digit[pid] = sortedList_3digit

    print 'Building pids, dates, mortality_labels, strSeqs'
    pids = []
    dates = []
    seqs = []
    morts = []
    for pid, visits in pidSeqMap.iteritems():
        pids.append(pid)
        morts.append(pidDodMap[pid])
        seq = []
        date = []
        for visit in visits:
            date.append(visit[0])
            seq.append(visit[1])
        dates.append(date)
        seqs.append(seq)

    print 'Building pids, dates, strSeqs for 3digit ICD9 code'
    seqs_3digit = []
    for pid, visits in pidSeqMap_3digit.iteritems():
        seq = []
        for visit in visits:
            seq.append(visit[1])
        seqs_3digit.append(seq)

    print 'Converting strSeqs to intSeqs, and making types'
    types = {}
    newSeqs = []
    for patient in seqs:
        newPatient = []
        for visit in patient:
            newVisit = []
            for code in visit:
                if code in types:
                    newVisit.append(types[code])
                else:
                    types[code] = len(types)
                    newVisit.append(types[code])
            newPatient.append(newVisit)
        newSeqs.append(newPatient)

    print 'Converting strSeqs to intSeqs, and making types for 3digit ICD9 code'
    types_3digit = {}
    newSeqs_3digit = []
    for patient in seqs_3digit:
        newPatient = []
        for visit in patient:
            newVisit = []
            for code in set(visit):
                if code in types_3digit:
                    newVisit.append(types_3digit[code])
                else:
                    types_3digit[code] = len(types_3digit)
                    newVisit.append(types_3digit[code])
            newPatient.append(newVisit)
        newSeqs_3digit.append(newPatient)

    pickle.dump(pids, open(outFile+'.pids', 'wb'), -1)
    pickle.dump(dates, open(outFile+'.dates', 'wb'), -1)
    pickle.dump(morts, open(outFile+'.morts', 'wb'), -1)
    pickle.dump(newSeqs, open(outFile+'.seqs', 'wb'), -1)
    pickle.dump(types, open(outFile+'.types', 'wb'), -1)
    pickle.dump(newSeqs_3digit, open(outFile+'.3digitICD9.seqs', 'wb'), -1)
    pickle.dump(types_3digit, open(outFile+'.3digitICD9.types', 'wb'), -1)
================================================
FILE: retain.py
================================================
#################################################################
# Code written by Edward Choi (mp2893@gatech.edu)
# For bug report, please contact author using the email address
#################################################################
import sys, random
import numpy as np
import cPickle as pickle
from collections import OrderedDict
import argparse
import theano
import theano.tensor as T
from theano import config
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from sklearn.metrics import roc_auc_score
_TEST_RATIO = 0.2
_VALIDATION_RATIO = 0.1
def unzip(zipped):
    new_params = OrderedDict()
    for key, value in zipped.iteritems():
        new_params[key] = value.get_value()
    return new_params

def numpy_floatX(data):
    return np.asarray(data, dtype=config.floatX)

def get_random_weight(dim1, dim2, left=-0.1, right=0.1):
    return np.random.uniform(left, right, (dim1, dim2)).astype(config.floatX)

def load_embedding(infile):
    Wemb = np.array(pickle.load(open(infile, 'rb'))).astype(config.floatX)
    return Wemb
def init_params(options):
    params = OrderedDict()

    timeFile = options['timeFile']
    embFile = options['embFile']
    embDimSize = options['embDimSize']
    inputDimSize = options['inputDimSize']
    alphaHiddenDimSize = options['alphaHiddenDimSize']
    betaHiddenDimSize = options['betaHiddenDimSize']
    numClass = options['numClass']

    if len(embFile) > 0:
        print 'using external code embedding'
        params['W_emb'] = load_embedding(embFile)
        embDimSize = params['W_emb'].shape[1]
    else:
        print 'using randomly initialized code embedding'
        params['W_emb'] = get_random_weight(inputDimSize, embDimSize)

    gruInputDimSize = embDimSize
    if len(timeFile) > 0: gruInputDimSize = embDimSize + 1

    params['W_gru_a'] = get_random_weight(gruInputDimSize, 3*alphaHiddenDimSize)
    params['U_gru_a'] = get_random_weight(alphaHiddenDimSize, 3*alphaHiddenDimSize)
    params['b_gru_a'] = np.zeros(3 * alphaHiddenDimSize).astype(config.floatX)

    params['W_gru_b'] = get_random_weight(gruInputDimSize, 3*betaHiddenDimSize)
    params['U_gru_b'] = get_random_weight(betaHiddenDimSize, 3*betaHiddenDimSize)
    params['b_gru_b'] = np.zeros(3 * betaHiddenDimSize).astype(config.floatX)

    params['w_alpha'] = get_random_weight(alphaHiddenDimSize, 1)
    params['b_alpha'] = np.zeros(1).astype(config.floatX)
    params['W_beta'] = get_random_weight(betaHiddenDimSize, embDimSize)
    params['b_beta'] = np.zeros(embDimSize).astype(config.floatX)
    params['w_output'] = get_random_weight(embDimSize, numClass)
    params['b_output'] = np.zeros(numClass).astype(config.floatX)

    return params
def load_params(options):
    return np.load(options['modelFile'])

def init_tparams(params, options):
    tparams = OrderedDict()
    for key, value in params.iteritems():
        if not options['embFineTune'] and key == 'W_emb': continue
        tparams[key] = theano.shared(value, name=key)
    return tparams

def dropout_layer(state_before, use_noise, trng, keep_prob=0.5):
    proj = T.switch(
        use_noise,
        state_before * trng.binomial(state_before.shape, p=keep_prob, n=1, dtype=state_before.dtype) / keep_prob,
        state_before)
    return proj
def _slice(_x, n, dim):
    if _x.ndim == 3:
        return _x[:, :, n*dim:(n+1)*dim]
    return _x[:, n*dim:(n+1)*dim]

def gru_layer(tparams, emb, name, hiddenDimSize):
    timesteps = emb.shape[0]
    if emb.ndim == 3: n_samples = emb.shape[1]
    else: n_samples = 1

    def stepFn(wx, h, U_gru):
        uh = T.dot(h, U_gru)
        r = T.nnet.sigmoid(_slice(wx, 0, hiddenDimSize) + _slice(uh, 0, hiddenDimSize))
        z = T.nnet.sigmoid(_slice(wx, 1, hiddenDimSize) + _slice(uh, 1, hiddenDimSize))
        h_tilde = T.tanh(_slice(wx, 2, hiddenDimSize) + r * _slice(uh, 2, hiddenDimSize))
        h_new = z * h + ((1. - z) * h_tilde)
        return h_new

    Wx = T.dot(emb, tparams['W_gru_'+name]) + tparams['b_gru_'+name]
    results, updates = theano.scan(fn=stepFn, sequences=[Wx], outputs_info=T.alloc(numpy_floatX(0.0), n_samples, hiddenDimSize), non_sequences=[tparams['U_gru_'+name]], name='gru_layer', n_steps=timesteps)

    return results
def build_model(tparams, options, W_emb=None):
    keep_prob_emb = options['keepProbEmb']
    keep_prob_context = options['keepProbContext']
    alphaHiddenDimSize = options['alphaHiddenDimSize']
    betaHiddenDimSize = options['betaHiddenDimSize']

    trng = RandomStreams(1234)
    use_noise = theano.shared(numpy_floatX(0.))
    useTime = options['useTime']

    x = T.tensor3('x', dtype=config.floatX)
    t = T.matrix('t', dtype=config.floatX)
    y = T.vector('y', dtype=config.floatX)
    lengths = T.ivector('lengths')

    n_timesteps = x.shape[0]
    n_samples = x.shape[1]

    if options['embFineTune']: emb = T.dot(x, tparams['W_emb'])
    else: emb = T.dot(x, W_emb)

    if keep_prob_emb < 1.0: emb = dropout_layer(emb, use_noise, trng, keep_prob_emb)

    if useTime: temb = T.concatenate([emb, t.reshape([n_timesteps, n_samples, 1])], axis=2) # Adding the time element to the embedding
    else: temb = emb

    def attentionStep(att_timesteps):
        reverse_emb_t = temb[:att_timesteps][::-1]
        reverse_h_a = gru_layer(tparams, reverse_emb_t, 'a', alphaHiddenDimSize)[::-1] * 0.5
        reverse_h_b = gru_layer(tparams, reverse_emb_t, 'b', betaHiddenDimSize)[::-1] * 0.5

        preAlpha = T.dot(reverse_h_a, tparams['w_alpha']) + tparams['b_alpha']
        preAlpha = preAlpha.reshape((preAlpha.shape[0], preAlpha.shape[1]))
        alpha = (T.nnet.softmax(preAlpha.T)).T

        beta = T.tanh(T.dot(reverse_h_b, tparams['W_beta']) + tparams['b_beta'])
        c_t = (alpha[:,:,None] * beta * emb[:att_timesteps]).sum(axis=0)
        return c_t

    counts = T.arange(n_timesteps) + 1
    c_t, updates = theano.scan(fn=attentionStep, sequences=[counts], outputs_info=None, name='attention_layer', n_steps=n_timesteps)

    if keep_prob_context < 1.0: c_t = dropout_layer(c_t, use_noise, trng, keep_prob_context)

    preY = T.nnet.sigmoid(T.dot(c_t, tparams['w_output']) + tparams['b_output'])
    preY = preY.reshape((preY.shape[0], preY.shape[1]))

    indexRow = T.arange(n_samples)
    y_hat = preY.T[indexRow, lengths - 1]

    logEps = options['logEps']
    cross_entropy = -(y * T.log(y_hat + logEps) + (1. - y) * T.log(1. - y_hat + logEps))
    cost_noreg = T.mean(cross_entropy)

    cost = cost_noreg + options['L2_output'] * (tparams['w_output']**2).sum() + options['L2_alpha'] * (tparams['w_alpha']**2).sum() + options['L2_beta'] * (tparams['W_beta']**2).sum()
    if options['embFineTune']: cost += options['L2_emb'] * (tparams['W_emb']**2).sum()

    if useTime: return use_noise, x, y, t, lengths, cost_noreg, cost, y_hat
    else: return use_noise, x, y, lengths, cost_noreg, cost, y_hat
def adadelta(tparams, grads, x, y, lengths, cost, options, t=None):
    zipped_grads = [theano.shared(p.get_value() * numpy_floatX(0.), name='%s_grad' % k) for k, p in tparams.iteritems()]
    running_up2 = [theano.shared(p.get_value() * numpy_floatX(0.), name='%s_rup2' % k) for k, p in tparams.iteritems()]
    running_grads2 = [theano.shared(p.get_value() * numpy_floatX(0.), name='%s_rgrad2' % k) for k, p in tparams.iteritems()]

    zgup = [(zg, g) for zg, g in zip(zipped_grads, grads)]
    rg2up = [(rg2, 0.95 * rg2 + 0.05 * (g ** 2)) for rg2, g in zip(running_grads2, grads)]

    if len(options['timeFile']) > 0:
        f_grad_shared = theano.function([x, y, t, lengths], cost, updates=zgup + rg2up, name='adadelta_f_grad_shared')
    else:
        f_grad_shared = theano.function([x, y, lengths], cost, updates=zgup + rg2up, name='adadelta_f_grad_shared')

    updir = [-T.sqrt(ru2 + 1e-6) / T.sqrt(rg2 + 1e-6) * zg for zg, ru2, rg2 in zip(zipped_grads, running_up2, running_grads2)]
    ru2up = [(ru2, 0.95 * ru2 + 0.05 * (ud ** 2)) for ru2, ud in zip(running_up2, updir)]
    param_up = [(p, p + ud) for p, ud in zip(tparams.values(), updir)]

    f_update = theano.function([], [], updates=ru2up + param_up, on_unused_input='ignore', name='adadelta_f_update')

    return f_grad_shared, f_update
def adam(cost, tparams, lr=0.0002, b1=0.1, b2=0.001, e=1e-8):
    updates = []
    grads = T.grad(cost, wrt=tparams.values())
    i = theano.shared(numpy_floatX(0.))
    i_t = i + 1.
    fix1 = 1. - (1. - b1)**i_t
    fix2 = 1. - (1. - b2)**i_t
    lr_t = lr * (T.sqrt(fix2) / fix1)
    for p, g in zip(tparams.values(), grads):
        m = theano.shared(p.get_value() * 0.)
        v = theano.shared(p.get_value() * 0.)
        m_t = (b1 * g) + ((1. - b1) * m)
        v_t = (b2 * T.sqr(g)) + ((1. - b2) * v)
        g_t = m_t / (T.sqrt(v_t) + e)
        p_t = p - (lr_t * g_t)
        updates.append((m, m_t))
        updates.append((v, v_t))
        updates.append((p, p_t))
    updates.append((i, i_t))
    return updates
def padMatrixWithTime(seqs, times, options):
    lengths = np.array([len(seq) for seq in seqs]).astype('int32')
    n_samples = len(seqs)
    maxlen = np.max(lengths)

    x = np.zeros((maxlen, n_samples, options['inputDimSize'])).astype(config.floatX)
    t = np.zeros((maxlen, n_samples)).astype(config.floatX)
    for idx, (seq, time) in enumerate(zip(seqs, times)):
        for xvec, subseq in zip(x[:,idx,:], seq):
            xvec[subseq] = 1.
        t[:lengths[idx], idx] = time

    if options['useLogTime']: t = np.log(t + 1.)

    return x, t, lengths

def padMatrixWithoutTime(seqs, options):
    lengths = np.array([len(seq) for seq in seqs]).astype('int32')
    n_samples = len(seqs)
    maxlen = np.max(lengths)

    x = np.zeros((maxlen, n_samples, options['inputDimSize'])).astype(config.floatX)
    for idx, seq in enumerate(seqs):
        for xvec, subseq in zip(x[:,idx,:], seq):
            xvec[subseq] = 1.

    return x, lengths
def load_data_simple(seqFile, labelFile, timeFile=''):
    sequences = np.array(pickle.load(open(seqFile, 'rb')))
    labels = np.array(pickle.load(open(labelFile, 'rb')))
    if len(timeFile) > 0:
        times = np.array(pickle.load(open(timeFile, 'rb')))

    dataSize = len(labels)
    np.random.seed(0)
    ind = np.random.permutation(dataSize)
    nTest = int(_TEST_RATIO * dataSize)
    nValid = int(_VALIDATION_RATIO * dataSize)

    test_indices = ind[:nTest]
    valid_indices = ind[nTest:nTest+nValid]
    train_indices = ind[nTest+nValid:]

    train_set_x = sequences[train_indices]
    train_set_y = labels[train_indices]
    test_set_x = sequences[test_indices]
    test_set_y = labels[test_indices]
    valid_set_x = sequences[valid_indices]
    valid_set_y = labels[valid_indices]
    train_set_t = None
    test_set_t = None
    valid_set_t = None

    if len(timeFile) > 0:
        train_set_t = times[train_indices]
        test_set_t = times[test_indices]
        valid_set_t = times[valid_indices]

    def len_argsort(seq):
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))

    train_sorted_index = len_argsort(train_set_x)
    train_set_x = [train_set_x[i] for i in train_sorted_index]
    train_set_y = [train_set_y[i] for i in train_sorted_index]

    valid_sorted_index = len_argsort(valid_set_x)
    valid_set_x = [valid_set_x[i] for i in valid_sorted_index]
    valid_set_y = [valid_set_y[i] for i in valid_sorted_index]

    test_sorted_index = len_argsort(test_set_x)
    test_set_x = [test_set_x[i] for i in test_sorted_index]
    test_set_y = [test_set_y[i] for i in test_sorted_index]

    if len(timeFile) > 0:
        train_set_t = [train_set_t[i] for i in train_sorted_index]
        valid_set_t = [valid_set_t[i] for i in valid_sorted_index]
        test_set_t = [test_set_t[i] for i in test_sorted_index]

    train_set = (train_set_x, train_set_y, train_set_t)
    valid_set = (valid_set_x, valid_set_y, valid_set_t)
    test_set = (test_set_x, test_set_y, test_set_t)

    return train_set, valid_set, test_set
def load_data(seqFile, labelFile, timeFile):
train_set_x = pickle.load(open(seqFile+'.train', 'rb'))
valid_set_x = pickle.load(open(seqFile+'.valid', 'rb'))
test_set_x = pickle.load(open(seqFile+'.test', 'rb'))
train_set_y = pickle.load(open(labelFile+'.train', 'rb'))
valid_set_y = pickle.load(open(labelFile+'.valid', 'rb'))
test_set_y = pickle.load(open(labelFile+'.test', 'rb'))
train_set_t = None
valid_set_t = None
test_set_t = None
if len(timeFile) > 0:
train_set_t = pickle.load(open(timeFile+'.train', 'rb'))
valid_set_t = pickle.load(open(timeFile+'.valid', 'rb'))
test_set_t = pickle.load(open(timeFile+'.test', 'rb'))
def len_argsort(seq):
return sorted(range(len(seq)), key=lambda x: len(seq[x]))
train_sorted_index = len_argsort(train_set_x)
train_set_x = [train_set_x[i] for i in train_sorted_index]
train_set_y = [train_set_y[i] for i in train_sorted_index]
valid_sorted_index = len_argsort(valid_set_x)
valid_set_x = [valid_set_x[i] for i in valid_sorted_index]
valid_set_y = [valid_set_y[i] for i in valid_sorted_index]
test_sorted_index = len_argsort(test_set_x)
test_set_x = [test_set_x[i] for i in test_sorted_index]
test_set_y = [test_set_y[i] for i in test_sorted_index]
if len(timeFile) > 0:
train_set_t = [train_set_t[i] for i in train_sorted_index]
valid_set_t = [valid_set_t[i] for i in valid_sorted_index]
test_set_t = [test_set_t[i] for i in test_sorted_index]
train_set = (train_set_x, train_set_y, train_set_t)
valid_set = (valid_set_x, valid_set_y, valid_set_t)
test_set = (test_set_x, test_set_y, test_set_t)
return train_set, valid_set, test_set
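Both loaders above sort each split by visit-sequence length via `len_argsort`, so minibatches group patients with similar numbers of visits and padding waste stays small. A minimal sketch of that reordering with toy sequences (names and data hypothetical):

```python
def len_argsort(seq):
    # indices that order the sequences from shortest to longest
    return sorted(range(len(seq)), key=lambda i: len(seq[i]))

# toy patient records: each patient is a list of visits
seqs = [[[1, 2], [3]], [[4]], [[5], [6], [7, 8]]]
labels = [1, 0, 1]

order = len_argsort(seqs)            # shortest patient comes first
seqs = [seqs[i] for i in order]      # reorder sequences
labels = [labels[i] for i in order]  # reorder labels in lockstep
```

Sorting x, y (and t) with the same index list keeps each patient aligned with their label across all three tuples.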
def calculate_auc(test_model, dataset, options):
batchSize = options['batchSize']
useTime = options['useTime']
n_batches = int(np.ceil(float(len(dataset[0])) / float(batchSize)))
scoreVec = []
for index in xrange(n_batches):
batchX = dataset[0][index*batchSize:(index+1)*batchSize]
if useTime:
batchT = dataset[2][index*batchSize:(index+1)*batchSize]
x, t, lengths = padMatrixWithTime(batchX, batchT, options)
scores = test_model(x, t, lengths)
else:
x, lengths = padMatrixWithoutTime(batchX, options)
scores = test_model(x, lengths)
scoreVec.extend(list(scores))
labels = dataset[1]
auc = roc_auc_score(list(labels), list(scoreVec))
return auc
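`calculate_auc` scores the dataset batch by batch, concatenates the per-batch predictions into one vector, and only then computes a single ROC AUC over the whole split (via scikit-learn's `roc_auc_score`). As a sketch, the pairwise definition of AUC — the probability that a random positive outranks a random negative — applied to scores gathered the same way (toy data):

```python
def auc_pairwise(labels, scores):
    # AUC = P(score(pos) > score(neg)); ties count half
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# scores accumulated batch by batch, as scoreVec.extend(...) does above
batch1 = [0.9, 0.2]
batch2 = [0.6, 0.4]
score_vec = []
score_vec.extend(batch1)
score_vec.extend(batch2)
labels = [1, 0, 1, 0]
auc = auc_pairwise(labels, score_vec)  # perfect ranking -> 1.0
```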
def calculate_cost(test_model, dataset, options):
batchSize = options['batchSize']
useTime = options['useTime']
costSum = 0.0
dataCount = 0
n_batches = int(np.ceil(float(len(dataset[0])) / float(batchSize)))
for index in xrange(n_batches):
batchX = dataset[0][index*batchSize:(index+1)*batchSize]
if useTime:
batchT = dataset[2][index*batchSize:(index+1)*batchSize]
x, t, lengths = padMatrixWithTime(batchX, batchT, options)
y = np.array(dataset[1][index*batchSize:(index+1)*batchSize]).astype(config.floatX)
scores = test_model(x, y, t, lengths)
else:
x, lengths = padMatrixWithoutTime(batchX, options)
y = np.array(dataset[1][index*batchSize:(index+1)*batchSize]).astype(config.floatX)
scores = test_model(x, y, lengths)
costSum += scores * len(batchX)
dataCount += len(batchX)
return costSum / dataCount
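`calculate_cost` multiplies each batch's mean cost by the batch size before summing, then divides by the total example count; a plain mean over batch costs would over-weight the final, typically smaller, batch. A toy check of that bookkeeping (values hypothetical):

```python
# per-batch mean costs and batch sizes (the last batch is smaller)
batch_costs = [0.50, 0.30, 0.70]
batch_sizes = [100, 100, 40]

cost_sum = sum(c * n for c, n in zip(batch_costs, batch_sizes))
data_count = sum(batch_sizes)
mean_cost = cost_sum / data_count  # size-weighted mean over all examples
```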
def print2file(buf, outFile):
outfd = open(outFile, 'a')
outfd.write(buf + '\n')
outfd.close()
def train_RETAIN(
seqFile='seqFile.txt',
inputDimSize=20000,
labelFile='labelFile.txt',
numClass=1,
outFile='outFile.txt',
timeFile='',
	modelFile='',
	useLogTime=True,
	embFile='',
embDimSize=128,
embFineTune=True,
alphaHiddenDimSize=128,
betaHiddenDimSize=128,
batchSize=100,
max_epochs=10,
L2_output=0.001,
L2_emb=0.001,
L2_alpha=0.001,
L2_beta=0.001,
keepProbEmb=0.5,
keepProbContext=0.5,
logEps=1e-8,
solver='adadelta',
simpleLoad=False,
verbose=False
):
options = locals().copy()
if len(timeFile) > 0: useTime = True
else: useTime = False
options['useTime'] = useTime
print 'Initializing the parameters ... ',
params = init_params(options)
if len(modelFile) > 0: params = load_params(options)
tparams = init_tparams(params, options)
print 'Building the model ... ',
if useTime and embFineTune:
print 'using time information, fine-tuning code representations'
use_noise, x, y, t, lengths, cost_noreg, cost, y_hat = build_model(tparams, options)
if solver=='adadelta':
grads = T.grad(cost, wrt=tparams.values())
f_grad_shared, f_update = adadelta(tparams, grads, x, y, lengths, cost, options, t)
elif solver=='adam':
updates = adam(cost, tparams)
update_model = theano.function(inputs=[x, y, t, lengths], outputs=cost, updates=updates, name='update_model')
get_prediction = theano.function(inputs=[x, t, lengths], outputs=y_hat, name='get_prediction')
get_cost = theano.function(inputs=[x, y, t, lengths], outputs=cost_noreg, name='get_cost')
elif useTime and not embFineTune:
print 'using time information, not fine-tuning code representations'
W_emb = theano.shared(params['W_emb'], name='W_emb')
use_noise, x, y, t, lengths, cost_noreg, cost, y_hat = build_model(tparams, options, W_emb)
if solver=='adadelta':
grads = T.grad(cost, wrt=tparams.values())
f_grad_shared, f_update = adadelta(tparams, grads, x, y, lengths, cost, options, t)
elif solver=='adam':
updates = adam(cost, tparams)
update_model = theano.function(inputs=[x, y, t, lengths], outputs=cost, updates=updates, name='update_model')
get_prediction = theano.function(inputs=[x, t, lengths], outputs=y_hat, name='get_prediction')
get_cost = theano.function(inputs=[x, y, t, lengths], outputs=cost_noreg, name='get_cost')
elif not useTime and embFineTune:
print 'not using time information, fine-tuning code representations'
use_noise, x, y, lengths, cost_noreg, cost, y_hat = build_model(tparams, options)
if solver=='adadelta':
grads = T.grad(cost, wrt=tparams.values())
f_grad_shared, f_update = adadelta(tparams, grads, x, y, lengths, cost, options)
elif solver=='adam':
updates = adam(cost, tparams)
update_model = theano.function(inputs=[x, y, lengths], outputs=cost, updates=updates, name='update_model')
get_prediction = theano.function(inputs=[x, lengths], outputs=y_hat, name='get_prediction')
get_cost = theano.function(inputs=[x, y, lengths], outputs=cost_noreg, name='get_cost')
elif not useTime and not embFineTune:
print 'not using time information, not fine-tuning code representations'
W_emb = theano.shared(params['W_emb'], name='W_emb')
use_noise, x, y, lengths, cost_noreg, cost, y_hat = build_model(tparams, options, W_emb)
if solver=='adadelta':
grads = T.grad(cost, wrt=tparams.values())
f_grad_shared, f_update = adadelta(tparams, grads, x, y, lengths, cost, options)
elif solver=='adam':
updates = adam(cost, tparams)
update_model = theano.function(inputs=[x, y, lengths], outputs=cost, updates=updates, name='update_model')
get_prediction = theano.function(inputs=[x, lengths], outputs=y_hat, name='get_prediction')
get_cost = theano.function(inputs=[x, y, lengths], outputs=cost_noreg, name='get_cost')
print 'Loading data ... ',
if simpleLoad:
trainSet, validSet, testSet = load_data_simple(seqFile, labelFile, timeFile)
else:
trainSet, validSet, testSet = load_data(seqFile, labelFile, timeFile)
n_batches = int(np.ceil(float(len(trainSet[0])) / float(batchSize)))
print 'done'
bestValidAuc = 0.0
bestTestAuc = 0.0
bestValidEpoch = 0
logFile = outFile + '.log'
print 'Optimization start !!'
for epoch in xrange(max_epochs):
iteration = 0
costVector = []
for index in random.sample(range(n_batches), n_batches):
use_noise.set_value(1.)
batchX = trainSet[0][index*batchSize:(index+1)*batchSize]
y = np.array(trainSet[1][index*batchSize:(index+1)*batchSize]).astype(config.floatX)
if useTime:
batchT = trainSet[2][index*batchSize:(index+1)*batchSize]
x, t, lengths = padMatrixWithTime(batchX, batchT, options)
if solver=='adadelta':
costValue = f_grad_shared(x, y, t, lengths)
f_update()
elif solver=='adam':
costValue = update_model(x, y, t, lengths)
else:
x, lengths = padMatrixWithoutTime(batchX, options)
if solver=='adadelta':
costValue = f_grad_shared(x, y, lengths)
f_update()
elif solver=='adam':
costValue = update_model(x, y, lengths)
costVector.append(costValue)
if (iteration % 10 == 0) and verbose:
print 'Epoch:%d, Iteration:%d/%d, Train_Cost:%f' % (epoch, iteration, n_batches, costValue)
iteration += 1
use_noise.set_value(0.)
trainCost = np.mean(costVector)
validAuc = calculate_auc(get_prediction, validSet, options)
buf = 'Epoch:%d, Train_cost:%f, Validation_AUC:%f' % (epoch, trainCost, validAuc)
print buf
print2file(buf, logFile)
if validAuc > bestValidAuc:
bestValidAuc = validAuc
bestValidEpoch = epoch
bestTestAuc = calculate_auc(get_prediction, testSet, options)
buf = 'Currently the best validation AUC found. Test AUC:%f at epoch:%d' % (bestTestAuc, epoch)
print buf
print2file(buf, logFile)
tempParams = unzip(tparams)
np.savez_compressed(outFile + '.' + str(epoch), **tempParams)
buf = 'The best validation & test AUC:%f, %f at epoch:%d' % (bestValidAuc, bestTestAuc, bestValidEpoch)
print buf
print2file(buf, logFile)
def parse_arguments(parser):
parser.add_argument('seq_file', type=str, metavar='<visit_file>', help='The path to the Pickled file containing visit information of patients')
parser.add_argument('n_input_codes', type=int, metavar='<n_input_codes>', help='The number of unique input medical codes')
parser.add_argument('label_file', type=str, metavar='<label_file>', help='The path to the Pickled file containing label information of patients')
#parser.add_argument('n_output_codes', type=int, metavar='<n_output_codes>', help='The number of unique label medical codes')
parser.add_argument('out_file', metavar='<out_file>', help='The path to the output models. The models will be saved after every epoch')
parser.add_argument('--time_file', type=str, default='', help='The path to the Pickled file containing durations between visits of patients. If you are not using duration information, do not use this option')
parser.add_argument('--model_file', type=str, default='', help='The path to the Numpy-compressed file containing the model parameters. Use this option if you want to re-train an existing model')
parser.add_argument('--use_log_time', type=int, default=1, choices=[0,1], help='Use logarithm of time duration to dampen the impact of the outliers (0 for false, 1 for true) (default value: 1)')
parser.add_argument('--embed_file', type=str, default='', help='The path to the Pickled file containing the representation vectors of medical codes. If you are not using medical code representations, do not use this option')
parser.add_argument('--embed_size', type=int, default=128, help='The size of the visit embedding. If you are not providing your own medical code vectors, you can specify this value (default value: 128)')
	parser.add_argument('--embed_finetune', type=int, default=1, choices=[0,1], help='If you are using randomly initialized code representations, always use this option. If you are using external medical code representations and want to fine-tune them as you train RETAIN, use this option (0 for false, 1 for true) (default value: 1)')
parser.add_argument('--alpha_hidden_dim_size', type=int, default=128, help='The size of the hidden layers of the GRU responsible for generating alpha weights (default value: 128)')
parser.add_argument('--beta_hidden_dim_size', type=int, default=128, help='The size of the hidden layers of the GRU responsible for generating beta weights (default value: 128)')
parser.add_argument('--batch_size', type=int, default=100, help='The size of a single mini-batch (default value: 100)')
parser.add_argument('--n_epochs', type=int, default=10, help='The number of training epochs (default value: 10)')
parser.add_argument('--L2_output', type=float, default=0.001, help='L2 regularization for the final classifier weight w (default value: 0.001)')
parser.add_argument('--L2_emb', type=float, default=0.001, help='L2 regularization for the input embedding weight W_emb (default value: 0.001)')
	parser.add_argument('--L2_alpha', type=float, default=0.001, help='L2 regularization for the alpha generating weight w_alpha (default value: 0.001)')
	parser.add_argument('--L2_beta', type=float, default=0.001, help='L2 regularization for the beta generating weight W_beta (default value: 0.001)')
	parser.add_argument('--keep_prob_emb', type=float, default=0.5, help='The keep probability of the dropout applied between the embedded input and the alpha & beta generation process (default value: 0.5)')
	parser.add_argument('--keep_prob_context', type=float, default=0.5, help='The keep probability of the dropout applied between the context vector c_i and the final classifier (default value: 0.5)')
parser.add_argument('--log_eps', type=float, default=1e-8, help='A small value to prevent log(0) (default value: 1e-8)')
parser.add_argument('--solver', type=str, default='adadelta', choices=['adadelta','adam'], help='Select which solver to train RETAIN: adadelta, or adam. (default: adadelta)')
	parser.add_argument('--simple_load', action='store_true', help='Use an alternative way to load the dataset. Instead of requiring separate training, validation, and test sets, this automatically splits a single dataset into the three. (default false)')
	parser.add_argument('--verbose', action='store_true', help='Print the training cost after every 10 mini-batches (default false)')
args = parser.parse_args()
return args
if __name__ == '__main__':
parser = argparse.ArgumentParser()
args = parse_arguments(parser)
train_RETAIN(
seqFile=args.seq_file,
inputDimSize=args.n_input_codes,
labelFile=args.label_file,
#numClass=args.n_output_codes,
outFile=args.out_file,
timeFile=args.time_file,
modelFile=args.model_file,
useLogTime=args.use_log_time,
embFile=args.embed_file,
embDimSize=args.embed_size,
embFineTune=args.embed_finetune,
alphaHiddenDimSize=args.alpha_hidden_dim_size,
betaHiddenDimSize=args.beta_hidden_dim_size,
batchSize=args.batch_size,
max_epochs=args.n_epochs,
L2_output=args.L2_output,
L2_emb=args.L2_emb,
L2_alpha=args.L2_alpha,
L2_beta=args.L2_beta,
keepProbEmb=args.keep_prob_emb,
keepProbContext=args.keep_prob_context,
logEps=args.log_eps,
solver=args.solver,
simpleLoad=args.simple_load,
verbose=args.verbose
)
================================================
FILE: test_retain.py
================================================
#################################################################
# Code written by Edward Choi (mp2893@gatech.edu)
# For bug report, please contact author using the email address
#################################################################
import sys, random
import numpy as np
import cPickle as pickle
from collections import OrderedDict
import argparse
import theano
import theano.tensor as T
from theano import config
def sigmoid(x):
return 1. / (1. + np.exp(-x))
def numpy_floatX(data):
return np.asarray(data, dtype=config.floatX)
def load_embedding(infile):
Wemb = np.array(pickle.load(open(infile, 'rb'))).astype(config.floatX)
return Wemb
def load_params(options):
params = OrderedDict()
weights = np.load(options['modelFile'])
for k,v in weights.iteritems():
params[k] = v
if len(options['embFile']) > 0: params['W_emb'] = np.array(pickle.load(open(options['embFile'], 'rb'))).astype(config.floatX)
return params
def init_tparams(params, options):
tparams = OrderedDict()
for key, value in params.iteritems():
tparams[key] = theano.shared(value, name=key)
return tparams
def _slice(_x, n, dim):
if _x.ndim == 3:
return _x[:, :, n*dim:(n+1)*dim]
return _x[:, n*dim:(n+1)*dim]
def gru_layer(tparams, emb, name, hiddenDimSize):
timesteps = emb.shape[0]
if emb.ndim == 3: n_samples = emb.shape[1]
else: n_samples = 1
def stepFn(wx, h, U_gru):
uh = T.dot(h, U_gru)
r = T.nnet.sigmoid(_slice(wx, 0, hiddenDimSize) + _slice(uh, 0, hiddenDimSize))
z = T.nnet.sigmoid(_slice(wx, 1, hiddenDimSize) + _slice(uh, 1, hiddenDimSize))
h_tilde = T.tanh(_slice(wx, 2, hiddenDimSize) + r * _slice(uh, 2, hiddenDimSize))
h_new = z * h + ((1. - z) * h_tilde)
return h_new
Wx = T.dot(emb, tparams['W_gru_'+name]) + tparams['b_gru_'+name]
results, updates = theano.scan(fn=stepFn, sequences=[Wx], outputs_info=T.alloc(numpy_floatX(0.0), n_samples, hiddenDimSize), non_sequences=[tparams['U_gru_'+name]], name='gru_layer', n_steps=timesteps)
return results
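`gru_layer` precomputes `Wx = emb @ W_gru + b_gru` once for all timesteps, then scans `stepFn` over time; each step slices the concatenated reset/update/candidate pre-activations out of a single `(..., 3*hiddenDimSize)` matrix via `_slice`. A NumPy sketch of one such step with random toy weights (shapes and names are illustrative, not the trained parameters):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(wx, h, U, dim):
    # wx: precomputed x_t @ W + b for one timestep, shape (n, 3*dim),
    # holding reset/update/candidate pre-activations side by side
    uh = h @ U                                              # (n, 3*dim)
    r = sigmoid(wx[:, 0:dim] + uh[:, 0:dim])                # reset gate
    z = sigmoid(wx[:, dim:2*dim] + uh[:, dim:2*dim])        # update gate
    h_tilde = np.tanh(wx[:, 2*dim:] + r * uh[:, 2*dim:])    # candidate
    return z * h + (1.0 - z) * h_tilde                      # interpolate

rng = np.random.default_rng(0)
dim, n_samples = 4, 2
wx = rng.standard_normal((n_samples, 3 * dim))
U = rng.standard_normal((dim, 3 * dim))   # hidden-to-gates weights
h0 = np.zeros((n_samples, dim))           # initial hidden state
h1 = gru_step(wx, h0, U, dim)
```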
def build_model(tparams, options):
alphaHiddenDimSize = options['alphaHiddenDimSize']
betaHiddenDimSize = options['betaHiddenDimSize']
x = T.tensor3('x', dtype=config.floatX)
reverse_emb_t = x[::-1]
reverse_h_a = gru_layer(tparams, reverse_emb_t, 'a', alphaHiddenDimSize)[::-1] * 0.5
reverse_h_b = gru_layer(tparams, reverse_emb_t, 'b', betaHiddenDimSize)[::-1] * 0.5
preAlpha = T.dot(reverse_h_a, tparams['w_alpha']) + tparams['b_alpha']
preAlpha = preAlpha.reshape((preAlpha.shape[0], preAlpha.shape[1]))
alpha = (T.nnet.softmax(preAlpha.T)).T
beta = T.tanh(T.dot(reverse_h_b, tparams['W_beta']) + tparams['b_beta'])
return x, alpha, beta
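The `(T.nnet.softmax(preAlpha.T)).T` expression above exists because Theano's softmax normalizes over the last axis: `preAlpha` is (timesteps, n_samples), so the transpose trick makes each patient's alpha weights sum to 1 across *time* rather than across patients. A NumPy sketch of the same trick (toy values):

```python
import numpy as np

def softmax_rows(m):
    # numerically stable row-wise softmax, like Theano's T.nnet.softmax
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

pre_alpha = np.array([[0.1, 2.0],
                      [1.0, 0.0],
                      [2.0, -1.0]])      # (timesteps=3, n_samples=2)

alpha = softmax_rows(pre_alpha.T).T      # normalize over timesteps
```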
def padMatrixWithTime(seqs, times, options):
lengths = np.array([len(seq) for seq in seqs]).astype('int32')
n_samples = len(seqs)
maxlen = np.max(lengths)
x = np.zeros((maxlen, n_samples, options['inputDimSize'])).astype(config.floatX)
t = np.zeros((maxlen, n_samples)).astype(config.floatX)
for idx, (seq,time) in enumerate(zip(seqs,times)):
for xvec, subseq in zip(x[:,idx,:], seq):
xvec[subseq] = 1.
t[:lengths[idx], idx] = time
if options['useLogTime']: t = np.log(t + 1.)
return x, t, lengths
def padMatrixWithoutTime(seqs, options):
lengths = np.array([len(seq) for seq in seqs]).astype('int32')
n_samples = len(seqs)
maxlen = np.max(lengths)
x = np.zeros((maxlen, n_samples, options['inputDimSize'])).astype(config.floatX)
for idx, seq in enumerate(seqs):
for xvec, subseq in zip(x[:,idx,:], seq):
xvec[subseq] = 1.
return x, lengths
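`padMatrixWithoutTime` turns ragged visit sequences into a dense time-major tensor of shape (maxlen, n_samples, inputDimSize): each visit becomes a multi-hot row over the code vocabulary, and patients with fewer visits are zero-padded. A toy NumPy reproduction of that layout (vocabulary size is a made-up value):

```python
import numpy as np

# toy visit sequences: each visit is a list of medical-code indices
seqs = [[[0, 2], [1]], [[3]]]   # two patients, with 2 and 1 visits
input_dim = 5                   # toy vocabulary size

lengths = np.array([len(s) for s in seqs], dtype='int32')
maxlen = lengths.max()
# time-major tensor: (maxlen, n_samples, input_dim)
x = np.zeros((maxlen, len(seqs), input_dim), dtype='float32')
for idx, seq in enumerate(seqs):
    for xvec, visit in zip(x[:, idx, :], seq):
        xvec[visit] = 1.0       # multi-hot encode this visit's codes
```

Patient 1 has only one visit, so its second timestep stays all-zero padding; `lengths` records the true visit counts.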
def load_data_debug(seqFile, labelFile, timeFile=''):
sequences = np.array(pickle.load(open(seqFile, 'rb')))
labels = np.array(pickle.load(open(labelFile, 'rb')))
if len(timeFile) > 0:
times = np.array(pickle.load(open(timeFile, 'rb')))
dataSize = len(labels)
np.random.seed(0)
ind = np.random.permutation(dataSize)
nTest = int(0.15 * dataSize)
nValid = int(0.10 * dataSize)
test_indices = ind[:nTest]
valid_indices = ind[nTest:nTest+nValid]
train_indices = ind[nTest+nValid:]
train_set_x = sequences[train_indices]
train_set_y = labels[train_indices]
test_set_x = sequences[test_indices]
test_set_y = labels[test_indices]
valid_set_x = sequences[valid_indices]
valid_set_y = labels[valid_indices]
train_set_t = None
test_set_t = None
valid_set_t = None
if len(timeFile) > 0:
train_set_t = times[train_indices]
test_set_t = times[test_indices]
valid_set_t = times[valid_indices]
def len_argsort(seq):
return sorted(range(len(seq)), key=lambda x: len(seq[x]))
train_sorted_index = len_argsort(train_set_x)
train_set_x = [train_set_x[i] for i in train_sorted_index]
train_set_y = [train_set_y[i] for i in train_sorted_index]
valid_sorted_index = len_argsort(valid_set_x)
valid_set_x = [valid_set_x[i] for i in valid_sorted_index]
valid_set_y = [valid_set_y[i] for i in valid_sorted_index]
test_sorted_index = len_argsort(test_set_x)
test_set_x = [test_set_x[i] for i in test_sorted_index]
test_set_y = [test_set_y[i] for i in test_sorted_index]
if len(timeFile) > 0:
train_set_t = [train_set_t[i] for i in train_sorted_index]
valid_set_t = [valid_set_t[i] for i in valid_sorted_index]
test_set_t = [test_set_t[i] for i in test_sorted_index]
train_set = (train_set_x, train_set_y, train_set_t)
valid_set = (valid_set_x, valid_set_y, valid_set_t)
test_set = (test_set_x, test_set_y, test_set_t)
return train_set, valid_set, test_set
def load_data(dataFile, labelFile, timeFile):
test_set_x = np.array(pickle.load(open(dataFile, 'rb')))
test_set_y = np.array(pickle.load(open(labelFile, 'rb')))
test_set_t = None
if len(timeFile) > 0:
test_set_t = np.array(pickle.load(open(timeFile, 'rb')))
def len_argsort(seq):
return sorted(range(len(seq)), key=lambda x: len(seq[x]))
sorted_index = len_argsort(test_set_x)
test_set_x = [test_set_x[i] for i in sorted_index]
test_set_y = [test_set_y[i] for i in sorted_index]
if len(timeFile) > 0:
test_set_t = [test_set_t[i] for i in sorted_index]
test_set = (test_set_x, test_set_y, test_set_t)
return test_set
def print2file(buf, outFile):
outfd = open(outFile, 'a')
outfd.write(buf + '\n')
outfd.close()
def train_RETAIN(
modelFile='model.npz',
seqFile='seqFile.txt',
labelFile='labelFile.txt',
outFile='outFile.txt',
	timeFile='',
	typeFile='types.txt',
	useLogTime=True,
	embFile='',
	logEps=1e-8
):
options = locals().copy()
if len(timeFile) > 0: useTime = True
else: useTime = False
options['useTime'] = useTime
if len(embFile) > 0: useFixedEmb = True
else: useFixedEmb = False
options['useFixedEmb'] = useFixedEmb
print 'Loading the parameters ... ',
params = load_params(options)
tparams = init_tparams(params, options)
options['alphaHiddenDimSize'] = params['w_alpha'].shape[0]
options['betaHiddenDimSize'] = params['W_beta'].shape[0]
options['inputDimSize'] = params['W_emb'].shape[0]
print 'Building the model ... ',
x, alpha, beta = build_model(tparams, options)
get_result = theano.function(inputs=[x], outputs=[alpha, beta], name='get_result')
print 'Loading data ... ',
testSet = load_data(seqFile, labelFile, timeFile)
print 'done'
types = pickle.load(open(typeFile, 'rb'))
rtypes = dict([(v,k) for k,v in types.iteritems()])
print 'Contribution calculation start!!'
count = 0
outfd = open(outFile, 'w')
for index in range(len(testSet[0])):
if count % 100 == 0: print 'processed %d patients' % count
count += 1
batchX = [testSet[0][index]]
label = testSet[1][index]
if useTime:
batchT = [testSet[2][index]]
x, t, lengths = padMatrixWithTime(batchX, batchT, options)
else:
x, lengths = padMatrixWithoutTime(batchX, options)
n_timesteps = x.shape[0]
n_samples = x.shape[1]
emb = np.dot(x, params['W_emb'])
if useTime:
temb = np.concatenate([emb, t.reshape((n_timesteps,n_samples,1))], axis=2)
else:
temb = emb
alpha, beta = get_result(temb)
alpha = alpha[:,0]
beta = beta[:,0,:]
ct = (alpha[:,None] * beta * emb[:,0,:]).sum(axis=0)
y_t = sigmoid(np.dot(ct, params['w_output']) + params['b_output'])
buf = ''
patient = batchX[0]
for i in range(len(patient)):
visit = patient[i]
buf += '-------------- visit_index:%d ---------------\n' % i
for j in range(len(visit)):
code = visit[j]
contribution = np.dot(params['w_output'].flatten(), alpha[i] * beta[i] * params['W_emb'][code])
buf += '%s:%f ' % (rtypes[code], contribution)
buf += '\n------------------------------------\n'
buf += 'patient_index:%d, label:%d, score:%f\n\n' % (index, label, y_t)
outfd.write(buf + '\n')
outfd.close()
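The loop above computes each code's contribution as w_output · (alpha_i * beta_i ⊙ W_emb[code]). Because the context vector is linear in the code embeddings, these per-code contributions plus the output bias reconstruct the pre-sigmoid logit exactly. A NumPy sketch of that decomposition with random toy weights (all shapes and values hypothetical):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
emb_dim = 6
W_emb = rng.standard_normal((10, emb_dim))  # toy code embeddings
w_out = rng.standard_normal(emb_dim)        # toy output weight
b_out = 0.1                                 # toy output bias

visits = [[1, 4], [7]]                      # one patient, two visits
alpha = np.array([0.3, 0.7])                # visit-level attention
beta = rng.standard_normal((2, emb_dim))    # dimension-level attention

# context vector: attention-weighted sum of visit embeddings
emb = np.array([W_emb[v].sum(axis=0) for v in visits])
ct = (alpha[:, None] * beta * emb).sum(axis=0)
score = sigmoid(w_out @ ct + b_out)

# per-code contributions, as in the inner loop above
contrib = {(i, c): w_out @ (alpha[i] * beta[i] * W_emb[c])
           for i, visit in enumerate(visits) for c in visit}
# the contributions decompose the pre-sigmoid logit exactly
logit = sum(contrib.values()) + b_out
```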
def parse_arguments(parser):
parser.add_argument('model_file', type=str, metavar='<model_file>', help='The path to the Numpy-compressed file containing the model parameters.')
parser.add_argument('seq_file', type=str, metavar='<visit_file>', help='The path to the cPickled file containing visit information of patients')
parser.add_argument('label_file', type=str, metavar='<label_file>', help='The path to the cPickled file containing label information of patients')
parser.add_argument('type_file', type=str, metavar='<type_file>', help='The path to the cPickled dictionary for mapping medical code strings to integers')
	parser.add_argument('out_file', metavar='<out_file>', help='The path to the output file where the per-visit, per-code contribution scores will be written')
parser.add_argument('--time_file', type=str, default='', help='The path to the cPickled file containing durations between visits of patients. If you are not using duration information, do not use this option')
parser.add_argument('--use_log_time', type=int, default=1, choices=[0,1], help='Use logarithm of time duration to dampen the impact of the outliers (0 for false, 1 for true) (default value: 1)')
parser.add_argument('--embed_file', type=str, default='', help='The path to the cPickled file containing the representation vectors of medical codes. If you are not using medical code representations, do not use this option')
args = parser.parse_args()
return args
if __name__ == '__main__':
parser = argparse.ArgumentParser()
args = parse_arguments(parser)
train_RETAIN(
modelFile=args.model_file,
seqFile=args.seq_file,
labelFile=args.label_file,
typeFile=args.type_file,
outFile=args.out_file,
timeFile=args.time_file,
useLogTime=args.use_log_time,
embFile=args.embed_file
)