[
  {
    "path": "License.txt",
    "content": "Implementations of the LF-LDA and LF-DMM latent feature topic models\n\nCopyright (C) 2015-2016 by Dat Quoc Nguyen \ndat.nguyen@students.mq.edu.au\nDepartment of Computing, Macquarie University, Australia\n\t\nThis program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.\n\nThis program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>"
  },
  {
    "path": "README.md",
    "content": "# LF-LDA and LF-DMM latent feature topic models\n\nImplementations of the LF-LDA and LF-DMM latent feature topic models, as described in my TACL paper:\n\nDat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. [Improving Topic Models with Latent Feature Word Representations](https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/582/158). <i>Transactions of the Association for Computational Linguistics</i>, vol. 3, pp. 299-313, 2015. [[.bib]](http://web.science.mq.edu.au/~dqnguyen/papers/TACL.bib) [[Datasets]](http://web.science.mq.edu.au/~dqnguyen/papers/TACL-datasets.zip) [[Example_20Newsgroups_20Topics_Top50Words]](http://web.science.mq.edu.au/~dqnguyen/papers/TACL_TopWords_N20_20Topics.zip)\n\nImplementations of the LDA and DMM topic models are available at [http://jldadmm.sourceforge.net/](http://jldadmm.sourceforge.net/)\n\n## Usage\n\nThis section describes how to use the implementations from the command line or a terminal, using the pre-compiled `LFTM.jar` file.\n\nIt is expected that Java 1.7+ is already available on the command line (for example, by adding Java to the `PATH` environment variable on Windows).\n\nThe pre-compiled `LFTM.jar` file and the source code are in the `jar` and `src` folders, respectively. Users can recompile the source code by simply running `ant` (it is also expected that `ant` is already installed). In addition, input examples can be found in the `test` folder.\n\n#### File format of input topic-modeling corpus\n\nSimilar to the `corpus.txt` file in the `test` folder, each line in the input topic-modeling corpus represents a document. Here, a document is a sequence of words/tokens separated by whitespace characters. The users should preprocess the input topic-modeling corpus before training the topic models, for example: down-casing, removing non-alphabetic characters and stop-words, and removing words shorter than 3 characters or appearing fewer than a certain number of times. 
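The preprocessing steps listed above can be sketched as follows. This is a minimal illustration only, not part of this repository; the class name `PreprocessCorpus` and the count threshold parameter are my own, and stop-word removal is deliberately omitted for brevity:

```java
import java.util.*;

public class PreprocessCorpus {
    // Down-case each document, strip non-alphabetic characters from tokens,
    // and drop words shorter than 3 characters or appearing fewer than
    // minCount times across the whole corpus.
    public static List<List<String>> preprocess(List<String> docs, int minCount) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> words = new ArrayList<>();
            for (String tok : doc.toLowerCase().split("\\s+")) {
                String w = tok.replaceAll("[^a-z]", "");
                if (w.length() >= 3) {
                    words.add(w);
                    counts.merge(w, 1, Integer::sum);
                }
            }
            tokenized.add(words);
        }
        // Second pass: keep only sufficiently frequent words.
        List<List<String>> result = new ArrayList<>();
        for (List<String> words : tokenized) {
            List<String> kept = new ArrayList<>();
            for (String w : words)
                if (counts.get(w) >= minCount)
                    kept.add(w);
            result.add(kept);
        }
        return result;
    }
}
```

Each returned inner list is one preprocessed document, ready to be written out one document per line.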
\n\n#### Format of input word-vector file\n\nSimilar to the `wordVectors.txt` file in the `test` folder, each line in the input word-vector file starts with a word type, followed by its vector representation.\n\nTo obtain vector representations of words, the users can either use pre-trained word vectors learned from large external corpora OR word vectors trained on the input topic-modeling corpus itself.\n\nWhen using pre-trained word vectors learned from large external corpora, the users have to remove from the input topic-modeling corpus any word that is not found in the input word-vector file.\n\nSome sets of pre-trained word vectors can be found at:\n\n[Word2Vec: https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/)\n\n[Glove: http://nlp.stanford.edu/projects/glove/](http://nlp.stanford.edu/projects/glove/) \n\nIf the input topic-modeling corpus is very domain-specific, the domain of the external corpus (from which the word vectors are derived) should not be too different from that of the input topic-modeling corpus. For example, when working in the biomedical domain, the users may use Word2Vec or Glove to learn 50- or 100-dimensional word vectors on the large external MEDLINE corpus instead of using the pre-trained Word2Vec or Glove word vectors. \n\n\n### Training LF-LDA and LF-DMM\n\n`$ java [-Xmx2G] -jar jar/LFTM.jar -model <LFLDA_or_LFDMM> -corpus <Input_corpus_file_path> -vectors <Input_vector_file_path> [-ntopics <int>] [-alpha <double>] [-beta <double>] [-lambda <double>] [-initers <int>] [-niters <int>] [-twords <int>] [-name <String>] [-sstep <int>]`\n\nwhere the hyper-parameters in [ ] are optional.\n\n* `-model`: Specify the topic model.\n\n* `-corpus`: Specify the path to the input training corpus file.\n\n* `-vectors`: Specify the path to the file containing word vectors.\n\n* `-ntopics <int>`: Specify the number of topics. 
The default value is 20.\n\n* `-alpha <double>`: Specify the hyper-parameter alpha. Following [1, 2], the default value is 0.1.\n\n* `-beta <double>`: Specify the hyper-parameter beta. The default value is 0.01. Following [2], you might also want to try a beta value of 0.1 for short texts.\n\n* `-lambda <double>`: Specify the mixture weight lambda (0.0 < lambda <= 1.0). Setting lambda to 1.0 gives the best topic coherence. For document clustering/classification evaluation, fine-tune this parameter to obtain the highest results if you have time; otherwise, try both 0.6 and 1.0 (I would suggest lambda of 0.6 for normal text corpora and 1.0 for short text corpora if you don't have time to try both).\n\n* `-initers <int>`: Specify the number of initial sampling iterations, which separate the counts for the latent feature component and the Dirichlet multinomial component. The default value is 2000.\n\n* `-niters <int>`: Specify the number of sampling iterations for the latent feature topic models. The default value is 200.\n\n* `-twords <int>`: Specify the number of the most probable topical words. The default value is 20.\n\n* `-name <String>`: Specify a name for the topic-modeling experiment. The default value is `model`.\n\n* `-sstep <int>`: Specify a step at which to save the sampling output (the `-sstep` value must be smaller than the `-niters` value). The default value is 0 (i.e. only the output from the last sample is saved).\n\nNOTE that topic vectors are learned in parallel, so running the LFTM code on a multi-core machine results in significantly faster training, e.g. 
setting the number of CPUs requested for a remote job equal to the number of topics.\n\n<b>Examples:</b>\n\n`$ java -jar jar/LFTM.jar -model LFLDA -corpus test/corpus.txt -vectors test/wordVectors.txt -ntopics 4 -alpha 0.1 -beta 0.01 -lambda 0.6 -initers 500 -niters 50 -name testLFLDA`\n\nWith this command, we run 500 `LDA` sampling iterations (i.e., `-initers 500`) for initialization and then run 50 `LF-LDA` sampling iterations (i.e., `-niters 50`). The output files are saved in the same folder as the input training corpus file, in this case the `test` folder. We obtain the output files `testLFLDA.theta`, `testLFLDA.phi`, `testLFLDA.topWords`, `testLFLDA.topicAssignments` and `testLFLDA.paras`, containing the document-to-topic distributions, topic-to-word distributions, top topical words, topic assignments and model hyper-parameters, respectively. Similarly, we perform:\n\n`$ java -jar jar/LFTM.jar -model LFDMM -corpus test/corpus.txt -vectors test/wordVectors.txt -ntopics 4 -alpha 0.1 -beta 0.1 -lambda 1.0 -initers 500 -niters 50 -name testLFDMM`\n\nWe obtain the output files `testLFDMM.theta`, `testLFDMM.phi`, `testLFDMM.topWords`, `testLFDMM.topicAssignments` and `testLFDMM.paras`.\n\nIn the LF-LDA and LF-DMM latent feature topic models, a word is generated either by the latent feature topic-to-word component OR by the topic-to-word Dirichlet multinomial component. In the implementation, instead of using a binary selection variable to record which component generated a word, I simply add the number of topics to the actual topic assignment value whenever the word is generated by the Dirichlet multinomial component. For example, with 20 topics, the output topic assignment `3 23 4 4 24 3 23 3 23 3 23` for a document means that the first word in the document is generated from topic 3 by the latent feature topic-to-word component. The second word is also generated from topic `23 - 20 = 3`, but by the topic-to-word Dirichlet multinomial component. 
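The assignment encoding just described can be decoded in a few lines. The `DecodeAssignments` helper below is illustrative only, not part of this repository:

```java
public class DecodeAssignments {
    /** Recover the actual topic from an output topic-assignment value. */
    public static int topic(int assignment, int numTopics) {
        // Values >= numTopics mark words generated by the Dirichlet
        // multinomial component; subtracting numTopics recovers the topic.
        return assignment >= numTopics ? assignment - numTopics : assignment;
    }

    /** True if the word was generated by the latent feature component. */
    public static boolean byLatentFeatureComponent(int assignment, int numTopics) {
        return assignment < numTopics;
    }
}
```

For instance, with 20 topics, an assignment value of 23 decodes to topic 3, generated by the Dirichlet multinomial component.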
The same applies to the remaining words in the document.\n\n### Document clustering evaluation\n\nHere, we treat each topic as a cluster, and we assign to every document the topic with the highest probability given that document. To get the clustering scores of Purity and normalized mutual information (NMI), we perform:\n\n`$ java -jar jar/LFTM.jar -model Eval -label <Golden_label_file_path> -dir <Directory_path> -prob <Document-topic-prob/Suffix>`\n\n* `-label`: Specify the path to the ground-truth label file. Each line in this label file contains the golden label of the corresponding document in the input training corpus. See the `corpus.LABEL` and `corpus.txt` files in the `test` folder.\n\n* `-dir`: Specify the path to the directory containing document-to-topic distribution files.\n\n* `-prob`: Specify a document-to-topic distribution file or a group of document-to-topic distribution files in the specified directory.\n\n<b>Examples:</b>\n\nThe command `$ java -jar jar/LFTM.jar -model Eval -label test/corpus.LABEL -dir test -prob testLFLDA.theta` will produce the clustering scores for the `testLFLDA.theta` file.\n\nThe command `$ java -jar jar/LFTM.jar -model Eval -label test/corpus.LABEL -dir test -prob testLFDMM.theta` will produce the clustering scores for the `testLFDMM.theta` file.\n\nThe command `$ java -jar jar/LFTM.jar -model Eval -label test/corpus.LABEL -dir test -prob theta` will produce the clustering scores for all the document-to-topic distribution files whose names end with `theta`. In this case, the distribution files are `testLFLDA.theta` and `testLFDMM.theta`. 
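As a reference for what the Purity score measures: each output cluster is credited with its largest overlap with any golden class, and the credits are summed and normalized by the number of documents. A minimal sketch (the hypothetical `PuritySketch` class below is for illustration, not the repository's `ClusteringEval` implementation):

```java
import java.util.*;

public class PuritySketch {
    // Purity: match each output cluster to the golden class it overlaps most,
    // sum the matched counts, and divide by the total number of documents.
    public static double purity(int[] predicted, String[] gold) {
        Map<Integer, Map<String, Integer>> overlap = new HashMap<>();
        for (int i = 0; i < predicted.length; i++)
            overlap.computeIfAbsent(predicted[i], k -> new HashMap<>())
                   .merge(gold[i], 1, Integer::sum);
        int correct = 0;
        for (Map<String, Integer> byClass : overlap.values())
            correct += Collections.max(byClass.values());
        return correct / (double) predicted.length;
    }
}
```

For example, clusters `{0, 0, 1, 1}` against golden labels `{a, a, a, b}` give a purity of (2 + 1) / 4 = 0.75.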
It also provides the mean and standard deviation of the clustering scores.\n\n### Inference of topic distribution on unseen corpus\n\nTo infer topics on an unseen/new corpus using a pre-trained LF-LDA/LF-DMM topic model, we perform:\n\n`$ java -jar jar/LFTM.jar -model <LFLDAinf_or_LFDMMinf> -paras <Hyperparameter_file_path> -corpus <Unseen_corpus_file_path> [-initers <int>] [-niters <int>] [-twords <int>] [-name <String>] [-sstep <int>]`\n\n* `-paras`: Specify the path to the hyper-parameter file produced by the pre-trained LF-LDA/LF-DMM topic model.\n\n<b>Examples:</b>\n\n`$ java -jar jar/LFTM.jar -model LFLDAinf -paras test/testLFLDA.paras -corpus test/corpus_test.txt -initers 500 -niters 50 -name testLFLDAinf`\n\n`$ java -jar jar/LFTM.jar -model LFDMMinf -paras test/testLFDMM.paras -corpus test/corpus_test.txt -initers 500 -niters 50 -name testLFDMMinf`\n\n## Acknowledgments\n\nThe LF-LDA and LF-DMM implementations use utilities including the LBFGS implementation from the [MALLET toolkit](http://mallet.cs.umass.edu/), the random number generator from the [Java version of MersenneTwister](http://cs.gmu.edu/~sean/research/), the `Parallel.java` utility from the [Mines Java Toolkit](http://dhale.github.io/jtk/api/edu/mines/jtk/util/Parallel.html) and the [Java command line arguments parser](http://args4j.kohsuke.org/sample.html). I would like to thank the authors of these utilities for sharing their code.\n\n## References\n\n[1] Yue Lu, Qiaozhu Mei, and ChengXiang Zhai. 2011. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14:178–203.\n\n[2] Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.\n"
  },
  {
    "path": "build.xml",
    "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<project name=\"LFTM\" basedir=\".\" default=\"main\">\n\n    <property name=\"src.dir\"     value=\"src\"/>\n\t<property name=\"lib.dir\" location=\"lib\" />\n    <property name=\"classes.dir\" value=\"bin\"/>\n    <property name=\"jar.dir\"     value=\"jar\"/>\n\n    <property name=\"main-class\"  value=\"LFTM\"/>\n\n\t<path id=\"build.classpath\">\n\t\t<fileset dir=\"${lib.dir}\">\n\t\t\t<include name=\"**/*.jar\" />\n\t\t</fileset>\n\t</path>\n\n    <target name=\"clean\">\n        <delete dir=\"${classes.dir}\"/>\n\t\t<delete dir=\"${jar.dir}\"/>\n    </target>\n\n\t<presetdef name=\"javac\">\n\t\t<javac includeantruntime=\"false\" />\n\t</presetdef>\n\t\n    <target name=\"compile\">\n        <mkdir dir=\"${classes.dir}\"/>\n        <javac srcdir=\"${src.dir}\" destdir=\"${classes.dir}\" classpathref=\"build.classpath\"/>\n    </target>\n\t\n    <target name=\"jar\" depends=\"compile\">\n        <mkdir dir=\"${jar.dir}\"/>\n        <jar destfile=\"${jar.dir}/${ant.project.name}.jar\" basedir=\"${classes.dir}\">\n            <manifest>\n\t\t\t\t<attribute name=\"Class-Path\" value=\"${build.classpath}\"/>\n                <attribute name=\"Main-Class\" value=\"${main-class}\"/>\n            </manifest>\n\t\t\t<zipgroupfileset  dir=\"${lib.dir}\"/>\n        </jar>\n    </target>\n\t\n\t<target name=\"run\" depends=\"jar\">\n        <java fork=\"true\" classname=\"${main-class}\">\n            <classpath>\n                <path refid=\"build.classpath\"/>\n                <pathelement location=\"${jar.dir}/${ant.project.name}.jar\"/>\n            </classpath>\n        </java>\n    </target>\n\t\n    <target name=\"clean-build\" depends=\"clean,jar\"/>\n    <target name=\"main\" depends=\"clean,compile,jar\"/>\n</project>"
  },
  {
    "path": "src/LFTM.java",
    "content": "import models.LFDMM;\nimport models.LFDMM_Inf;\nimport models.LFLDA;\nimport models.LFLDA_Inf;\n\nimport org.kohsuke.args4j.CmdLineException;\nimport org.kohsuke.args4j.CmdLineParser;\n\nimport utility.CmdArgs;\nimport eval.ClusteringEval;\n\n/**\n * Implementations of the LF-LDA and LF-DMM latent feature topic models, using\n * collapsed Gibbs sampling, as described in:\n * \n * Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015.\n * Improving Topic Models with Latent Feature Word Representations. Transactions\n * of the Association for Computational Linguistics, vol. 3, pp. 299-313.\n * \n * @author Dat Quoc Nguyen\n * \n */\npublic class LFTM\n{\n\tpublic static void main(String[] args)\n\t{\n\t\tCmdArgs cmdArgs = new CmdArgs();\n\t\tCmdLineParser parser = new CmdLineParser(cmdArgs);\n\t\ttry {\n\n\t\t\tparser.parseArgument(args);\n\n\t\t\tif (cmdArgs.model.equals(\"LFLDA\")) {\n\t\t\t\tLFLDA lflda = new LFLDA(cmdArgs.corpus, cmdArgs.vectors,\n\t\t\t\t\tcmdArgs.ntopics, cmdArgs.alpha, cmdArgs.beta,\n\t\t\t\t\tcmdArgs.lambda, cmdArgs.initers, cmdArgs.niters,\n\t\t\t\t\tcmdArgs.twords, cmdArgs.expModelName,\n\t\t\t\t\tcmdArgs.initTopicAssgns, cmdArgs.savestep);\n\t\t\t\tlflda.inference();\n\t\t\t}\n\t\t\telse if (cmdArgs.model.equals(\"LFDMM\")) {\n\t\t\t\tLFDMM lfdmm = new LFDMM(cmdArgs.corpus, cmdArgs.vectors,\n\t\t\t\t\tcmdArgs.ntopics, cmdArgs.alpha, cmdArgs.beta,\n\t\t\t\t\tcmdArgs.lambda, cmdArgs.initers, cmdArgs.niters,\n\t\t\t\t\tcmdArgs.twords, cmdArgs.expModelName,\n\t\t\t\t\tcmdArgs.initTopicAssgns, cmdArgs.savestep);\n\t\t\t\tlfdmm.inference();\n\t\t\t}\n\t\t\telse if (cmdArgs.model.equals(\"LFLDAinf\")) {\n\t\t\t\tLFLDA_Inf lfldaInf = new LFLDA_Inf(cmdArgs.paras,\n\t\t\t\t\tcmdArgs.corpus, cmdArgs.initers, cmdArgs.niters,\n\t\t\t\t\tcmdArgs.twords, cmdArgs.expModelName, cmdArgs.savestep);\n\t\t\t\tlfldaInf.inference();\n\t\t\t}\n\t\t\telse if (cmdArgs.model.equals(\"LFDMMinf\")) {\n\t\t\t\tLFDMM_Inf lfdmmInf = new 
LFDMM_Inf(cmdArgs.paras,\n\t\t\t\t\tcmdArgs.corpus, cmdArgs.initers, cmdArgs.niters,\n\t\t\t\t\tcmdArgs.twords, cmdArgs.expModelName, cmdArgs.savestep);\n\t\t\t\tlfdmmInf.inference();\n\t\t\t}\n\t\t\telse if (cmdArgs.model.equals(\"Eval\")) {\n\t\t\t\tClusteringEval.evaluate(cmdArgs.labelFile, cmdArgs.dir,\n\t\t\t\t\tcmdArgs.prob);\n\t\t\t}\n\t\t\telse {\n\t\t\t\tSystem.out\n\t\t\t\t\t.println(\"Error: Option \\\"-model\\\" must get \\\"LFLDA\\\" or \\\"LFDMM\\\" or \\\"LFLDAinf\\\" or \\\"LFDMMinf\\\" or \\\"Eval\\\"\");\n\t\t\t\tSystem.out.println(\"\\tLFLDA: Specify the LF-LDA topic model\");\n\t\t\t\tSystem.out.println(\"\\tLFDMM: Specify the LF-DMM topic model\");\n\t\t\t\tSystem.out\n\t\t\t\t\t.println(\"\\tLFLDAinf: Infer topics for unseen corpus using a pre-trained LF-LDA model\");\n\t\t\t\tSystem.out\n\t\t\t\t\t.println(\"\\tLFDMMinf: Infer topics for unseen corpus using a pre-trained LF-DMM model\");\n\t\t\t\tSystem.out\n\t\t\t\t\t.println(\"\\tEval: Specify the document clustering evaluation\");\n\t\t\t\thelp(parser);\n\t\t\t\treturn;\n\t\t\t}\n\t\t}\n\t\tcatch (CmdLineException cle) {\n\t\t\tSystem.out.println(\"Error: \" + cle.getMessage());\n\t\t\thelp(parser);\n\t\t\treturn;\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\tSystem.out.println(\"Error: \" + e.getMessage());\n\t\t\te.printStackTrace();\n\t\t\treturn;\n\t\t}\n\t}\n\n\tpublic static void help(CmdLineParser parser)\n\t{\n\t\tSystem.out.println(\"java -jar LFTM.jar [options ...] [arguments...]\");\n\t\tparser.printUsage(System.out);\n\t}\n}\n"
  },
  {
    "path": "src/eval/ClusteringEval.java",
    "content": "package eval;\n\nimport java.io.BufferedReader;\nimport java.io.BufferedWriter;\nimport java.io.File;\nimport java.io.FileReader;\nimport java.io.FileWriter;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.HashSet;\nimport java.util.List;\nimport java.util.Set;\n\nimport utility.FuncUtils;\n\n/**\n * Implementation of the Purity and NMI clustering evaluation scores, as described in Section 16.3\n * in:\n * \n * Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to\n * Information Retrieval. Cambridge University Press.\n * \n * @author: Dat Quoc Nguyen\n */\n\npublic class ClusteringEval\n{\n    String pathDocTopicProsFile;\n\n    String pathGoldenLabelsFile;\n\n    HashMap<String, Set<Integer>> goldenClusers;\n    HashMap<String, Set<Integer>> outputClusers;\n\n    int numDocs;\n\n    public ClusteringEval(String inPathGoldenLabelsFile, String inPathDocTopicProsFile)\n        throws Exception\n    {\n        pathDocTopicProsFile = inPathDocTopicProsFile;\n        pathGoldenLabelsFile = inPathGoldenLabelsFile;\n\n        goldenClusers = new HashMap<String, Set<Integer>>();\n        outputClusers = new HashMap<String, Set<Integer>>();\n\n        readGoldenLabelsFile();\n        readDocTopicProsFile();\n    }\n\n    public void readGoldenLabelsFile()\n        throws Exception\n    {\n        System.out.println(\"Reading golden labels file \" + pathGoldenLabelsFile);\n\n        int id = 0;\n\n        BufferedReader br = null;\n        try {\n            br = new BufferedReader(new FileReader(pathGoldenLabelsFile));\n            for (String label; (label = br.readLine()) != null;) {\n                label = label.trim();\n                Set<Integer> ids = new HashSet<Integer>();\n                if (goldenClusers.containsKey(label))\n                    ids = goldenClusers.get(label);\n                ids.add(id);\n                goldenClusers.put(label, ids);\n                id += 1;\n       
     }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n        numDocs = id;\n    }\n\n    public void readDocTopicProsFile()\n        throws Exception\n    {\n        System.out.println(\"Reading document-to-topic distribution file \" + pathDocTopicProsFile);\n\n        HashMap<Integer, String> docLabelOutput = new HashMap<Integer, String>();\n\n        int docIndex = 0;\n\n        BufferedReader br = null;\n        try {\n            br = new BufferedReader(new FileReader(pathDocTopicProsFile));\n\n            for (String docTopicProbs; (docTopicProbs = br.readLine()) != null;) {\n                String[] pros = docTopicProbs.trim().split(\"\\\\s+\");\n                double maxPro = 0.0;\n                int index = -1;\n                for (int topicIndex = 0; topicIndex < pros.length; topicIndex++) {\n                    double pro = new Double(pros[topicIndex]);\n                    if (pro > maxPro) {\n                        maxPro = pro;\n                        index = topicIndex;\n                    }\n                }\n                docLabelOutput.put(docIndex, \"Topic_\" + new Integer(index).toString());\n                docIndex++;\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n\n        if (numDocs != docIndex) {\n            System.out\n                    .println(\"Error: the number of documents is different to the number of labels!\");\n            throw new Exception();\n        }\n\n        for (Integer id : docLabelOutput.keySet()) {\n            String label = docLabelOutput.get(id);\n            Set<Integer> ids = new HashSet<Integer>();\n            if (outputClusers.containsKey(label))\n                ids = outputClusers.get(label);\n            ids.add(id);\n            outputClusers.put(label, ids);\n        }\n\n    }\n\n    public double computePurity()\n    {\n        int count = 0;\n        for (String label : outputClusers.keySet()) 
{\n            Set<Integer> docs = outputClusers.get(label);\n            int correctAssignedDocNum = 0;\n            for (String goldenLabel : goldenClusers.keySet()) {\n                Set<Integer> goldenDocs = goldenClusers.get(goldenLabel);\n                Set<Integer> outputDocs = new HashSet<Integer>(docs);\n                outputDocs.retainAll(goldenDocs);\n                if (outputDocs.size() >= correctAssignedDocNum)\n                    correctAssignedDocNum = outputDocs.size();\n            }\n            count += correctAssignedDocNum;\n        }\n        double value = count * 1.0 / numDocs;\n        System.out.println(\"\\tPurity accuracy: \" + value);\n        return value;\n    }\n\n    public double computeNMIscore()\n    {\n        double MIscore = 0.0;\n        for (String label : outputClusers.keySet()) {\n            Set<Integer> docs = outputClusers.get(label);\n            for (String goldenLabel : goldenClusers.keySet()) {\n                Set<Integer> goldenDocs = goldenClusers.get(goldenLabel);\n                Set<Integer> outputDocs = new HashSet<Integer>(docs);\n                outputDocs.retainAll(goldenDocs);\n                double numCorrectAssignedDocs = outputDocs.size() * 1.0;\n                if (numCorrectAssignedDocs == 0.0)\n                    continue;\n                MIscore += (numCorrectAssignedDocs / numDocs)\n                        * Math.log(numCorrectAssignedDocs * numDocs\n                                / (docs.size() * goldenDocs.size()));\n            }\n\n        }\n        double entropy = 0.0;\n        for (String label : outputClusers.keySet()) {\n            Set<Integer> docs = outputClusers.get(label);\n            entropy += (-1.0 * docs.size() / numDocs) * Math.log(1.0 * docs.size() / numDocs);\n        }\n\n        for (String label : goldenClusers.keySet()) {\n            Set<Integer> docs = goldenClusers.get(label);\n            entropy += (-1.0 * docs.size() / numDocs) * Math.log(1.0 * docs.size() 
/ numDocs);\n        }\n\n        double value = 2 * MIscore / entropy;\n        System.out.println(\"\\tNMI score: \" + value);\n        return value;\n    }\n\n    public static void evaluate(String pathGoldenLabelsFile,\n            String pathToFolderOfDocTopicProsFiles, String suffix)\n        throws Exception\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(pathToFolderOfDocTopicProsFiles\n                + \"/\" + suffix + \".PurityNMI\"));\n        writer.write(\"Golden-labels in: \" + pathGoldenLabelsFile + \"\\n\\n\");\n        File[] files = new File(pathToFolderOfDocTopicProsFiles).listFiles();\n\n        List<Double> purity = new ArrayList<Double>(), nmi = new ArrayList<Double>();\n        for (File file : files) {\n            if (!file.getName().endsWith(suffix))\n                continue;\n            writer.write(\"Results for: \" + file.getAbsolutePath() + \"\\n\");\n            ClusteringEval dce = new ClusteringEval(pathGoldenLabelsFile, file.getAbsolutePath());\n            double value = dce.computePurity();\n            writer.write(\"\\tPurity: \" + value + \"\\n\");\n            purity.add(value);\n            value = dce.computeNMIscore();\n            writer.write(\"\\tNMI: \" + value + \"\\n\");\n            nmi.add(value);\n        }\n        if (purity.size() == 0 || nmi.size() == 0) {\n            System.out.println(\"Error: There is no file ending with \" + suffix);\n            throw new Exception();\n        }\n\n        double[] purityValues = new double[purity.size()];\n        double[] nmiValues = new double[nmi.size()];\n\n        for (int i = 0; i < purity.size(); i++)\n            purityValues[i] = purity.get(i).doubleValue();\n        for (int i = 0; i < nmi.size(); i++)\n            nmiValues[i] = nmi.get(i).doubleValue();\n\n        writer.write(\"\\n---\\nMean purity: \" + FuncUtils.mean(purityValues)\n                + \", standard deviation: \" + FuncUtils.stddev(purityValues));\n\n        
writer.write(\"\\nMean NMI: \" + FuncUtils.mean(nmiValues) + \", standard deviation: \"\n                + FuncUtils.stddev(nmiValues));\n\n        System.out.println(\"---\\nMean purity: \" + FuncUtils.mean(purityValues)\n                + \", standard deviation: \" + FuncUtils.stddev(purityValues));\n\n        System.out.println(\"Mean NMI: \" + FuncUtils.mean(nmiValues) + \", standard deviation: \"\n                + FuncUtils.stddev(nmiValues));\n\n        writer.close();\n    }\n\n    public static void main(String[] args)\n        throws Exception\n    {\n        ClusteringEval.evaluate(\"test/corpus.LABEL\", \"test\", \"theta\");\n    }\n}\n"
  },
  {
    "path": "src/models/LFDMM.java",
    "content": "package models;\n\nimport java.io.BufferedReader;\nimport java.io.BufferedWriter;\nimport java.io.FileReader;\nimport java.io.FileWriter;\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.Set;\nimport java.util.TreeMap;\n\nimport utility.FuncUtils;\nimport utility.LBFGS;\nimport utility.Parallel;\nimport cc.mallet.optimize.InvalidOptimizableException;\nimport cc.mallet.optimize.Optimizer;\nimport cc.mallet.types.MatrixOps;\nimport cc.mallet.util.Randoms;\n\n/**\n * Implementation of the LF-DMM latent feature topic model, using collapsed Gibbs sampling, as\n * described in:\n * \n * Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with\n * Latent Feature Word Representations. Transactions of the Association for Computational\n * Linguistics, vol. 3, pp. 299-313.\n * \n * @author Dat Quoc Nguyen\n */\n\npublic class LFDMM\n{\n    public double alpha; // Hyper-parameter alpha\n    public double beta; // Hyper-parameter beta\n    // public double alphaSum; // alpha * numTopics\n    public double betaSum; // beta * vocabularySize\n\n    public int numTopics; // Number of topics\n    public int topWords; // Number of most probable words for each topic\n\n    public double lambda; // Mixture weight value\n    public int numInitIterations;\n    public int numIterations; // Number of EM-style sampling iterations\n\n    public List<List<Integer>> corpus; // Word ID-based corpus\n    public List<List<Integer>> topicAssignments; // Topic assignments for words\n                                                 // in the corpus\n    public int numDocuments; // Number of documents in the corpus\n    public int numWordsInCorpus; // Number of words in the corpus\n\n    public HashMap<String, Integer> word2IdVocabulary; // Vocabulary to get ID\n                                                       // given a word\n    public 
HashMap<Integer, String> id2WordVocabulary; // Vocabulary to get word\n                                                       // given an ID\n    public int vocabularySize; // The number of word types in the corpus\n\n    // Number of documents assigned to a topic\n    public int[] docTopicCount;\n    // numTopics * vocabularySize matrix\n    // Given a topic: number of times a word type generated from the topic by\n    // the Dirichlet multinomial component\n    public int[][] topicWordCountDMM;\n    // Total number of words generated from each topic by the Dirichlet\n    // multinomial component\n    public int[] sumTopicWordCountDMM;\n    // numTopics * vocabularySize matrix\n    // Given a topic: number of times a word type generated from the topic by\n    // the latent feature component\n    public int[][] topicWordCountLF;\n    // Total number of words generated from each topic by the latent feature\n    // component\n    public int[] sumTopicWordCountLF;\n\n    // Double array used to sample a topic\n    public double[] multiPros;\n    // Path to the directory containing the corpus\n    public String folderPath;\n    // Path to the topic modeling corpus\n    public String corpusPath;\n    public String vectorFilePath;\n\n    public double[][] wordVectors; // Vector representations for words\n    public double[][] topicVectors;// Vector representations for topics\n    public int vectorSize; // Number of vector dimensions\n    public double[][] dotProductValues;\n    public double[][] expDotProductValues;\n    public double[] sumExpValues; // Partition function values\n\n    public final double l2Regularizer = 0.01; // L2 regularizer value for learning topic vectors\n    public final double tolerance = 0.05; // Tolerance value for LBFGS convergence\n\n    public String expName = \"LFDMM\";\n    public String orgExpName = \"LFDMM\";\n    public String tAssignsFilePath = \"\";\n    public int savestep = 0;\n\n    public LFDMM(String pathToCorpus, String 
pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, \"LFDMM\");\n    }\n\n    public LFDMM(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, inExpName, \"\", 0);\n    }\n\n    public LFDMM(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName, String pathToTAfile)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, inExpName, pathToTAfile, 0);\n    }\n\n    public LFDMM(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName, int inSaveStep)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, inExpName, \"\", inSaveStep);\n    }\n\n    public LFDMM(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n  
          int inNumIterations, int inTopWords, String inExpName, String pathToTAfile,\n            int inSaveStep)\n        throws Exception\n    {\n        alpha = inAlpha;\n        beta = inBeta;\n        lambda = inLambda;\n        numTopics = inNumTopics;\n        numIterations = inNumIterations;\n        numInitIterations = inNumInitIterations;\n        topWords = inTopWords;\n        savestep = inSaveStep;\n        expName = inExpName;\n        orgExpName = expName;\n        vectorFilePath = pathToWordVectorsFile;\n        corpusPath = pathToCorpus;\n        folderPath = pathToCorpus.substring(0,\n                Math.max(pathToCorpus.lastIndexOf(\"/\"), pathToCorpus.lastIndexOf(\"\\\\\")) + 1);\n\n        System.out.println(\"Reading topic modeling corpus: \" + pathToCorpus);\n\n        word2IdVocabulary = new HashMap<String, Integer>();\n        id2WordVocabulary = new HashMap<Integer, String>();\n        corpus = new ArrayList<List<Integer>>();\n        numDocuments = 0;\n        numWordsInCorpus = 0;\n\n        BufferedReader br = null;\n        try {\n            int indexWord = -1;\n            br = new BufferedReader(new FileReader(pathToCorpus));\n            for (String doc; (doc = br.readLine()) != null;) {\n\n                if (doc.trim().length() == 0)\n                    continue;\n\n                String[] words = doc.trim().split(\"\\\\s+\");\n                List<Integer> document = new ArrayList<Integer>();\n\n                for (String word : words) {\n                    if (word2IdVocabulary.containsKey(word)) {\n                        document.add(word2IdVocabulary.get(word));\n                    }\n                    else {\n                        indexWord += 1;\n                        word2IdVocabulary.put(word, indexWord);\n                        id2WordVocabulary.put(indexWord, word);\n                        document.add(indexWord);\n                    }\n                }\n\n                numDocuments++;\n              
  numWordsInCorpus += document.size();\n                corpus.add(document);\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n\n        vocabularySize = word2IdVocabulary.size();\n        docTopicCount = new int[numTopics];\n        topicWordCountDMM = new int[numTopics][vocabularySize];\n        sumTopicWordCountDMM = new int[numTopics];\n        topicWordCountLF = new int[numTopics][vocabularySize];\n        sumTopicWordCountLF = new int[numTopics];\n\n        multiPros = new double[numTopics];\n        for (int i = 0; i < numTopics; i++) {\n            multiPros[i] = 1.0 / numTopics;\n        }\n\n        // alphaSum = numTopics * alpha;\n        betaSum = vocabularySize * beta;\n\n        readWordVectorsFile(vectorFilePath);\n        topicVectors = new double[numTopics][vectorSize];\n        dotProductValues = new double[numTopics][vocabularySize];\n        expDotProductValues = new double[numTopics][vocabularySize];\n        sumExpValues = new double[numTopics];\n\n        System.out\n                .println(\"Corpus size: \" + numDocuments + \" docs, \" + numWordsInCorpus + \" words\");\n        System.out.println(\"Vocabulary size: \" + vocabularySize);\n        System.out.println(\"Number of topics: \" + numTopics);\n        System.out.println(\"alpha: \" + alpha);\n        System.out.println(\"beta: \" + beta);\n        System.out.println(\"lambda: \" + lambda);\n        System.out.println(\"Number of initial sampling iterations: \" + numInitIterations);\n        System.out.println(\"Number of EM-style sampling iterations for the LF-DMM model: \"\n                + numIterations);\n        System.out.println(\"Number of top topical words: \" + topWords);\n\n        tAssignsFilePath = pathToTAfile;\n        if (tAssignsFilePath.length() > 0)\n            initialize(tAssignsFilePath);\n        else\n            initialize();\n\n    }\n\n    public void readWordVectorsFile(String pathToWordVectorsFile)\n  
      throws Exception\n    {\n        System.out.println(\"Reading word vectors from word-vectors file \" + pathToWordVectorsFile\n                + \"...\");\n\n        BufferedReader br = null;\n        try {\n            br = new BufferedReader(new FileReader(pathToWordVectorsFile));\n            String[] elements = br.readLine().trim().split(\"\\\\s+\");\n            vectorSize = elements.length - 1;\n            wordVectors = new double[vocabularySize][vectorSize];\n            String word = elements[0];\n            if (word2IdVocabulary.containsKey(word)) {\n                for (int j = 0; j < vectorSize; j++) {\n                    wordVectors[word2IdVocabulary.get(word)][j] = new Double(elements[j + 1]);\n                }\n            }\n            for (String line; (line = br.readLine()) != null;) {\n                elements = line.trim().split(\"\\\\s+\");\n                word = elements[0];\n                if (word2IdVocabulary.containsKey(word)) {\n                    for (int j = 0; j < vectorSize; j++) {\n                        wordVectors[word2IdVocabulary.get(word)][j] = new Double(elements[j + 1]);\n                    }\n                }\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n\n        for (int i = 0; i < vocabularySize; i++) {\n            if (MatrixOps.absNorm(wordVectors[i]) == 0.0) {\n                System.out.println(\"The word \\\"\" + id2WordVocabulary.get(i)\n                        + \"\\\" doesn't have a corresponding vector!!!\");\n                throw new Exception();\n            }\n        }\n    }\n\n    public void initialize()\n        throws IOException\n    {\n        System.out.println(\"Randomly initializing topic assignments ...\");\n        topicAssignments = new ArrayList<List<Integer>>();\n\n        for (int docId = 0; docId < numDocuments; docId++) {\n            List<Integer> topics = new ArrayList<Integer>();\n            int topic = 
FuncUtils.nextDiscrete(multiPros);\n            docTopicCount[topic] += 1;\n            int docSize = corpus.get(docId).size();\n            for (int j = 0; j < docSize; j++) {\n                int wordId = corpus.get(docId).get(j);\n                boolean component = new Randoms().nextBoolean();\n                int subtopic = topic;\n                if (!component) { // Generated from the latent feature component\n                    topicWordCountLF[topic][wordId] += 1;\n                    sumTopicWordCountLF[topic] += 1;\n                }\n                else {// Generated from the Dirichlet multinomial component\n                    topicWordCountDMM[topic][wordId] += 1;\n                    sumTopicWordCountDMM[topic] += 1;\n                    subtopic = subtopic + numTopics;\n                }\n                topics.add(subtopic);\n            }\n            topicAssignments.add(topics);\n        }\n    }\n\n    public void initialize(String pathToTopicAssignmentFile)\n        throws Exception\n    {\n        System.out.println(\"Reading topic-assignment file: \" + pathToTopicAssignmentFile);\n\n        topicAssignments = new ArrayList<List<Integer>>();\n\n        BufferedReader br = null;\n        try {\n            br = new BufferedReader(new FileReader(pathToTopicAssignmentFile));\n            int docId = 0;\n            int numWords = 0;\n            for (String line; (line = br.readLine()) != null;) {\n                String[] strTopics = line.trim().split(\"\\\\s+\");\n                List<Integer> topics = new ArrayList<Integer>();\n                int topic = new Integer(strTopics[0]) % numTopics;\n                docTopicCount[topic] += 1;\n                for (int j = 0; j < strTopics.length; j++) {\n                    int wordId = corpus.get(docId).get(j);\n                    int subtopic = new Integer(strTopics[j]);\n                    if (subtopic == topic) {\n                        topicWordCountLF[topic][wordId] += 1;\n               
         sumTopicWordCountLF[topic] += 1;\n                    }\n                    else {\n                        topicWordCountDMM[topic][wordId] += 1;\n                        sumTopicWordCountDMM[topic] += 1;\n                    }\n                    topics.add(subtopic);\n                    numWords++;\n                }\n                topicAssignments.add(topics);\n                docId++;\n            }\n\n            if ((docId != numDocuments) || (numWords != numWordsInCorpus)) {\n                System.out\n                        .println(\"The topic modeling corpus and topic assignment file are not consistent!!!\");\n                throw new Exception();\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n    }\n\n    public void inference()\n        throws IOException\n    {\n        System.out.println(\"Running Gibbs sampling inference: \");\n\n        for (int iter = 1; iter <= numInitIterations; iter++) {\n\n            System.out.println(\"\\tInitial sampling iteration: \" + (iter));\n\n            sampleSingleInitialIteration();\n        }\n\n        for (int iter = 1; iter <= numIterations; iter++) {\n\n            System.out.println(\"\\tLFDMM sampling iteration: \" + (iter));\n\n            optimizeTopicVectors();\n\n            sampleSingleIteration();\n\n            if ((savestep > 0) && (iter % savestep == 0) && (iter < numIterations)) {\n                System.out.println(\"\\t\\tSaving the output from the \" + iter + \"^{th} sample\");\n                expName = orgExpName + \"-\" + iter;\n                write();\n            }\n        }\n        expName = orgExpName;\n\n        writeParameters();\n        System.out.println(\"Writing output from the last sample ...\");\n        write();\n\n        System.out.println(\"Sampling completed!\");\n    }\n\n    public void optimizeTopicVectors()\n    {\n        System.out.println(\"\\t\\tEstimating topic vectors ...\");\n        
sumExpValues = new double[numTopics];\n        dotProductValues = new double[numTopics][vocabularySize];\n        expDotProductValues = new double[numTopics][vocabularySize];\n\n        Parallel.loop(numTopics, new Parallel.LoopInt()\n        {\n            @Override\n            public void compute(int topic)\n            {\n                int rate = 1;\n                boolean check = true;\n                while (check) {\n                    double l2Value = l2Regularizer * rate;\n                    try {\n                        TopicVectorOptimizer optimizer = new TopicVectorOptimizer(\n                                topicVectors[topic], topicWordCountLF[topic], wordVectors, l2Value);\n\n                        Optimizer gd = new LBFGS(optimizer, tolerance);\n                        gd.optimize(600);\n                        optimizer.getParameters(topicVectors[topic]);\n                        sumExpValues[topic] = optimizer.computePartitionFunction(\n                                dotProductValues[topic], expDotProductValues[topic]);\n                        check = false;\n\n                        if (sumExpValues[topic] == 0 || Double.isInfinite(sumExpValues[topic])) {\n                            double max = -1000000000.0;\n                            for (int index = 0; index < vocabularySize; index++) {\n                                if (dotProductValues[topic][index] > max)\n                                    max = dotProductValues[topic][index];\n                            }\n                            // Recompute the partition function in a numerically stable way;\n                            // reset the accumulator first, otherwise an infinite value\n                            // would propagate through the sum below\n                            sumExpValues[topic] = 0.0;\n                            for (int index = 0; index < vocabularySize; index++) {\n                                expDotProductValues[topic][index] = Math\n                                        .exp(dotProductValues[topic][index] - max);\n                                sumExpValues[topic] += expDotProductValues[topic][index];\n                            }\n                        }\n                    }\n                    catch (InvalidOptimizableException e) 
{\n                        e.printStackTrace();\n                        check = true;\n                    }\n                    rate = rate * 10;\n                }\n            }\n        });\n    }\n\n    public void sampleSingleIteration()\n    {\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            List<Integer> document = corpus.get(dIndex);\n            int docSize = document.size();\n            int topic = topicAssignments.get(dIndex).get(0) % numTopics;\n\n            docTopicCount[topic] = docTopicCount[topic] - 1;\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                int word = document.get(wIndex);// wordId\n                int subtopic = topicAssignments.get(dIndex).get(wIndex);\n                if (subtopic == topic) {\n                    topicWordCountLF[topic][word] -= 1;\n                    sumTopicWordCountLF[topic] -= 1;\n                }\n                else {\n                    topicWordCountDMM[topic][word] -= 1;\n                    sumTopicWordCountDMM[topic] -= 1;\n                }\n            }\n\n            // Sample a topic\n            for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n                multiPros[tIndex] = (docTopicCount[tIndex] + alpha);\n                for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                    int word = document.get(wIndex);\n                    multiPros[tIndex] *= (lambda * expDotProductValues[tIndex][word]\n                            / sumExpValues[tIndex] + (1 - lambda)\n                            * (topicWordCountDMM[tIndex][word] + beta)\n                            / (sumTopicWordCountDMM[tIndex] + betaSum));\n                }\n            }\n            topic = FuncUtils.nextDiscrete(multiPros);\n\n            docTopicCount[topic] += 1;\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                int word = document.get(wIndex);\n                int subtopic = topic;\n                if 
(lambda * expDotProductValues[topic][word] / sumExpValues[topic] > (1 - lambda)\n                        * (topicWordCountDMM[topic][word] + beta)\n                        / (sumTopicWordCountDMM[topic] + betaSum)) {\n                    topicWordCountLF[topic][word] += 1;\n                    sumTopicWordCountLF[topic] += 1;\n                }\n                else {\n                    topicWordCountDMM[topic][word] += 1;\n                    sumTopicWordCountDMM[topic] += 1;\n                    subtopic += numTopics;\n                }\n                // Update topic assignments\n                topicAssignments.get(dIndex).set(wIndex, subtopic);\n            }\n        }\n    }\n\n    public void sampleSingleInitialIteration()\n    {\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            List<Integer> document = corpus.get(dIndex);\n            int docSize = document.size();\n            int topic = topicAssignments.get(dIndex).get(0) % numTopics;\n\n            docTopicCount[topic] = docTopicCount[topic] - 1;\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                int word = document.get(wIndex);\n                int subtopic = topicAssignments.get(dIndex).get(wIndex);\n                if (topic == subtopic) {\n                    topicWordCountLF[topic][word] -= 1;\n                    sumTopicWordCountLF[topic] -= 1;\n                }\n                else {\n                    topicWordCountDMM[topic][word] -= 1;\n                    sumTopicWordCountDMM[topic] -= 1;\n                }\n            }\n\n            // Sample a topic\n            for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n                multiPros[tIndex] = (docTopicCount[tIndex] + alpha);\n                for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                    int word = document.get(wIndex);\n                    multiPros[tIndex] *= (lambda * (topicWordCountLF[tIndex][word] + beta)\n                            
/ (sumTopicWordCountLF[tIndex] + betaSum) + (1 - lambda)\n                            * (topicWordCountDMM[tIndex][word] + beta)\n                            / (sumTopicWordCountDMM[tIndex] + betaSum));\n                }\n            }\n            topic = FuncUtils.nextDiscrete(multiPros);\n\n            docTopicCount[topic] += 1;\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                int word = document.get(wIndex);// wordID\n                int subtopic = topic;\n                if (lambda * (topicWordCountLF[topic][word] + beta)\n                        / (sumTopicWordCountLF[topic] + betaSum) > (1 - lambda)\n                        * (topicWordCountDMM[topic][word] + beta)\n                        / (sumTopicWordCountDMM[topic] + betaSum)) {\n                    topicWordCountLF[topic][word] += 1;\n                    sumTopicWordCountLF[topic] += 1;\n                }\n                else {\n                    topicWordCountDMM[topic][word] += 1;\n                    sumTopicWordCountDMM[topic] += 1;\n                    subtopic += numTopics;\n                }\n                // Update topic assignments\n                topicAssignments.get(dIndex).set(wIndex, subtopic);\n            }\n        }\n    }\n\n    public void writeParameters()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName + \".paras\"));\n        writer.write(\"-model\" + \"\\t\" + \"LFDMM\");\n        writer.write(\"\\n-corpus\" + \"\\t\" + corpusPath);\n        writer.write(\"\\n-vectors\" + \"\\t\" + vectorFilePath);\n        writer.write(\"\\n-ntopics\" + \"\\t\" + numTopics);\n        writer.write(\"\\n-alpha\" + \"\\t\" + alpha);\n        writer.write(\"\\n-beta\" + \"\\t\" + beta);\n        writer.write(\"\\n-lambda\" + \"\\t\" + lambda);\n        writer.write(\"\\n-initers\" + \"\\t\" + numInitIterations);\n        writer.write(\"\\n-niters\" + \"\\t\" + numIterations);\n        
writer.write(\"\\n-twords\" + \"\\t\" + topWords);\n        writer.write(\"\\n-name\" + \"\\t\" + expName);\n        if (tAssignsFilePath.length() > 0)\n            writer.write(\"\\n-initFile\" + \"\\t\" + tAssignsFilePath);\n        if (savestep > 0)\n            writer.write(\"\\n-sstep\" + \"\\t\" + savestep);\n\n        writer.close();\n    }\n\n    public void writeDictionary()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".vocabulary\"));\n        for (String word : word2IdVocabulary.keySet()) {\n            writer.write(word + \" \" + word2IdVocabulary.get(word) + \"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeIDbasedCorpus()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".IDcorpus\"));\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            int docSize = corpus.get(dIndex).size();\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                writer.write(corpus.get(dIndex).get(wIndex) + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeTopicAssignments()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".topicAssignments\"));\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            int docSize = corpus.get(dIndex).size();\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                writer.write(topicAssignments.get(dIndex).get(wIndex) + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeTopicVectors()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + 
expName\n                + \".topicVectors\"));\n        for (int i = 0; i < numTopics; i++) {\n            for (int j = 0; j < vectorSize; j++)\n                writer.write(topicVectors[i][j] + \" \");\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeTopTopicalWords()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".topWords\"));\n\n        for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n            writer.write(\"Topic\" + new Integer(tIndex) + \":\");\n\n            Map<Integer, Double> topicWordProbs = new TreeMap<Integer, Double>();\n            for (int wIndex = 0; wIndex < vocabularySize; wIndex++) {\n\n                double pro = lambda * expDotProductValues[tIndex][wIndex] / sumExpValues[tIndex]\n                        + (1 - lambda) * (topicWordCountDMM[tIndex][wIndex] + beta)\n                        / (sumTopicWordCountDMM[tIndex] + betaSum);\n\n                topicWordProbs.put(wIndex, pro);\n            }\n            topicWordProbs = FuncUtils.sortByValueDescending(topicWordProbs);\n\n            Set<Integer> mostLikelyWords = topicWordProbs.keySet();\n            int count = 0;\n            for (Integer index : mostLikelyWords) {\n                if (count < topWords) {\n                    writer.write(\" \" + id2WordVocabulary.get(index));\n                    count += 1;\n                }\n                else {\n                    writer.write(\"\\n\\n\");\n                    break;\n                }\n            }\n        }\n        writer.close();\n    }\n\n    public void writeTopicWordPros()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName + \".phi\"));\n        for (int t = 0; t < numTopics; t++) {\n            for (int w = 0; w < vocabularySize; w++) {\n                double pro = lambda * 
expDotProductValues[t][w] / sumExpValues[t] + (1 - lambda)\n                        * (topicWordCountDMM[t][w] + beta) / (sumTopicWordCountDMM[t] + betaSum);\n                writer.write(pro + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeDocTopicPros()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName + \".theta\"));\n\n        for (int i = 0; i < numDocuments; i++) {\n            int docSize = corpus.get(i).size();\n            double sum = 0.0;\n            for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n                multiPros[tIndex] = (docTopicCount[tIndex] + alpha);\n                for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                    int word = corpus.get(i).get(wIndex);\n                    multiPros[tIndex] *= (lambda * expDotProductValues[tIndex][word]\n                            / sumExpValues[tIndex] + (1 - lambda)\n                            * (topicWordCountDMM[tIndex][word] + beta)\n                            / (sumTopicWordCountDMM[tIndex] + betaSum));\n                }\n                sum += multiPros[tIndex];\n            }\n            for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n                writer.write((multiPros[tIndex] / sum) + \" \");\n            }\n            writer.write(\"\\n\");\n\n        }\n        writer.close();\n    }\n\n    public void write()\n        throws IOException\n    {\n        writeTopTopicalWords();\n        writeDocTopicPros();\n        writeTopicAssignments();\n        writeTopicWordPros();\n    }\n\n    public static void main(String args[])\n        throws Exception\n    {\n        LFDMM lfdmm = new LFDMM(\"test/corpus.txt\", \"test/wordVectors.txt\", 4, 0.1, 0.01, 0.6, 2000,\n                200, 20, \"testLFDMM\");\n        lfdmm.writeParameters();\n        lfdmm.inference();\n    }\n}\n"
  },
  {
    "path": "src/models/LFDMM_Inf.java",
    "content": "package models;\n\nimport java.io.BufferedReader;\nimport java.io.BufferedWriter;\nimport java.io.FileReader;\nimport java.io.FileWriter;\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.Set;\nimport java.util.TreeMap;\n\nimport utility.FuncUtils;\nimport utility.LBFGS;\nimport utility.Parallel;\nimport cc.mallet.optimize.InvalidOptimizableException;\nimport cc.mallet.optimize.Optimizer;\nimport cc.mallet.types.MatrixOps;\nimport cc.mallet.util.Randoms;\n\n/**\n * Implementation of the LF-DMM latent feature topic model, using collapsed\n * Gibbs sampling, as described in:\n * \n * Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015.\n * Improving Topic Models with Latent Feature Word Representations. Transactions\n * of the Association for Computational Linguistics, vol. 3, pp. 299-313.\n * \n * Inference of topic distribution on unseen corpus\n * \n * @author Dat Quoc Nguyen\n */\n\npublic class LFDMM_Inf\n{\n\tpublic double alpha; // Hyper-parameter alpha\n\tpublic double beta; // Hyper-parameter alpha\n\t// public double alphaSum; // alpha * numTopics\n\tpublic double betaSum; // beta * vocabularySize\n\n\tpublic int numTopics; // Number of topics\n\tpublic int topWords; // Number of most probable words for each topic\n\n\tpublic double lambda; // Mixture weight value\n\tpublic int numInitIterations;\n\tpublic int numIterations; // Number of EM-style sampling iterations\n\n\tpublic List<List<Integer>> corpus; // Word ID-based corpus\n\tpublic List<List<Integer>> topicAssignments; // Topics assignments for words\n\t\t\t\t\t\t\t\t\t\t\t\t\t// in the corpus\n\tpublic int numDocuments; // Number of documents in the corpus\n\tpublic int numWordsInCorpus; // Number of words in the corpus\n\n\tpublic HashMap<String, Integer> word2IdVocabulary; // Vocabulary to get ID\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t// given a word\n\tpublic HashMap<Integer, 
String> id2WordVocabulary; // Vocabulary to get word\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t// given an ID\n\tpublic int vocabularySize; // The number of word types in the corpus\n\n\t// Number of documents assigned to a topic\n\tpublic int[] docTopicCount;\n\t// numTopics * vocabularySize matrix\n\t// Given a topic: number of times a word type generated from the topic by\n\t// the Dirichlet multinomial component\n\tpublic int[][] topicWordCountDMM;\n\t// Total number of words generated from each topic by the Dirichlet\n\t// multinomial component\n\tpublic int[] sumTopicWordCountDMM;\n\t// numTopics * vocabularySize matrix\n\t// Given a topic: number of times a word type generated from the topic by\n\t// the latent feature component\n\tpublic int[][] topicWordCountLF;\n\t// Total number of words generated from each topic by the latent feature\n\t// component\n\tpublic int[] sumTopicWordCountLF;\n\n\t// Double array used to sample a topic\n\tpublic double[] multiPros;\n\t// Path to the directory containing the corpus\n\tpublic String folderPath;\n\t// Path to the topic modeling corpus\n\tpublic String corpusPath;\n\tpublic String vectorFilePath;\n\n\tpublic double[][] wordVectors; // Vector representations for words\n\tpublic double[][] topicVectors;// Vector representations for topics\n\tpublic int vectorSize; // Number of vector dimensions\n\tpublic double[][] dotProductValues;\n\tpublic double[][] expDotProductValues;\n\tpublic double[] sumExpValues; // Partition function values\n\n\tpublic final double l2Regularizer = 0.01; // L2 regularizer value for\n\t\t\t\t\t\t\t\t\t\t\t\t// learning topic vectors\n\tpublic final double tolerance = 0.05; // Tolerance value for LBFGS\n\t\t\t\t\t\t\t\t\t\t\t// convergence\n\n\tpublic String expName = \"LFDMMinf\";\n\tpublic String orgExpName = \"LFDMMinf\";\n\tpublic int savestep = 0;\n\n\tpublic LFDMM_Inf(String pathToTrainingParasFile, String pathToUnseenCorpus,\n\t\tint inNumInitIterations, int inNumIterations, int 
inTopWords,\n\t\tString inExpName, int inSaveStep)\n\t\tthrows Exception\n\t{\n\t\tHashMap<String, String> paras = parseTrainingParasFile(pathToTrainingParasFile);\n\t\tif (!paras.get(\"-model\").equals(\"LFDMM\")) {\n\t\t\tthrow new Exception(\"Wrong pre-trained model!!!\");\n\t\t}\n\n\t\talpha = new Double(paras.get(\"-alpha\"));\n\t\tbeta = new Double(paras.get(\"-beta\"));\n\t\tlambda = new Double(paras.get(\"-lambda\"));\n\t\tnumTopics = new Integer(paras.get(\"-ntopics\"));\n\t\tnumIterations = inNumIterations;\n\t\tnumInitIterations = inNumInitIterations;\n\t\ttopWords = inTopWords;\n\t\tsavestep = inSaveStep;\n\t\texpName = inExpName;\n\t\torgExpName = expName;\n\t\tvectorFilePath = paras.get(\"-vectors\");\n\n\t\tString trainingCorpus = paras.get(\"-corpus\");\n\t\tString trainingCorpusfolder = trainingCorpus.substring(\n\t\t\t0,\n\t\t\tMath.max(trainingCorpus.lastIndexOf(\"/\"),\n\t\t\t\ttrainingCorpus.lastIndexOf(\"\\\\\")) + 1);\n\t\tString topicAssignment4TrainFile = trainingCorpusfolder\n\t\t\t+ paras.get(\"-name\") + \".topicAssignments\";\n\n\t\tword2IdVocabulary = new HashMap<String, Integer>();\n\t\tid2WordVocabulary = new HashMap<Integer, String>();\n\t\tinitializeWordCount(trainingCorpus, topicAssignment4TrainFile);\n\n\t\tcorpusPath = pathToUnseenCorpus;\n\t\tfolderPath = pathToUnseenCorpus.substring(\n\t\t\t0,\n\t\t\tMath.max(pathToUnseenCorpus.lastIndexOf(\"/\"),\n\t\t\t\tpathToUnseenCorpus.lastIndexOf(\"\\\\\")) + 1);\n\n\t\tSystem.out.println(\"Reading unseen corpus: \" + pathToUnseenCorpus);\n\t\tcorpus = new ArrayList<List<Integer>>();\n\t\tnumDocuments = 0;\n\t\tnumWordsInCorpus = 0;\n\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToUnseenCorpus));\n\t\t\tfor (String doc; (doc = br.readLine()) != null;) {\n\n\t\t\t\tif (doc.trim().length() == 0)\n\t\t\t\t\tcontinue;\n\n\t\t\t\tString[] words = doc.trim().split(\"\\\\s+\");\n\t\t\t\tList<Integer> document = new 
ArrayList<Integer>();\n\n\t\t\t\tfor (String word : words) {\n\t\t\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\t\t\tdocument.add(word2IdVocabulary.get(word));\n\t\t\t\t\t}\n\t\t\t\t\telse {\n\t\t\t\t\t\t// Skip this unknown word\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tnumDocuments++;\n\t\t\t\tnumWordsInCorpus += document.size();\n\t\t\t\tcorpus.add(document);\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\n\t\tdocTopicCount = new int[numTopics];\n\t\tmultiPros = new double[numTopics];\n\t\tfor (int i = 0; i < numTopics; i++) {\n\t\t\tmultiPros[i] = 1.0 / numTopics;\n\t\t}\n\n\t\t// alphaSum = numTopics * alpha;\n\t\tbetaSum = vocabularySize * beta;\n\n\t\treadWordVectorsFile(vectorFilePath);\n\t\ttopicVectors = new double[numTopics][vectorSize];\n\t\tdotProductValues = new double[numTopics][vocabularySize];\n\t\texpDotProductValues = new double[numTopics][vocabularySize];\n\t\tsumExpValues = new double[numTopics];\n\n\t\tSystem.out.println(\"Corpus size: \" + numDocuments + \" docs, \"\n\t\t\t+ numWordsInCorpus + \" words\");\n\t\tSystem.out.println(\"Vocabulary size: \" + vocabularySize);\n\t\tSystem.out.println(\"Number of topics: \" + numTopics);\n\t\tSystem.out.println(\"alpha: \" + alpha);\n\t\tSystem.out.println(\"beta: \" + beta);\n\t\tSystem.out.println(\"lambda: \" + lambda);\n\t\tSystem.out.println(\"Number of initial sampling iterations: \"\n\t\t\t+ numInitIterations);\n\t\tSystem.out\n\t\t\t.println(\"Number of EM-style sampling iterations for the LF-DMM model: \"\n\t\t\t\t+ numIterations);\n\t\tSystem.out.println(\"Number of top topical words: \" + topWords);\n\n\t\tinitialize();\n\t}\n\n\tprivate HashMap<String, String> parseTrainingParasFile(\n\t\tString pathToTrainingParasFile)\n\t\tthrows Exception\n\t{\n\t\tHashMap<String, String> paras = new HashMap<String, String>();\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToTrainingParasFile));\n\t\t\tfor (String line; 
(line = br.readLine()) != null;) {\n\n\t\t\t\tif (line.trim().length() == 0)\n\t\t\t\t\tcontinue;\n\n\t\t\t\tString[] paraOptions = line.trim().split(\"\\\\s+\");\n\t\t\t\tparas.put(paraOptions[0], paraOptions[1]);\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\t\treturn paras;\n\t}\n\n\tprivate void initializeWordCount(String pathToTrainingCorpus,\n\t\tString pathToTopicAssignmentFile)\n\t{\n\t\tSystem.out.println(\"Loading pre-trained model...\");\n\t\tList<List<Integer>> trainCorpus = new ArrayList<List<Integer>>();\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tint indexWord = -1;\n\t\t\tbr = new BufferedReader(new FileReader(pathToTrainingCorpus));\n\t\t\tfor (String doc; (doc = br.readLine()) != null;) {\n\n\t\t\t\tif (doc.trim().length() == 0)\n\t\t\t\t\tcontinue;\n\n\t\t\t\tString[] words = doc.trim().split(\"\\\\s+\");\n\t\t\t\tList<Integer> document = new ArrayList<Integer>();\n\n\t\t\t\tfor (String word : words) {\n\t\t\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\t\t\tdocument.add(word2IdVocabulary.get(word));\n\t\t\t\t\t}\n\t\t\t\t\telse {\n\t\t\t\t\t\tindexWord += 1;\n\t\t\t\t\t\tword2IdVocabulary.put(word, indexWord);\n\t\t\t\t\t\tid2WordVocabulary.put(indexWord, word);\n\t\t\t\t\t\tdocument.add(indexWord);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\ttrainCorpus.add(document);\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\n\t\tvocabularySize = word2IdVocabulary.size();\n\t\ttopicWordCountDMM = new int[numTopics][vocabularySize];\n\t\tsumTopicWordCountDMM = new int[numTopics];\n\t\ttopicWordCountLF = new int[numTopics][vocabularySize];\n\t\tsumTopicWordCountLF = new int[numTopics];\n\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToTopicAssignmentFile));\n\t\t\tint docId = 0;\n\t\t\tfor (String line; (line = br.readLine()) != null;) {\n\t\t\t\tString[] strTopics = line.trim().split(\"\\\\s+\");\n\t\t\t\tint topic = new Integer(strTopics[0]) % numTopics;\n\t\t\t\tfor 
(int j = 0; j < strTopics.length; j++) {\n\t\t\t\t\tint wordId = trainCorpus.get(docId).get(j);\n\t\t\t\t\tint subtopic = new Integer(strTopics[j]);\n\t\t\t\t\tif (subtopic == topic) {\n\t\t\t\t\t\ttopicWordCountLF[topic][wordId] += 1;\n\t\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t\t}\n\t\t\t\t\telse {\n\t\t\t\t\t\ttopicWordCountDMM[topic][wordId] += 1;\n\t\t\t\t\t\tsumTopicWordCountDMM[topic] += 1;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tdocId++;\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\t}\n\n\tpublic void readWordVectorsFile(String pathToWordVectorsFile)\n\t\tthrows Exception\n\t{\n\t\tSystem.out.println(\"Reading word vectors from word-vectors file \"\n\t\t\t+ pathToWordVectorsFile + \"...\");\n\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToWordVectorsFile));\n\t\t\tString[] elements = br.readLine().trim().split(\"\\\\s+\");\n\t\t\tvectorSize = elements.length - 1;\n\t\t\twordVectors = new double[vocabularySize][vectorSize];\n\t\t\tString word = elements[0];\n\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\tfor (int j = 0; j < vectorSize; j++) {\n\t\t\t\t\twordVectors[word2IdVocabulary.get(word)][j] = new Double(\n\t\t\t\t\t\telements[j + 1]);\n\t\t\t\t}\n\t\t\t}\n\t\t\tfor (String line; (line = br.readLine()) != null;) {\n\t\t\t\telements = line.trim().split(\"\\\\s+\");\n\t\t\t\tword = elements[0];\n\t\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\t\tfor (int j = 0; j < vectorSize; j++) {\n\t\t\t\t\t\twordVectors[word2IdVocabulary.get(word)][j] = new Double(\n\t\t\t\t\t\t\telements[j + 1]);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\n\t\tfor (int i = 0; i < vocabularySize; i++) {\n\t\t\tif (MatrixOps.absNorm(wordVectors[i]) == 0.0) {\n\t\t\t\tSystem.out.println(\"The word \\\"\" + id2WordVocabulary.get(i)\n\t\t\t\t\t+ \"\\\" doesn't have a corresponding vector!!!\");\n\t\t\t\tthrow new 
Exception();\n\t\t\t}\n\t\t}\n\t}\n\n\tpublic void initialize()\n\t\tthrows IOException\n\t{\n\t\tSystem.out.println(\"Randomly initializing topic assignments ...\");\n\t\ttopicAssignments = new ArrayList<List<Integer>>();\n\n\t\tfor (int docId = 0; docId < numDocuments; docId++) {\n\t\t\tList<Integer> topics = new ArrayList<Integer>();\n\t\t\tint topic = FuncUtils.nextDiscrete(multiPros);\n\t\t\tdocTopicCount[topic] += 1;\n\t\t\tint docSize = corpus.get(docId).size();\n\t\t\tfor (int j = 0; j < docSize; j++) {\n\t\t\t\tint wordId = corpus.get(docId).get(j);\n\t\t\t\tboolean component = new Randoms().nextBoolean();\n\t\t\t\tint subtopic = topic;\n\t\t\t\tif (!component) { // Generated from the latent feature component\n\t\t\t\t\ttopicWordCountLF[topic][wordId] += 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t}\n\t\t\t\telse {// Generated from the Dirichlet multinomial component\n\t\t\t\t\ttopicWordCountDMM[topic][wordId] += 1;\n\t\t\t\t\tsumTopicWordCountDMM[topic] += 1;\n\t\t\t\t\tsubtopic = subtopic + numTopics;\n\t\t\t\t}\n\t\t\t\ttopics.add(subtopic);\n\t\t\t}\n\t\t\ttopicAssignments.add(topics);\n\t\t}\n\t}\n\n\tpublic void inference()\n\t\tthrows IOException\n\t{\n\t\tSystem.out.println(\"Running Gibbs sampling inference: \");\n\n\t\tfor (int iter = 1; iter <= numInitIterations; iter++) {\n\n\t\t\tSystem.out.println(\"\\tInitial sampling iteration: \" + (iter));\n\n\t\t\tsampleSingleInitialIteration();\n\t\t}\n\n\t\tfor (int iter = 1; iter <= numIterations; iter++) {\n\n\t\t\tSystem.out.println(\"\\tLFDMM sampling iteration: \" + (iter));\n\n\t\t\toptimizeTopicVectors();\n\n\t\t\tsampleSingleIteration();\n\n\t\t\tif ((savestep > 0) && (iter % savestep == 0)\n\t\t\t\t&& (iter < numIterations)) {\n\t\t\t\tSystem.out.println(\"\\t\\tSaving the output from the \" + iter\n\t\t\t\t\t+ \"^{th} sample\");\n\t\t\t\texpName = orgExpName + \"-\" + iter;\n\t\t\t\twrite();\n\t\t\t}\n\t\t}\n\t\texpName = 
orgExpName;\n\n\t\twriteParameters();\n\t\tSystem.out.println(\"Writing output from the last sample ...\");\n\t\twrite();\n\n\t\tSystem.out.println(\"Sampling completed!\");\n\t}\n\n\tpublic void optimizeTopicVectors()\n\t{\n\t\tSystem.out.println(\"\\t\\tEstimating topic vectors ...\");\n\t\tsumExpValues = new double[numTopics];\n\t\tdotProductValues = new double[numTopics][vocabularySize];\n\t\texpDotProductValues = new double[numTopics][vocabularySize];\n\n\t\tParallel.loop(numTopics, new Parallel.LoopInt()\n\t\t{\n\t\t\t@Override\n\t\t\tpublic void compute(int topic)\n\t\t\t{\n\t\t\t\tint rate = 1;\n\t\t\t\tboolean check = true;\n\t\t\t\twhile (check) {\n\t\t\t\t\tdouble l2Value = l2Regularizer * rate;\n\t\t\t\t\ttry {\n\t\t\t\t\t\tTopicVectorOptimizer optimizer = new TopicVectorOptimizer(\n\t\t\t\t\t\t\ttopicVectors[topic], topicWordCountLF[topic],\n\t\t\t\t\t\t\twordVectors, l2Value);\n\n\t\t\t\t\t\tOptimizer gd = new LBFGS(optimizer, tolerance);\n\t\t\t\t\t\tgd.optimize(600);\n\t\t\t\t\t\toptimizer.getParameters(topicVectors[topic]);\n\t\t\t\t\t\tsumExpValues[topic] = optimizer\n\t\t\t\t\t\t\t.computePartitionFunction(dotProductValues[topic],\n\t\t\t\t\t\t\t\texpDotProductValues[topic]);\n\t\t\t\t\t\tcheck = false;\n\n\t\t\t\t\t\tif (sumExpValues[topic] == 0\n\t\t\t\t\t\t\t|| Double.isInfinite(sumExpValues[topic])) {\n\t\t\t\t\t\t\tdouble max = -1000000000.0;\n\t\t\t\t\t\t\tfor (int index = 0; index < vocabularySize; index++) {\n\t\t\t\t\t\t\t\tif (dotProductValues[topic][index] > max)\n\t\t\t\t\t\t\t\t\tmax = dotProductValues[topic][index];\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tfor (int index = 0; index < vocabularySize; index++) {\n\t\t\t\t\t\t\t\texpDotProductValues[topic][index] = Math\n\t\t\t\t\t\t\t\t\t.exp(dotProductValues[topic][index] - max);\n\t\t\t\t\t\t\t\tsumExpValues[topic] += expDotProductValues[topic][index];\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tcatch (InvalidOptimizableException e) 
{\n\t\t\t\t\t\te.printStackTrace();\n\t\t\t\t\t\tcheck = true;\n\t\t\t\t\t}\n\t\t\t\t\trate = rate * 10;\n\t\t\t\t}\n\t\t\t}\n\t\t});\n\t}\n\n\tpublic void sampleSingleIteration()\n\t{\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tList<Integer> document = corpus.get(dIndex);\n\t\t\tint docSize = document.size();\n\t\t\tint topic = topicAssignments.get(dIndex).get(0) % numTopics;\n\n\t\t\tdocTopicCount[topic] = docTopicCount[topic] - 1;\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\tint word = document.get(wIndex);// wordId\n\t\t\t\tint subtopic = topicAssignments.get(dIndex).get(wIndex);\n\t\t\t\tif (subtopic == topic) {\n\t\t\t\t\ttopicWordCountLF[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] -= 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\ttopicWordCountDMM[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountDMM[topic] -= 1;\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// Sample a topic\n\t\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\t\t\t\tmultiPros[tIndex] = (docTopicCount[tIndex] + alpha);\n\t\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\t\tint word = document.get(wIndex);\n\t\t\t\t\tmultiPros[tIndex] *= (lambda\n\t\t\t\t\t\t* expDotProductValues[tIndex][word]\n\t\t\t\t\t\t/ sumExpValues[tIndex] + (1 - lambda)\n\t\t\t\t\t\t* (topicWordCountDMM[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountDMM[tIndex] + betaSum));\n\t\t\t\t}\n\t\t\t}\n\t\t\ttopic = FuncUtils.nextDiscrete(multiPros);\n\n\t\t\tdocTopicCount[topic] += 1;\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\tint word = document.get(wIndex);\n\t\t\t\tint subtopic = topic;\n\t\t\t\tif (lambda * expDotProductValues[topic][word]\n\t\t\t\t\t/ sumExpValues[topic] > (1 - lambda)\n\t\t\t\t\t* (topicWordCountDMM[topic][word] + beta)\n\t\t\t\t\t/ (sumTopicWordCountDMM[topic] + betaSum)) {\n\t\t\t\t\ttopicWordCountLF[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t}\n\t\t\t\telse 
{\n\t\t\t\t\ttopicWordCountDMM[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountDMM[topic] += 1;\n\t\t\t\t\tsubtopic += numTopics;\n\t\t\t\t}\n\t\t\t\t// Update topic assignments\n\t\t\t\ttopicAssignments.get(dIndex).set(wIndex, subtopic);\n\t\t\t}\n\t\t}\n\t}\n\n\tpublic void sampleSingleInitialIteration()\n\t{\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tList<Integer> document = corpus.get(dIndex);\n\t\t\tint docSize = document.size();\n\t\t\tint topic = topicAssignments.get(dIndex).get(0) % numTopics;\n\n\t\t\tdocTopicCount[topic] = docTopicCount[topic] - 1;\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\tint word = document.get(wIndex);\n\t\t\t\tint subtopic = topicAssignments.get(dIndex).get(wIndex);\n\t\t\t\tif (topic == subtopic) {\n\t\t\t\t\ttopicWordCountLF[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] -= 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\ttopicWordCountDMM[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountDMM[topic] -= 1;\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// Sample a topic\n\t\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\t\t\t\tmultiPros[tIndex] = (docTopicCount[tIndex] + alpha);\n\t\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\t\tint word = document.get(wIndex);\n\t\t\t\t\tmultiPros[tIndex] *= (lambda\n\t\t\t\t\t\t* (topicWordCountLF[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountLF[tIndex] + betaSum) + (1 - lambda)\n\t\t\t\t\t\t* (topicWordCountDMM[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountDMM[tIndex] + betaSum));\n\t\t\t\t}\n\t\t\t}\n\t\t\ttopic = FuncUtils.nextDiscrete(multiPros);\n\n\t\t\tdocTopicCount[topic] += 1;\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\tint word = document.get(wIndex);// wordID\n\t\t\t\tint subtopic = topic;\n\t\t\t\tif (lambda * (topicWordCountLF[topic][word] + beta)\n\t\t\t\t\t/ (sumTopicWordCountLF[topic] + betaSum) > (1 - lambda)\n\t\t\t\t\t* (topicWordCountDMM[topic][word] + beta)\n\t\t\t\t\t/ 
(sumTopicWordCountDMM[topic] + betaSum)) {\n\t\t\t\t\ttopicWordCountLF[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\ttopicWordCountDMM[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountDMM[topic] += 1;\n\t\t\t\t\tsubtopic += numTopics;\n\t\t\t\t}\n\t\t\t\t// Update topic assignments\n\t\t\t\ttopicAssignments.get(dIndex).set(wIndex, subtopic);\n\t\t\t}\n\t\t}\n\t}\n\n\tpublic void writeParameters()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".paras\"));\n\t\twriter.write(\"-model\" + \"\\t\" + \"LFDMM\");\n\t\twriter.write(\"\\n-corpus\" + \"\\t\" + corpusPath);\n\t\twriter.write(\"\\n-vectors\" + \"\\t\" + vectorFilePath);\n\t\twriter.write(\"\\n-ntopics\" + \"\\t\" + numTopics);\n\t\twriter.write(\"\\n-alpha\" + \"\\t\" + alpha);\n\t\twriter.write(\"\\n-beta\" + \"\\t\" + beta);\n\t\twriter.write(\"\\n-lambda\" + \"\\t\" + lambda);\n\t\twriter.write(\"\\n-initers\" + \"\\t\" + numInitIterations);\n\t\twriter.write(\"\\n-niters\" + \"\\t\" + numIterations);\n\t\twriter.write(\"\\n-twords\" + \"\\t\" + topWords);\n\t\twriter.write(\"\\n-name\" + \"\\t\" + expName);\n\t\tif (savestep > 0)\n\t\t\twriter.write(\"\\n-sstep\" + \"\\t\" + savestep);\n\n\t\twriter.close();\n\t}\n\n\tpublic void writeDictionary()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".vocabulary\"));\n\t\tfor (String word : word2IdVocabulary.keySet()) {\n\t\t\twriter.write(word + \" \" + word2IdVocabulary.get(word) + \"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeIDbasedCorpus()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".IDcorpus\"));\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tint docSize = corpus.get(dIndex).size();\n\t\t\tfor (int wIndex = 0; wIndex < docSize; 
wIndex++) {\n\t\t\t\twriter.write(corpus.get(dIndex).get(wIndex) + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopicAssignments()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".topicAssignments\"));\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tint docSize = corpus.get(dIndex).size();\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\twriter.write(topicAssignments.get(dIndex).get(wIndex) + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopicVectors()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".topicVectors\"));\n\t\tfor (int i = 0; i < numTopics; i++) {\n\t\t\tfor (int j = 0; j < vectorSize; j++)\n\t\t\t\twriter.write(topicVectors[i][j] + \" \");\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopTopicalWords()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".topWords\"));\n\n\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\t\t\twriter.write(\"Topic\" + new Integer(tIndex) + \":\");\n\n\t\t\tMap<Integer, Double> topicWordProbs = new TreeMap<Integer, Double>();\n\t\t\tfor (int wIndex = 0; wIndex < vocabularySize; wIndex++) {\n\n\t\t\t\tdouble pro = lambda * expDotProductValues[tIndex][wIndex]\n\t\t\t\t\t/ sumExpValues[tIndex] + (1 - lambda)\n\t\t\t\t\t* (topicWordCountDMM[tIndex][wIndex] + beta)\n\t\t\t\t\t/ (sumTopicWordCountDMM[tIndex] + betaSum);\n\n\t\t\t\ttopicWordProbs.put(wIndex, pro);\n\t\t\t}\n\t\t\ttopicWordProbs = FuncUtils.sortByValueDescending(topicWordProbs);\n\n\t\t\tSet<Integer> mostLikelyWords = topicWordProbs.keySet();\n\t\t\tint count = 0;\n\t\t\tfor (Integer index : mostLikelyWords) {\n\t\t\t\tif (count < topWords) 
{\n\t\t\t\t\twriter.write(\" \" + id2WordVocabulary.get(index));\n\t\t\t\t\tcount += 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\twriter.write(\"\\n\\n\");\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopicWordPros()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".phi\"));\n\t\tfor (int t = 0; t < numTopics; t++) {\n\t\t\tfor (int w = 0; w < vocabularySize; w++) {\n\t\t\t\tdouble pro = lambda * expDotProductValues[t][w]\n\t\t\t\t\t/ sumExpValues[t] + (1 - lambda)\n\t\t\t\t\t* (topicWordCountDMM[t][w] + beta)\n\t\t\t\t\t/ (sumTopicWordCountDMM[t] + betaSum);\n\t\t\t\twriter.write(pro + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeDocTopicPros()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".theta\"));\n\n\t\tfor (int i = 0; i < numDocuments; i++) {\n\t\t\tint docSize = corpus.get(i).size();\n\t\t\tdouble sum = 0.0;\n\t\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\t\t\t\tmultiPros[tIndex] = (docTopicCount[tIndex] + alpha);\n\t\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\t\tint word = corpus.get(i).get(wIndex);\n\t\t\t\t\tmultiPros[tIndex] *= (lambda\n\t\t\t\t\t\t* expDotProductValues[tIndex][word]\n\t\t\t\t\t\t/ sumExpValues[tIndex] + (1 - lambda)\n\t\t\t\t\t\t* (topicWordCountDMM[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountDMM[tIndex] + betaSum));\n\t\t\t\t}\n\t\t\t\tsum += multiPros[tIndex];\n\t\t\t}\n\t\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\t\t\t\twriter.write((multiPros[tIndex] / sum) + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void write()\n\t\tthrows IOException\n\t{\n\t\twriteTopTopicalWords();\n\t\twriteDocTopicPros();\n\t\twriteTopicAssignments();\n\t\twriteTopicWordPros();\n\t}\n\n\tpublic 
static void main(String args[])\n\t\tthrows Exception\n\t{\n\t\tLFDMM_Inf lfdmm = new LFDMM_Inf(\"test/testLFDMM.paras\",\n\t\t\t\"test/corpus_test.txt\", 2000, 200, 20, \"testLFDMMinf\", 0);\n\t\tlfdmm.inference();\n\t}\n}\n"
  },
  {
    "path": "src/models/LFLDA.java",
    "content": "package models;\n\nimport java.io.BufferedReader;\nimport java.io.BufferedWriter;\nimport java.io.FileReader;\nimport java.io.FileWriter;\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.Set;\nimport java.util.TreeMap;\n\nimport utility.FuncUtils;\nimport utility.LBFGS;\nimport utility.Parallel;\nimport cc.mallet.optimize.InvalidOptimizableException;\nimport cc.mallet.optimize.Optimizer;\nimport cc.mallet.types.MatrixOps;\n\n/**\n * Implementation of the LF-LDA latent feature topic model, using collapsed Gibbs sampling, as\n * described in:\n * \n * Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with\n * Latent Feature Word Representations. Transactions of the Association for Computational\n * Linguistics, vol. 3, pp. 299-313.\n * \n * @author Dat Quoc Nguyen\n */\n\npublic class LFLDA\n{\n    public double alpha; // Hyper-parameter alpha\n    public double beta; // Hyper-parameter beta\n    public double alphaSum; // alpha * numTopics\n    public double betaSum; // beta * vocabularySize\n\n    public int numTopics; // Number of topics\n    public int topWords; // Number of most probable words for each topic\n\n    public double lambda; // Mixture weight value\n    public int numInitIterations;\n    public int numIterations; // Number of EM-style sampling iterations\n\n    public List<List<Integer>> corpus; // Word ID-based corpus\n    public List<List<Integer>> topicAssignments; // Topic assignments for words\n                                                 // in the corpus\n    public int numDocuments; // Number of documents in the corpus\n    public int numWordsInCorpus; // Number of words in the corpus\n\n    public HashMap<String, Integer> word2IdVocabulary; // Vocabulary to get ID\n                                                       // given a word\n    public HashMap<Integer, String> 
id2WordVocabulary; // Vocabulary to get word\n                                                       // given an ID\n    public int vocabularySize; // The number of word types in the corpus\n\n    // numDocuments * numTopics matrix\n    // Given a document: number of its words assigned to each topic\n    public int[][] docTopicCount;\n    // Number of words in every document\n    public int[] sumDocTopicCount;\n    // numTopics * vocabularySize matrix\n    // Given a topic: number of times a word type generated from the topic by\n    // the Dirichlet multinomial component\n    public int[][] topicWordCountLDA;\n    // Total number of words generated from each topic by the Dirichlet\n    // multinomial component\n    public int[] sumTopicWordCountLDA;\n    // numTopics * vocabularySize matrix\n    // Given a topic: number of times a word type generated from the topic by\n    // the latent feature component\n    public int[][] topicWordCountLF;\n    // Total number of words generated from each topic by the latent feature\n    // component\n    public int[] sumTopicWordCountLF;\n\n    // Double array used to sample a topic\n    public double[] multiPros;\n    // Path to the directory containing the corpus\n    public String folderPath;\n    // Path to the topic modeling corpus\n    public String corpusPath;\n    public String vectorFilePath;\n\n    public double[][] wordVectors; // Vector representations for words\n    public double[][] topicVectors;// Vector representations for topics\n    public int vectorSize; // Number of vector dimensions\n    public double[][] dotProductValues;\n    public double[][] expDotProductValues;\n    public double[] sumExpValues; // Partition function values\n\n    public final double l2Regularizer = 0.01; // L2 regularizer value for learning topic vectors\n    public final double tolerance = 0.05; // Tolerance value for LBFGS convergence\n\n    public String expName = \"LFLDA\";\n    public String orgExpName = \"LFLDA\";\n    public 
String tAssignsFilePath = \"\";\n    public int savestep = 0;\n\n    public LFLDA(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, \"LFLDA\");\n    }\n\n    public LFLDA(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, inExpName, \"\", 0);\n    }\n\n    public LFLDA(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName, String pathToTAfile)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, inExpName, pathToTAfile, 0);\n    }\n\n    public LFLDA(String pathToCorpus, String pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName, int inSaveStep)\n        throws Exception\n    {\n        this(pathToCorpus, pathToWordVectorsFile, inNumTopics, inAlpha, inBeta, inLambda,\n                inNumInitIterations, inNumIterations, inTopWords, inExpName, \"\", inSaveStep);\n    }\n\n    public LFLDA(String pathToCorpus, String 
pathToWordVectorsFile, int inNumTopics,\n            double inAlpha, double inBeta, double inLambda, int inNumInitIterations,\n            int inNumIterations, int inTopWords, String inExpName, String pathToTAfile,\n            int inSaveStep)\n        throws Exception\n    {\n        alpha = inAlpha;\n        beta = inBeta;\n        lambda = inLambda;\n        numTopics = inNumTopics;\n        numIterations = inNumIterations;\n        numInitIterations = inNumInitIterations;\n        topWords = inTopWords;\n        savestep = inSaveStep;\n        expName = inExpName;\n        orgExpName = expName;\n        vectorFilePath = pathToWordVectorsFile;\n        corpusPath = pathToCorpus;\n        folderPath = pathToCorpus.substring(0,\n                Math.max(pathToCorpus.lastIndexOf(\"/\"), pathToCorpus.lastIndexOf(\"\\\\\")) + 1);\n\n        System.out.println(\"Reading topic modeling corpus: \" + pathToCorpus);\n\n        word2IdVocabulary = new HashMap<String, Integer>();\n        id2WordVocabulary = new HashMap<Integer, String>();\n        corpus = new ArrayList<List<Integer>>();\n        numDocuments = 0;\n        numWordsInCorpus = 0;\n\n        BufferedReader br = null;\n        try {\n            int indexWord = -1;\n            br = new BufferedReader(new FileReader(pathToCorpus));\n            for (String doc; (doc = br.readLine()) != null;) {\n\n                if (doc.trim().length() == 0)\n                    continue;\n\n                String[] words = doc.trim().split(\"\\\\s+\");\n                List<Integer> document = new ArrayList<Integer>();\n\n                for (String word : words) {\n                    if (word2IdVocabulary.containsKey(word)) {\n                        document.add(word2IdVocabulary.get(word));\n                    }\n                    else {\n                        indexWord += 1;\n                        word2IdVocabulary.put(word, indexWord);\n                        id2WordVocabulary.put(indexWord, word);\n            
            document.add(indexWord);\n                    }\n                }\n\n                numDocuments++;\n                numWordsInCorpus += document.size();\n                corpus.add(document);\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n\n        vocabularySize = word2IdVocabulary.size();\n        docTopicCount = new int[numDocuments][numTopics];\n        sumDocTopicCount = new int[numDocuments];\n        topicWordCountLDA = new int[numTopics][vocabularySize];\n        sumTopicWordCountLDA = new int[numTopics];\n        topicWordCountLF = new int[numTopics][vocabularySize];\n        sumTopicWordCountLF = new int[numTopics];\n\n        multiPros = new double[numTopics * 2];\n        for (int i = 0; i < numTopics * 2; i++) {\n            multiPros[i] = 1.0 / numTopics;\n        }\n\n        alphaSum = numTopics * alpha;\n        betaSum = vocabularySize * beta;\n\n        readWordVectorsFile(vectorFilePath);\n        topicVectors = new double[numTopics][vectorSize];\n        dotProductValues = new double[numTopics][vocabularySize];\n        expDotProductValues = new double[numTopics][vocabularySize];\n        sumExpValues = new double[numTopics];\n\n        System.out\n                .println(\"Corpus size: \" + numDocuments + \" docs, \" + numWordsInCorpus + \" words\");\n        System.out.println(\"Vocabulary size: \" + vocabularySize);\n        System.out.println(\"Number of topics: \" + numTopics);\n        System.out.println(\"alpha: \" + alpha);\n        System.out.println(\"beta: \" + beta);\n        System.out.println(\"lambda: \" + lambda);\n        System.out.println(\"Number of initial sampling iterations: \" + numInitIterations);\n        System.out.println(\"Number of EM-style sampling iterations for the LF-LDA model: \"\n                + numIterations);\n        System.out.println(\"Number of top topical words: \" + topWords);\n\n        tAssignsFilePath = pathToTAfile;\n        
if (tAssignsFilePath.length() > 0)\n            initialize(tAssignsFilePath);\n        else\n            initialize();\n\n    }\n\n    public void readWordVectorsFile(String pathToWordVectorsFile)\n        throws Exception\n    {\n        System.out.println(\"Reading word vectors from word-vectors file \" + pathToWordVectorsFile\n                + \"...\");\n\n        BufferedReader br = null;\n        try {\n            br = new BufferedReader(new FileReader(pathToWordVectorsFile));\n            String[] elements = br.readLine().trim().split(\"\\\\s+\");\n            vectorSize = elements.length - 1;\n            wordVectors = new double[vocabularySize][vectorSize];\n            String word = elements[0];\n            if (word2IdVocabulary.containsKey(word)) {\n                for (int j = 0; j < vectorSize; j++) {\n                    wordVectors[word2IdVocabulary.get(word)][j] = new Double(elements[j + 1]);\n                }\n            }\n            for (String line; (line = br.readLine()) != null;) {\n                elements = line.trim().split(\"\\\\s+\");\n                word = elements[0];\n                if (word2IdVocabulary.containsKey(word)) {\n                    for (int j = 0; j < vectorSize; j++) {\n                        wordVectors[word2IdVocabulary.get(word)][j] = new Double(elements[j + 1]);\n                    }\n                }\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n\n        for (int i = 0; i < vocabularySize; i++) {\n            if (MatrixOps.absNorm(wordVectors[i]) == 0.0) {\n                System.out.println(\"The word \\\"\" + id2WordVocabulary.get(i)\n                        + \"\\\" doesn't have a corresponding vector!!!\");\n                throw new Exception();\n            }\n        }\n    }\n\n    public void initialize()\n        throws IOException\n    {\n        System.out.println(\"Randomly initializing topic assignments ...\");\n        topicAssignments 
= new ArrayList<List<Integer>>();\n\n        for (int docId = 0; docId < numDocuments; docId++) {\n            List<Integer> topics = new ArrayList<Integer>();\n            int docSize = corpus.get(docId).size();\n            for (int j = 0; j < docSize; j++) {\n                int wordId = corpus.get(docId).get(j);\n\n                int subtopic = FuncUtils.nextDiscrete(multiPros);\n                int topic = subtopic % numTopics;\n                if (topic == subtopic) { // Generated from the latent feature component\n                    topicWordCountLF[topic][wordId] += 1;\n                    sumTopicWordCountLF[topic] += 1;\n                }\n                else {// Generated from the Dirichlet multinomial component\n                    topicWordCountLDA[topic][wordId] += 1;\n                    sumTopicWordCountLDA[topic] += 1;\n                }\n                docTopicCount[docId][topic] += 1;\n                sumDocTopicCount[docId] += 1;\n\n                topics.add(subtopic);\n            }\n            topicAssignments.add(topics);\n        }\n    }\n\n    public void initialize(String pathToTopicAssignmentFile)\n    {\n        System.out.println(\"Reading topic-assignment file: \" + pathToTopicAssignmentFile);\n\n        topicAssignments = new ArrayList<List<Integer>>();\n\n        BufferedReader br = null;\n        try {\n            br = new BufferedReader(new FileReader(pathToTopicAssignmentFile));\n            int docId = 0;\n            int numWords = 0;\n            for (String line; (line = br.readLine()) != null;) {\n                String[] strTopics = line.trim().split(\"\\\\s+\");\n                List<Integer> topics = new ArrayList<Integer>();\n                for (int j = 0; j < strTopics.length; j++) {\n                    int wordId = corpus.get(docId).get(j);\n\n                    int subtopic = new Integer(strTopics[j]);\n                    int topic = subtopic % numTopics;\n\n                    if (topic == subtopic) { // 
Generated from the latent feature component\n                        topicWordCountLF[topic][wordId] += 1;\n                        sumTopicWordCountLF[topic] += 1;\n                    }\n                    else {// Generated from the Dirichlet multinomial component\n                        topicWordCountLDA[topic][wordId] += 1;\n                        sumTopicWordCountLDA[topic] += 1;\n                    }\n                    docTopicCount[docId][topic] += 1;\n                    sumDocTopicCount[docId] += 1;\n\n                    topics.add(subtopic);\n                    numWords++;\n                }\n                topicAssignments.add(topics);\n                docId++;\n            }\n\n            if ((docId != numDocuments) || (numWords != numWordsInCorpus)) {\n                System.out\n                        .println(\"The topic modeling corpus and topic assignment file are not consistent!!!\");\n                throw new Exception();\n            }\n        }\n        catch (Exception e) {\n            e.printStackTrace();\n        }\n    }\n\n    public void inference()\n        throws IOException\n    {\n        System.out.println(\"Running Gibbs sampling inference: \");\n\n        for (int iter = 1; iter <= numInitIterations; iter++) {\n\n            System.out.println(\"\\tInitial sampling iteration: \" + (iter));\n\n            sampleSingleInitialIteration();\n        }\n\n        for (int iter = 1; iter <= numIterations; iter++) {\n\n            System.out.println(\"\\tLFLDA sampling iteration: \" + (iter));\n\n            optimizeTopicVectors();\n\n            sampleSingleIteration();\n\n            if ((savestep > 0) && (iter % savestep == 0) && (iter < numIterations)) {\n                System.out.println(\"\\t\\tSaving the output from the \" + iter + \"^{th} sample\");\n                expName = orgExpName + \"-\" + iter;\n                write();\n            }\n        }\n        expName = orgExpName;\n\n        writeParameters();\n  
      System.out.println(\"Writing output from the last sample ...\");\n        write();\n\n        System.out.println(\"Sampling completed!\");\n    }\n\n    public void optimizeTopicVectors()\n    {\n        System.out.println(\"\\t\\tEstimating topic vectors ...\");\n        sumExpValues = new double[numTopics];\n        dotProductValues = new double[numTopics][vocabularySize];\n        expDotProductValues = new double[numTopics][vocabularySize];\n\n        Parallel.loop(numTopics, new Parallel.LoopInt()\n        {\n            @Override\n            public void compute(int topic)\n            {\n                int rate = 1;\n                boolean check = true;\n                while (check) {\n                    double l2Value = l2Regularizer * rate;\n                    try {\n                        TopicVectorOptimizer optimizer = new TopicVectorOptimizer(\n                                topicVectors[topic], topicWordCountLF[topic], wordVectors, l2Value);\n\n                        Optimizer gd = new LBFGS(optimizer, tolerance);\n                        gd.optimize(600);\n                        optimizer.getParameters(topicVectors[topic]);\n                        sumExpValues[topic] = optimizer.computePartitionFunction(\n                                dotProductValues[topic], expDotProductValues[topic]);\n                        check = false;\n\n                        if (sumExpValues[topic] == 0 || Double.isInfinite(sumExpValues[topic])) {\n                            double max = -1000000000.0;\n                            for (int index = 0; index < vocabularySize; index++) {\n                                if (dotProductValues[topic][index] > max)\n                                    max = dotProductValues[topic][index];\n                            }\n                            for (int index = 0; index < vocabularySize; index++) {\n                                expDotProductValues[topic][index] = Math\n                                     
   .exp(dotProductValues[topic][index] - max);\n                                sumExpValues[topic] += expDotProductValues[topic][index];\n                            }\n                        }\n                    }\n                    catch (InvalidOptimizableException e) {\n                        e.printStackTrace();\n                        check = true;\n                    }\n                    rate = rate * 10;\n                }\n            }\n        });\n    }\n\n    public void sampleSingleIteration()\n    {\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            int docSize = corpus.get(dIndex).size();\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                // Get current word\n                int word = corpus.get(dIndex).get(wIndex);// wordID\n                int subtopic = topicAssignments.get(dIndex).get(wIndex);\n                int topic = subtopic % numTopics;\n\n                docTopicCount[dIndex][topic] -= 1;\n                if (subtopic == topic) {\n                    topicWordCountLF[topic][word] -= 1;\n                    sumTopicWordCountLF[topic] -= 1;\n                }\n                else {\n                    topicWordCountLDA[topic][word] -= 1;\n                    sumTopicWordCountLDA[topic] -= 1;\n                }\n\n                // Sample a pair of topic z and binary indicator variable s\n                for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\n                    multiPros[tIndex] = (docTopicCount[dIndex][tIndex] + alpha) * lambda\n                            * expDotProductValues[tIndex][word] / sumExpValues[tIndex];\n\n                    multiPros[tIndex + numTopics] = (docTopicCount[dIndex][tIndex] + alpha)\n                            * (1 - lambda) * (topicWordCountLDA[tIndex][word] + beta)\n                            / (sumTopicWordCountLDA[tIndex] + betaSum);\n\n                }\n                subtopic = FuncUtils.nextDiscrete(multiPros);\n   
             topic = subtopic % numTopics;\n\n                docTopicCount[dIndex][topic] += 1;\n                if (subtopic == topic) {\n                    topicWordCountLF[topic][word] += 1;\n                    sumTopicWordCountLF[topic] += 1;\n                }\n                else {\n                    topicWordCountLDA[topic][word] += 1;\n                    sumTopicWordCountLDA[topic] += 1;\n                }\n                // Update topic assignments\n                topicAssignments.get(dIndex).set(wIndex, subtopic);\n            }\n        }\n    }\n\n    public void sampleSingleInitialIteration()\n    {\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            int docSize = corpus.get(dIndex).size();\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                int word = corpus.get(dIndex).get(wIndex);// wordID\n                int subtopic = topicAssignments.get(dIndex).get(wIndex);\n                int topic = subtopic % numTopics;\n\n                docTopicCount[dIndex][topic] -= 1;\n                if (subtopic == topic) { // LF(w|t) + LDA(t|d)\n                    topicWordCountLF[topic][word] -= 1;\n                    sumTopicWordCountLF[topic] -= 1;\n                }\n                else { // LDA(w|t) + LDA(t|d)\n                    topicWordCountLDA[topic][word] -= 1;\n                    sumTopicWordCountLDA[topic] -= 1;\n                }\n\n                // Sample a pair of topic z and binary indicator variable s\n                for (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\n                    multiPros[tIndex] = (docTopicCount[dIndex][tIndex] + alpha) * lambda\n                            * (topicWordCountLF[tIndex][word] + beta)\n                            / (sumTopicWordCountLF[tIndex] + betaSum);\n\n                    multiPros[tIndex + numTopics] = (docTopicCount[dIndex][tIndex] + alpha)\n                            * (1 - lambda) * (topicWordCountLDA[tIndex][word] + 
beta)\n                            / (sumTopicWordCountLDA[tIndex] + betaSum);\n\n                }\n                subtopic = FuncUtils.nextDiscrete(multiPros);\n                topic = subtopic % numTopics;\n\n                docTopicCount[dIndex][topic] += 1;\n                if (topic == subtopic) {\n                    topicWordCountLF[topic][word] += 1;\n                    sumTopicWordCountLF[topic] += 1;\n                }\n                else {\n                    topicWordCountLDA[topic][word] += 1;\n                    sumTopicWordCountLDA[topic] += 1;\n                }\n                // Update topic assignments\n                topicAssignments.get(dIndex).set(wIndex, subtopic);\n            }\n\n        }\n    }\n\n    public void writeParameters()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName + \".paras\"));\n        writer.write(\"-model\" + \"\\t\" + \"LFLDA\");\n        writer.write(\"\\n-corpus\" + \"\\t\" + corpusPath);\n        writer.write(\"\\n-vectors\" + \"\\t\" + vectorFilePath);\n        writer.write(\"\\n-ntopics\" + \"\\t\" + numTopics);\n        writer.write(\"\\n-alpha\" + \"\\t\" + alpha);\n        writer.write(\"\\n-beta\" + \"\\t\" + beta);\n        writer.write(\"\\n-lambda\" + \"\\t\" + lambda);\n        writer.write(\"\\n-initers\" + \"\\t\" + numInitIterations);\n        writer.write(\"\\n-niters\" + \"\\t\" + numIterations);\n        writer.write(\"\\n-twords\" + \"\\t\" + topWords);\n        writer.write(\"\\n-name\" + \"\\t\" + expName);\n        if (tAssignsFilePath.length() > 0)\n            writer.write(\"\\n-initFile\" + \"\\t\" + tAssignsFilePath);\n        if (savestep > 0)\n            writer.write(\"\\n-sstep\" + \"\\t\" + savestep);\n\n        writer.close();\n    }\n\n    public void writeDictionary()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n            
    + \".vocabulary\"));\n        for (String word : word2IdVocabulary.keySet()) {\n            writer.write(word + \" \" + word2IdVocabulary.get(word) + \"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeIDbasedCorpus()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".IDcorpus\"));\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            int docSize = corpus.get(dIndex).size();\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                writer.write(corpus.get(dIndex).get(wIndex) + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeTopicAssignments()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".topicAssignments\"));\n        for (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n            int docSize = corpus.get(dIndex).size();\n            for (int wIndex = 0; wIndex < docSize; wIndex++) {\n                writer.write(topicAssignments.get(dIndex).get(wIndex) + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeTopicVectors()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".topicVectors\"));\n        for (int i = 0; i < numTopics; i++) {\n            for (int j = 0; j < vectorSize; j++)\n                writer.write(topicVectors[i][j] + \" \");\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeTopTopicalWords()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName\n                + \".topWords\"));\n\n        for (int tIndex = 0; tIndex 
< numTopics; tIndex++) {\n            writer.write(\"Topic\" + new Integer(tIndex) + \":\");\n\n            Map<Integer, Double> topicWordProbs = new TreeMap<Integer, Double>();\n            for (int wIndex = 0; wIndex < vocabularySize; wIndex++) {\n\n                double pro = lambda * expDotProductValues[tIndex][wIndex] / sumExpValues[tIndex]\n                        + (1 - lambda) * (topicWordCountLDA[tIndex][wIndex] + beta)\n                        / (sumTopicWordCountLDA[tIndex] + betaSum);\n\n                topicWordProbs.put(wIndex, pro);\n            }\n            topicWordProbs = FuncUtils.sortByValueDescending(topicWordProbs);\n\n            Set<Integer> mostLikelyWords = topicWordProbs.keySet();\n            int count = 0;\n            for (Integer index : mostLikelyWords) {\n                if (count < topWords) {\n                    writer.write(\" \" + id2WordVocabulary.get(index));\n                    count += 1;\n                }\n                else {\n                    writer.write(\"\\n\\n\");\n                    break;\n                }\n            }\n        }\n        writer.close();\n    }\n\n    public void writeTopicWordPros()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName + \".phi\"));\n        for (int t = 0; t < numTopics; t++) {\n            for (int w = 0; w < vocabularySize; w++) {\n                double pro = lambda * expDotProductValues[t][w] / sumExpValues[t] + (1 - lambda)\n                        * (topicWordCountLDA[t][w] + beta) / (sumTopicWordCountLDA[t] + betaSum);\n                writer.write(pro + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void writeDocTopicPros()\n        throws IOException\n    {\n        BufferedWriter writer = new BufferedWriter(new FileWriter(folderPath + expName + \".theta\"));\n\n        for (int i = 0; i < numDocuments; i++) {\n           
 for (int j = 0; j < numTopics; j++) {\n                double pro = (docTopicCount[i][j] + alpha) / (sumDocTopicCount[i] + alphaSum);\n                writer.write(pro + \" \");\n            }\n            writer.write(\"\\n\");\n        }\n        writer.close();\n    }\n\n    public void write()\n        throws IOException\n    {\n        writeTopTopicalWords();\n        writeDocTopicPros();\n        writeTopicAssignments();\n        writeTopicWordPros();\n    }\n\n    public static void main(String args[])\n        throws Exception\n    {\n        LFLDA lflda = new LFLDA(\"test/corpus.txt\", \"test/wordVectors.txt\", 4, 0.1, 0.01, 0.6, 2000,\n                200, 20, \"testLFLDA\");\n        lflda.writeParameters();\n        lflda.inference();\n    }\n}\n"
  },
  {
    "path": "src/models/LFLDA_Inf.java",
    "content": "package models;\n\nimport java.io.BufferedReader;\nimport java.io.BufferedWriter;\nimport java.io.FileReader;\nimport java.io.FileWriter;\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.Set;\nimport java.util.TreeMap;\n\nimport utility.FuncUtils;\nimport utility.LBFGS;\nimport utility.Parallel;\nimport cc.mallet.optimize.InvalidOptimizableException;\nimport cc.mallet.optimize.Optimizer;\nimport cc.mallet.types.MatrixOps;\n\n/**\n * Implementation of the LF-LDA latent feature topic model, using collapsed\n * Gibbs sampling, as described in:\n * \n * Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015.\n * Improving Topic Models with Latent Feature Word Representations. Transactions\n * of the Association for Computational Linguistics, vol. 3, pp. 299-313.\n * \n * Inference of topic distribution on unseen corpus\n * \n * @author Dat Quoc Nguyen\n */\n\npublic class LFLDA_Inf\n{\n\tpublic double alpha; // Hyper-parameter alpha\n\tpublic double beta; // Hyper-parameter beta\n\tpublic double alphaSum; // alpha * numTopics\n\tpublic double betaSum; // beta * vocabularySize\n\n\tpublic int numTopics; // Number of topics\n\tpublic int topWords; // Number of most probable words for each topic\n\n\tpublic double lambda; // Mixture weight value\n\tpublic int numInitIterations;\n\tpublic int numIterations; // Number of EM-style sampling iterations\n\n\tpublic List<List<Integer>> corpus; // Word ID-based corpus\n\tpublic List<List<Integer>> topicAssignments; // Topic assignments for words\n\t\t\t\t\t\t\t\t\t\t\t\t\t// in the corpus\n\tpublic int numDocuments; // Number of documents in the corpus\n\tpublic int numWordsInCorpus; // Number of words in the corpus\n\n\tpublic HashMap<String, Integer> word2IdVocabulary; // Vocabulary to get ID\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t// given a word\n\tpublic HashMap<Integer, String> id2WordVocabulary; // 
Vocabulary to get word\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t// given an ID\n\tpublic int vocabularySize; // The number of word types in the corpus\n\n\t// numDocuments * numTopics matrix\n\t// Given a document: number of its words assigned to each topic\n\tpublic int[][] docTopicCount;\n\t// Number of words in every document\n\tpublic int[] sumDocTopicCount;\n\t// numTopics * vocabularySize matrix\n\t// Given a topic: number of times a word type generated from the topic by\n\t// the Dirichlet multinomial component\n\tpublic int[][] topicWordCountLDA;\n\t// Total number of words generated from each topic by the Dirichlet\n\t// multinomial component\n\tpublic int[] sumTopicWordCountLDA;\n\t// numTopics * vocabularySize matrix\n\t// Given a topic: number of times a word type generated from the topic by\n\t// the latent feature component\n\tpublic int[][] topicWordCountLF;\n\t// Total number of words generated from each topic by the latent feature\n\t// component\n\tpublic int[] sumTopicWordCountLF;\n\n\t// Double array used to sample a topic\n\tpublic double[] multiPros;\n\t// Path to the directory containing the corpus\n\tpublic String folderPath;\n\t// Path to the topic modeling corpus\n\tpublic String corpusPath;\n\tpublic String vectorFilePath;\n\n\tpublic double[][] wordVectors; // Vector representations for words\n\tpublic double[][] topicVectors;// Vector representations for topics\n\tpublic int vectorSize; // Number of vector dimensions\n\tpublic double[][] dotProductValues;\n\tpublic double[][] expDotProductValues;\n\tpublic double[] sumExpValues; // Partition function values\n\n\tpublic final double l2Regularizer = 0.01; // L2 regularizer value for\n\t\t\t\t\t\t\t\t\t\t\t\t// learning topic vectors\n\tpublic final double tolerance = 0.05; // Tolerance value for LBFGS\n\t\t\t\t\t\t\t\t\t\t\t// convergence\n\n\tpublic String expName = \"LFLDAinf\";\n\tpublic String orgExpName = \"LFLDAinf\";\n\tpublic int savestep = 0;\n\n\tpublic LFLDA_Inf(String 
pathToTrainingParasFile, String pathToUnseenCorpus,\n\t\tint inNumInitIterations, int inNumIterations, int inTopWords,\n\t\tString inExpName, int inSaveStep)\n\t\tthrows Exception\n\t{\n\t\tHashMap<String, String> paras = parseTrainingParasFile(pathToTrainingParasFile);\n\t\tif (!paras.get(\"-model\").equals(\"LFLDA\")) {\n\t\t\tthrow new Exception(\"Wrong pre-trained model!!!\");\n\t\t}\n\n\t\talpha = new Double(paras.get(\"-alpha\"));\n\t\tbeta = new Double(paras.get(\"-beta\"));\n\t\tlambda = new Double(paras.get(\"-lambda\"));\n\t\tnumTopics = new Integer(paras.get(\"-ntopics\"));\n\t\tnumIterations = inNumIterations;\n\t\tnumInitIterations = inNumInitIterations;\n\t\ttopWords = inTopWords;\n\t\tsavestep = inSaveStep;\n\t\texpName = inExpName;\n\t\torgExpName = expName;\n\t\tvectorFilePath = paras.get(\"-vectors\");\n\n\t\tString trainingCorpus = paras.get(\"-corpus\");\n\t\tString trainingCorpusfolder = trainingCorpus.substring(\n\t\t\t0,\n\t\t\tMath.max(trainingCorpus.lastIndexOf(\"/\"),\n\t\t\t\ttrainingCorpus.lastIndexOf(\"\\\\\")) + 1);\n\t\tString topicAssignment4TrainFile = trainingCorpusfolder\n\t\t\t+ paras.get(\"-name\") + \".topicAssignments\";\n\n\t\tword2IdVocabulary = new HashMap<String, Integer>();\n\t\tid2WordVocabulary = new HashMap<Integer, String>();\n\t\tinitializeWordCount(trainingCorpus, topicAssignment4TrainFile);\n\n\t\tcorpusPath = pathToUnseenCorpus;\n\t\tfolderPath = pathToUnseenCorpus.substring(\n\t\t\t0,\n\t\t\tMath.max(pathToUnseenCorpus.lastIndexOf(\"/\"),\n\t\t\t\tpathToUnseenCorpus.lastIndexOf(\"\\\\\")) + 1);\n\n\t\tSystem.out.println(\"Reading unseen corpus: \" + pathToUnseenCorpus);\n\t\tcorpus = new ArrayList<List<Integer>>();\n\t\tnumDocuments = 0;\n\t\tnumWordsInCorpus = 0;\n\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToUnseenCorpus));\n\t\t\tfor (String doc; (doc = br.readLine()) != null;) {\n\n\t\t\t\tif (doc.trim().length() == 0)\n\t\t\t\t\tcontinue;\n\n\t\t\t\tString[] 
words = doc.trim().split(\"\\\\s+\");\n\t\t\t\tList<Integer> document = new ArrayList<Integer>();\n\n\t\t\t\tfor (String word : words) {\n\t\t\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\t\t\tdocument.add(word2IdVocabulary.get(word));\n\t\t\t\t\t}\n\t\t\t\t\telse {\n\t\t\t\t\t\t// Skip this unknown word\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tnumDocuments++;\n\t\t\t\tnumWordsInCorpus += document.size();\n\t\t\t\tcorpus.add(document);\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\n\t\tdocTopicCount = new int[numDocuments][numTopics];\n\t\tsumDocTopicCount = new int[numDocuments];\n\t\tmultiPros = new double[numTopics * 2];\n\t\tfor (int i = 0; i < numTopics * 2; i++) {\n\t\t\tmultiPros[i] = 1.0 / numTopics;\n\t\t}\n\n\t\talphaSum = numTopics * alpha;\n\t\tbetaSum = vocabularySize * beta;\n\n\t\treadWordVectorsFile(vectorFilePath);\n\t\ttopicVectors = new double[numTopics][vectorSize];\n\t\tdotProductValues = new double[numTopics][vocabularySize];\n\t\texpDotProductValues = new double[numTopics][vocabularySize];\n\t\tsumExpValues = new double[numTopics];\n\n\t\tSystem.out.println(\"Corpus size: \" + numDocuments + \" docs, \"\n\t\t\t+ numWordsInCorpus + \" words\");\n\t\tSystem.out.println(\"Vocabulary size: \" + vocabularySize);\n\t\tSystem.out.println(\"Number of topics: \" + numTopics);\n\t\tSystem.out.println(\"alpha: \" + alpha);\n\t\tSystem.out.println(\"beta: \" + beta);\n\t\tSystem.out.println(\"lambda: \" + lambda);\n\t\tSystem.out.println(\"Number of initial sampling iterations: \"\n\t\t\t+ numInitIterations);\n\t\tSystem.out\n\t\t\t.println(\"Number of EM-style sampling iterations for the LF-LDA model: \"\n\t\t\t\t+ numIterations);\n\t\tSystem.out.println(\"Number of top topical words: \" + topWords);\n\n\t\tinitialize();\n\t}\n\n\tprivate HashMap<String, String> parseTrainingParasFile(\n\t\tString pathToTrainingParasFile)\n\t\tthrows Exception\n\t{\n\t\tHashMap<String, String> paras = new HashMap<String, 
String>();\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToTrainingParasFile));\n\t\t\tfor (String line; (line = br.readLine()) != null;) {\n\n\t\t\t\tif (line.trim().length() == 0)\n\t\t\t\t\tcontinue;\n\n\t\t\t\tString[] paraOptions = line.trim().split(\"\\\\s+\");\n\t\t\t\tparas.put(paraOptions[0], paraOptions[1]);\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\t\treturn paras;\n\t}\n\n\tprivate void initializeWordCount(String pathToTrainingCorpus,\n\t\tString pathToTopicAssignmentFile)\n\t{\n\t\tSystem.out.println(\"Loading pre-trained model...\");\n\t\tList<List<Integer>> trainCorpus = new ArrayList<List<Integer>>();\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tint indexWord = -1;\n\t\t\tbr = new BufferedReader(new FileReader(pathToTrainingCorpus));\n\t\t\tfor (String doc; (doc = br.readLine()) != null;) {\n\n\t\t\t\tif (doc.trim().length() == 0)\n\t\t\t\t\tcontinue;\n\n\t\t\t\tString[] words = doc.trim().split(\"\\\\s+\");\n\t\t\t\tList<Integer> document = new ArrayList<Integer>();\n\n\t\t\t\tfor (String word : words) {\n\t\t\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\t\t\tdocument.add(word2IdVocabulary.get(word));\n\t\t\t\t\t}\n\t\t\t\t\telse {\n\t\t\t\t\t\tindexWord += 1;\n\t\t\t\t\t\tword2IdVocabulary.put(word, indexWord);\n\t\t\t\t\t\tid2WordVocabulary.put(indexWord, word);\n\t\t\t\t\t\tdocument.add(indexWord);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\ttrainCorpus.add(document);\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\n\t\tvocabularySize = word2IdVocabulary.size();\n\t\ttopicWordCountLDA = new int[numTopics][vocabularySize];\n\t\tsumTopicWordCountLDA = new int[numTopics];\n\t\ttopicWordCountLF = new int[numTopics][vocabularySize];\n\t\tsumTopicWordCountLF = new int[numTopics];\n\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToTopicAssignmentFile));\n\t\t\tint docId = 0;\n\t\t\tfor (String line; (line = 
br.readLine()) != null;) {\n\t\t\t\tString[] strTopics = line.trim().split(\"\\\\s+\");\n\t\t\t\tfor (int j = 0; j < strTopics.length; j++) {\n\t\t\t\t\tint wordId = trainCorpus.get(docId).get(j);\n\t\t\t\t\tint subtopic = new Integer(strTopics[j]);\n\t\t\t\t\tint topic = subtopic % numTopics;\n\t\t\t\t\tif (topic == subtopic) { // Generated from the latent\n\t\t\t\t\t\t\t\t\t\t\t\t// feature component\n\t\t\t\t\t\ttopicWordCountLF[topic][wordId] += 1;\n\t\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t\t}\n\t\t\t\t\telse {// Generated from the Dirichlet multinomial component\n\t\t\t\t\t\ttopicWordCountLDA[topic][wordId] += 1;\n\t\t\t\t\t\tsumTopicWordCountLDA[topic] += 1;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tdocId++;\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) {\n\t\t\te.printStackTrace();\n\t\t}\n\t}\n\n\tpublic void readWordVectorsFile(String pathToWordVectorsFile)\n\t\tthrows Exception\n\t{\n\t\tSystem.out.println(\"Reading word vectors from word-vectors file \"\n\t\t\t+ pathToWordVectorsFile + \"...\");\n\n\t\tBufferedReader br = null;\n\t\ttry {\n\t\t\tbr = new BufferedReader(new FileReader(pathToWordVectorsFile));\n\t\t\tString[] elements = br.readLine().trim().split(\"\\\\s+\");\n\t\t\tvectorSize = elements.length - 1;\n\t\t\twordVectors = new double[vocabularySize][vectorSize];\n\t\t\tString word = elements[0];\n\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\tfor (int j = 0; j < vectorSize; j++) {\n\t\t\t\t\twordVectors[word2IdVocabulary.get(word)][j] = new Double(\n\t\t\t\t\t\telements[j + 1]);\n\t\t\t\t}\n\t\t\t}\n\t\t\tfor (String line; (line = br.readLine()) != null;) {\n\t\t\t\telements = line.trim().split(\"\\\\s+\");\n\t\t\t\tword = elements[0];\n\t\t\t\tif (word2IdVocabulary.containsKey(word)) {\n\t\t\t\t\tfor (int j = 0; j < vectorSize; j++) {\n\t\t\t\t\t\twordVectors[word2IdVocabulary.get(word)][j] = new Double(\n\t\t\t\t\t\t\telements[j + 1]);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tcatch (Exception e) 
{\n\t\t\te.printStackTrace();\n\t\t}\n\n\t\tfor (int i = 0; i < vocabularySize; i++) {\n\t\t\tif (MatrixOps.absNorm(wordVectors[i]) == 0.0) {\n\t\t\t\tSystem.out.println(\"The word \\\"\" + id2WordVocabulary.get(i)\n\t\t\t\t\t+ \"\\\" doesn't have a corresponding vector!!!\");\n\t\t\t\tthrow new Exception();\n\t\t\t}\n\t\t}\n\t}\n\n\tpublic void initialize()\n\t\tthrows IOException\n\t{\n\t\tSystem.out.println(\"Randomly initializing topic assignments ...\");\n\t\ttopicAssignments = new ArrayList<List<Integer>>();\n\n\t\tfor (int docId = 0; docId < numDocuments; docId++) {\n\t\t\tList<Integer> topics = new ArrayList<Integer>();\n\t\t\tint docSize = corpus.get(docId).size();\n\t\t\tfor (int j = 0; j < docSize; j++) {\n\t\t\t\tint wordId = corpus.get(docId).get(j);\n\n\t\t\t\tint subtopic = FuncUtils.nextDiscrete(multiPros);\n\t\t\t\tint topic = subtopic % numTopics;\n\t\t\t\tif (topic == subtopic) { // Generated from the latent feature\n\t\t\t\t\t\t\t\t\t\t\t// component\n\t\t\t\t\ttopicWordCountLF[topic][wordId] += 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t}\n\t\t\t\telse {// Generated from the Dirichlet multinomial component\n\t\t\t\t\ttopicWordCountLDA[topic][wordId] += 1;\n\t\t\t\t\tsumTopicWordCountLDA[topic] += 1;\n\t\t\t\t}\n\t\t\t\tdocTopicCount[docId][topic] += 1;\n\t\t\t\tsumDocTopicCount[docId] += 1;\n\n\t\t\t\ttopics.add(subtopic);\n\t\t\t}\n\t\t\ttopicAssignments.add(topics);\n\t\t}\n\t}\n\n\tpublic void inference()\n\t\tthrows IOException\n\t{\n\t\tSystem.out.println(\"Running Gibbs sampling inference: \");\n\n\t\tfor (int iter = 1; iter <= numInitIterations; iter++) {\n\n\t\t\tSystem.out.println(\"\\tInitial sampling iteration: \" + (iter));\n\n\t\t\tsampleSingleInitialIteration();\n\t\t}\n\n\t\tfor (int iter = 1; iter <= numIterations; iter++) {\n\n\t\t\tSystem.out.println(\"\\tLFLDA sampling iteration: \" + (iter));\n\n\t\t\toptimizeTopicVectors();\n\n\t\t\tsampleSingleIteration();\n\n\t\t\tif ((savestep > 0) && (iter % savestep == 
0)\n\t\t\t\t&& (iter < numIterations)) {\n\t\t\t\tSystem.out.println(\"\\t\\tSaving the output from the \" + iter\n\t\t\t\t\t+ \"^{th} sample\");\n\t\t\t\texpName = orgExpName + \"-\" + iter;\n\t\t\t\twrite();\n\t\t\t}\n\t\t}\n\t\texpName = orgExpName;\n\n\t\twriteParameters();\n\t\tSystem.out.println(\"Writing output from the last sample ...\");\n\t\twrite();\n\n\t\tSystem.out.println(\"Sampling completed!\");\n\t}\n\n\tpublic void optimizeTopicVectors()\n\t{\n\t\tSystem.out.println(\"\\t\\tEstimating topic vectors ...\");\n\t\tsumExpValues = new double[numTopics];\n\t\tdotProductValues = new double[numTopics][vocabularySize];\n\t\texpDotProductValues = new double[numTopics][vocabularySize];\n\n\t\tParallel.loop(numTopics, new Parallel.LoopInt()\n\t\t{\n\t\t\t@Override\n\t\t\tpublic void compute(int topic)\n\t\t\t{\n\t\t\t\tint rate = 1;\n\t\t\t\tboolean check = true;\n\t\t\t\twhile (check) {\n\t\t\t\t\tdouble l2Value = l2Regularizer * rate;\n\t\t\t\t\ttry {\n\t\t\t\t\t\tTopicVectorOptimizer optimizer = new TopicVectorOptimizer(\n\t\t\t\t\t\t\ttopicVectors[topic], topicWordCountLF[topic],\n\t\t\t\t\t\t\twordVectors, l2Value);\n\n\t\t\t\t\t\tOptimizer gd = new LBFGS(optimizer, tolerance);\n\t\t\t\t\t\tgd.optimize(600);\n\t\t\t\t\t\toptimizer.getParameters(topicVectors[topic]);\n\t\t\t\t\t\tsumExpValues[topic] = optimizer\n\t\t\t\t\t\t\t.computePartitionFunction(dotProductValues[topic],\n\t\t\t\t\t\t\t\texpDotProductValues[topic]);\n\t\t\t\t\t\tcheck = false;\n\n\t\t\t\t\t\tif (sumExpValues[topic] == 0\n\t\t\t\t\t\t\t|| Double.isInfinite(sumExpValues[topic])) {\n\t\t\t\t\t\t\tdouble max = -1000000000.0;\n\t\t\t\t\t\t\tfor (int index = 0; index < vocabularySize; index++) {\n\t\t\t\t\t\t\t\tif (dotProductValues[topic][index] > max)\n\t\t\t\t\t\t\t\t\tmax = dotProductValues[topic][index];\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tfor (int index = 0; index < vocabularySize; index++) {\n\t\t\t\t\t\t\t\texpDotProductValues[topic][index] = 
Math\n\t\t\t\t\t\t\t\t\t.exp(dotProductValues[topic][index] - max);\n\t\t\t\t\t\t\t\tsumExpValues[topic] += expDotProductValues[topic][index];\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tcatch (InvalidOptimizableException e) {\n\t\t\t\t\t\te.printStackTrace();\n\t\t\t\t\t\tcheck = true;\n\t\t\t\t\t}\n\t\t\t\t\trate = rate * 10;\n\t\t\t\t}\n\t\t\t}\n\t\t});\n\t}\n\n\tpublic void sampleSingleIteration()\n\t{\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tint docSize = corpus.get(dIndex).size();\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\t// Get current word\n\t\t\t\tint word = corpus.get(dIndex).get(wIndex);// wordID\n\t\t\t\tint subtopic = topicAssignments.get(dIndex).get(wIndex);\n\t\t\t\tint topic = subtopic % numTopics;\n\n\t\t\t\tdocTopicCount[dIndex][topic] -= 1;\n\t\t\t\tif (subtopic == topic) {\n\t\t\t\t\ttopicWordCountLF[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] -= 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\ttopicWordCountLDA[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountLDA[topic] -= 1;\n\t\t\t\t}\n\n\t\t\t\t// Sample a pair of topic z and binary indicator variable s\n\t\t\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\n\t\t\t\t\tmultiPros[tIndex] = (docTopicCount[dIndex][tIndex] + alpha)\n\t\t\t\t\t\t* lambda * expDotProductValues[tIndex][word]\n\t\t\t\t\t\t/ sumExpValues[tIndex];\n\n\t\t\t\t\tmultiPros[tIndex + numTopics] = (docTopicCount[dIndex][tIndex] + alpha)\n\t\t\t\t\t\t* (1 - lambda)\n\t\t\t\t\t\t* (topicWordCountLDA[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountLDA[tIndex] + betaSum);\n\n\t\t\t\t}\n\t\t\t\tsubtopic = FuncUtils.nextDiscrete(multiPros);\n\t\t\t\ttopic = subtopic % numTopics;\n\n\t\t\t\tdocTopicCount[dIndex][topic] += 1;\n\t\t\t\tif (subtopic == topic) {\n\t\t\t\t\ttopicWordCountLF[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\ttopicWordCountLDA[topic][word] += 
1;\n\t\t\t\t\tsumTopicWordCountLDA[topic] += 1;\n\t\t\t\t}\n\t\t\t\t// Update topic assignments\n\t\t\t\ttopicAssignments.get(dIndex).set(wIndex, subtopic);\n\t\t\t}\n\t\t}\n\t}\n\n\tpublic void sampleSingleInitialIteration()\n\t{\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tint docSize = corpus.get(dIndex).size();\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\tint word = corpus.get(dIndex).get(wIndex);// wordID\n\t\t\t\tint subtopic = topicAssignments.get(dIndex).get(wIndex);\n\t\t\t\tint topic = subtopic % numTopics;\n\n\t\t\t\tdocTopicCount[dIndex][topic] -= 1;\n\t\t\t\tif (subtopic == topic) { // LF(w|t) + LDA(t|d)\n\t\t\t\t\ttopicWordCountLF[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] -= 1;\n\t\t\t\t}\n\t\t\t\telse { // LDA(w|t) + LDA(t|d)\n\t\t\t\t\ttopicWordCountLDA[topic][word] -= 1;\n\t\t\t\t\tsumTopicWordCountLDA[topic] -= 1;\n\t\t\t\t}\n\n\t\t\t\t// Sample a pair of topic z and binary indicator variable s\n\t\t\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\n\t\t\t\t\tmultiPros[tIndex] = (docTopicCount[dIndex][tIndex] + alpha)\n\t\t\t\t\t\t* lambda * (topicWordCountLF[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountLF[tIndex] + betaSum);\n\n\t\t\t\t\tmultiPros[tIndex + numTopics] = (docTopicCount[dIndex][tIndex] + alpha)\n\t\t\t\t\t\t* (1 - lambda)\n\t\t\t\t\t\t* (topicWordCountLDA[tIndex][word] + beta)\n\t\t\t\t\t\t/ (sumTopicWordCountLDA[tIndex] + betaSum);\n\n\t\t\t\t}\n\t\t\t\tsubtopic = FuncUtils.nextDiscrete(multiPros);\n\t\t\t\ttopic = subtopic % numTopics;\n\n\t\t\t\tdocTopicCount[dIndex][topic] += 1;\n\t\t\t\tif (topic == subtopic) {\n\t\t\t\t\ttopicWordCountLF[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountLF[topic] += 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\ttopicWordCountLDA[topic][word] += 1;\n\t\t\t\t\tsumTopicWordCountLDA[topic] += 1;\n\t\t\t\t}\n\t\t\t\t// Update topic assignments\n\t\t\t\ttopicAssignments.get(dIndex).set(wIndex, 
subtopic);\n\t\t\t}\n\n\t\t}\n\t}\n\n\tpublic void writeParameters()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".paras\"));\n\t\twriter.write(\"-model\" + \"\\t\" + \"LFLDA\");\n\t\twriter.write(\"\\n-corpus\" + \"\\t\" + corpusPath);\n\t\twriter.write(\"\\n-vectors\" + \"\\t\" + vectorFilePath);\n\t\twriter.write(\"\\n-ntopics\" + \"\\t\" + numTopics);\n\t\twriter.write(\"\\n-alpha\" + \"\\t\" + alpha);\n\t\twriter.write(\"\\n-beta\" + \"\\t\" + beta);\n\t\twriter.write(\"\\n-lambda\" + \"\\t\" + lambda);\n\t\twriter.write(\"\\n-initers\" + \"\\t\" + numInitIterations);\n\t\twriter.write(\"\\n-niters\" + \"\\t\" + numIterations);\n\t\twriter.write(\"\\n-twords\" + \"\\t\" + topWords);\n\t\twriter.write(\"\\n-name\" + \"\\t\" + expName);\n\t\tif (savestep > 0)\n\t\t\twriter.write(\"\\n-sstep\" + \"\\t\" + savestep);\n\n\t\twriter.close();\n\t}\n\n\tpublic void writeDictionary()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".vocabulary\"));\n\t\tfor (String word : word2IdVocabulary.keySet()) {\n\t\t\twriter.write(word + \" \" + word2IdVocabulary.get(word) + \"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeIDbasedCorpus()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".IDcorpus\"));\n\t\tfor (int dIndex = 0; dIndex < numDocuments; dIndex++) {\n\t\t\tint docSize = corpus.get(dIndex).size();\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\twriter.write(corpus.get(dIndex).get(wIndex) + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopicAssignments()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".topicAssignments\"));\n\t\tfor (int dIndex = 0; dIndex < numDocuments; 
dIndex++) {\n\t\t\tint docSize = corpus.get(dIndex).size();\n\t\t\tfor (int wIndex = 0; wIndex < docSize; wIndex++) {\n\t\t\t\twriter.write(topicAssignments.get(dIndex).get(wIndex) + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopicVectors()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".topicVectors\"));\n\t\tfor (int i = 0; i < numTopics; i++) {\n\t\t\tfor (int j = 0; j < vectorSize; j++)\n\t\t\t\twriter.write(topicVectors[i][j] + \" \");\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopTopicalWords()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".topWords\"));\n\n\t\tfor (int tIndex = 0; tIndex < numTopics; tIndex++) {\n\t\t\twriter.write(\"Topic\" + tIndex + \":\");\n\n\t\t\tMap<Integer, Double> topicWordProbs = new TreeMap<Integer, Double>();\n\t\t\tfor (int wIndex = 0; wIndex < vocabularySize; wIndex++) {\n\n\t\t\t\tdouble pro = lambda * expDotProductValues[tIndex][wIndex]\n\t\t\t\t\t/ sumExpValues[tIndex] + (1 - lambda)\n\t\t\t\t\t* (topicWordCountLDA[tIndex][wIndex] + beta)\n\t\t\t\t\t/ (sumTopicWordCountLDA[tIndex] + betaSum);\n\n\t\t\t\ttopicWordProbs.put(wIndex, pro);\n\t\t\t}\n\t\t\ttopicWordProbs = FuncUtils.sortByValueDescending(topicWordProbs);\n\n\t\t\tSet<Integer> mostLikelyWords = topicWordProbs.keySet();\n\t\t\tint count = 0;\n\t\t\tfor (Integer index : mostLikelyWords) {\n\t\t\t\tif (count < topWords) {\n\t\t\t\t\twriter.write(\" \" + id2WordVocabulary.get(index));\n\t\t\t\t\tcount += 1;\n\t\t\t\t}\n\t\t\t\telse {\n\t\t\t\t\twriter.write(\"\\n\\n\");\n\t\t\t\t\tbreak;\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeTopicWordPros()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + 
\".phi\"));\n\t\tfor (int t = 0; t < numTopics; t++) {\n\t\t\tfor (int w = 0; w < vocabularySize; w++) {\n\t\t\t\tdouble pro = lambda * expDotProductValues[t][w]\n\t\t\t\t\t/ sumExpValues[t] + (1 - lambda)\n\t\t\t\t\t* (topicWordCountLDA[t][w] + beta)\n\t\t\t\t\t/ (sumTopicWordCountLDA[t] + betaSum);\n\t\t\t\twriter.write(pro + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void writeDocTopicPros()\n\t\tthrows IOException\n\t{\n\t\tBufferedWriter writer = new BufferedWriter(new FileWriter(folderPath\n\t\t\t+ expName + \".theta\"));\n\n\t\tfor (int i = 0; i < numDocuments; i++) {\n\t\t\tfor (int j = 0; j < numTopics; j++) {\n\t\t\t\tdouble pro = (docTopicCount[i][j] + alpha)\n\t\t\t\t\t/ (sumDocTopicCount[i] + alphaSum);\n\t\t\t\twriter.write(pro + \" \");\n\t\t\t}\n\t\t\twriter.write(\"\\n\");\n\t\t}\n\t\twriter.close();\n\t}\n\n\tpublic void write()\n\t\tthrows IOException\n\t{\n\t\twriteTopTopicalWords();\n\t\twriteDocTopicPros();\n\t\twriteTopicAssignments();\n\t\twriteTopicWordPros();\n\t}\n\n\tpublic static void main(String args[])\n\t\tthrows Exception\n\t{\n\t\tLFLDA_Inf lflda = new LFLDA_Inf(\"test/testLFLDA.paras\",\n\t\t\t\"test/corpus_test.txt\", 2000, 200, 20, \"testLFLDAinf\", 0);\n\t\tlflda.inference();\n\t}\n}\n"
  },
  {
    "path": "src/models/TopicVectorOptimizer.java",
"content": "package models;\n\nimport cc.mallet.optimize.Optimizable;\nimport cc.mallet.types.MatrixOps;\n\n/**\n * Implementation of the MAP estimation for learning topic vectors, as described\n * in section 3.5 in:\n * \n * Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015.\n * Improving Topic Models with Latent Feature Word Representations. Transactions\n * of the Association for Computational Linguistics, vol. 3, pp. 299-313.\n * \n * @author Dat Quoc Nguyen\n */\n\npublic class TopicVectorOptimizer\n\timplements Optimizable.ByGradientValue\n{\n\t// Number of times a word type is assigned to the topic\n\tint[] wordCount;\n\tint totalCount; // Total number of words assigned to the topic\n\tint vocaSize; // Size of the vocabulary\n\t// wordCount.length = wordVectors.length = vocaSize\n\tdouble[][] wordVectors;// Vector representations for words\n\tdouble[] topicVector;// Vector representation for a topic\n\tint vectorSize; // vectorSize = topicVector.length\n\n\t// For the i_{th} element of the topic vector, compute:\n\t// sum_w wordCount[w] * wordVectors[w][i]\n\tdouble[] expectedCountValues;\n\n\tdouble l2Constant; // L2 regularizer for learning topic vectors\n\tdouble[] dotProductValues;\n\tdouble[] expDotProductValues;\n\n\tpublic TopicVectorOptimizer(double[] inTopicVector, int[] inWordCount,\n\t\tdouble[][] inWordVectors, double inL2Constant)\n\t{\n\t\tvocaSize = inWordCount.length;\n\t\tvectorSize = inWordVectors[0].length;\n\t\tl2Constant = inL2Constant;\n\n\t\ttopicVector = new double[vectorSize];\n\t\tSystem\n\t\t\t.arraycopy(inTopicVector, 0, topicVector, 0, inTopicVector.length);\n\n\t\twordCount = new int[vocaSize];\n\t\tSystem.arraycopy(inWordCount, 0, wordCount, 0, vocaSize);\n\t\twordVectors = new double[vocaSize][vectorSize];\n\t\tfor (int w = 0; w < vocaSize; w++)\n\t\t\tSystem\n\t\t\t\t.arraycopy(inWordVectors[w], 0, wordVectors[w], 0, vectorSize);\n\n\t\ttotalCount = 0;\n\t\tfor (int w = 0; w < vocaSize; w++) 
{\n\t\t\ttotalCount += wordCount[w];\n\t\t}\n\n\t\texpectedCountValues = new double[vectorSize];\n\t\tfor (int i = 0; i < vectorSize; i++) {\n\t\t\tfor (int w = 0; w < vocaSize; w++) {\n\t\t\t\texpectedCountValues[i] += wordCount[w] * wordVectors[w][i];\n\t\t\t}\n\t\t}\n\n\t\tdotProductValues = new double[vocaSize];\n\t\texpDotProductValues = new double[vocaSize];\n\t}\n\n\t@Override\n\tpublic int getNumParameters()\n\t{\n\t\treturn vectorSize;\n\t}\n\n\t@Override\n\tpublic void getParameters(double[] buffer)\n\t{\n\t\tfor (int i = 0; i < vectorSize; i++)\n\t\t\tbuffer[i] = topicVector[i];\n\t}\n\n\t@Override\n\tpublic double getParameter(int index)\n\t{\n\t\treturn topicVector[index];\n\t}\n\n\t@Override\n\tpublic void setParameters(double[] params)\n\t{\n\t\tfor (int i = 0; i < params.length; i++)\n\t\t\ttopicVector[i] = params[i];\n\t}\n\n\t@Override\n\tpublic void setParameter(int index, double value)\n\t{\n\t\ttopicVector[index] = value;\n\t}\n\n\t@Override\n\tpublic void getValueGradient(double[] buffer)\n\t{\n\t\tdouble partitionFuncValue = computePartitionFunction(dotProductValues,\n\t\t\texpDotProductValues);\n\n\t\tfor (int i = 0; i < vectorSize; i++) {\n\t\t\tbuffer[i] = 0.0;\n\n\t\t\tdouble expectationValue = 0.0;\n\t\t\tfor (int w = 0; w < vocaSize; w++) {\n\t\t\t\texpectationValue += wordVectors[w][i] * expDotProductValues[w];\n\t\t\t}\n\t\t\texpectationValue = expectationValue / partitionFuncValue;\n\n\t\t\tbuffer[i] = expectedCountValues[i] - totalCount * expectationValue\n\t\t\t\t- 2 * l2Constant * topicVector[i];\n\t\t}\n\t}\n\n\t@Override\n\tpublic double getValue()\n\t{\n\t\tdouble logPartitionFuncValue = Math.log(computePartitionFunction(\n\t\t\tdotProductValues, expDotProductValues));\n\n\t\tdouble value = 0.0;\n\t\tfor (int w = 0; w < vocaSize; w++) {\n\t\t\tif (wordCount[w] == 0)\n\t\t\t\tcontinue;\n\t\t\tvalue += wordCount[w] * dotProductValues[w];\n\t\t}\n\t\tvalue = value - totalCount * logPartitionFuncValue - l2Constant\n\t\t\t* 
MatrixOps.twoNormSquared(topicVector);\n\n\t\treturn value;\n\t}\n\n\t// Compute the partition function\n\tpublic double computePartitionFunction(double[] elements1,\n\t\tdouble[] elements2)\n\t{\n\t\tdouble value = 0.0;\n\t\tfor (int w = 0; w < vocaSize; w++) {\n\t\t\telements1[w] = MatrixOps.dotProduct(wordVectors[w], topicVector);\n\t\t\telements2[w] = Math.exp(elements1[w]);\n\t\t\tvalue += elements2[w];\n\t\t}\n\t\treturn value;\n\t}\n}\n"
  },
  {
    "path": "src/utility/CmdArgs.java",
    "content": "package utility;\n\nimport org.kohsuke.args4j.Option;\n\npublic class CmdArgs\n{\n\n\t@Option(name = \"-model\", usage = \"Specify model\", required = true)\n\tpublic String model = \"\";\n\n\t@Option(name = \"-corpus\", usage = \"Specify path to topic modeling corpus\")\n\tpublic String corpus = \"\";\n\n\t@Option(name = \"-vectors\", usage = \"Specify path to the file containing word vectors\")\n\tpublic String vectors = \"\";\n\n\t@Option(name = \"-ntopics\", usage = \"Specify number of topics\")\n\tpublic int ntopics = 20;\n\n\t@Option(name = \"-alpha\", usage = \"Specify alpha\")\n\tpublic double alpha = 0.1;\n\n\t@Option(name = \"-beta\", usage = \"Specify beta\")\n\tpublic double beta = 0.01;\n\n\t@Option(name = \"-lambda\", usage = \"Specify mixture weight lambda\")\n\tpublic double lambda = 0.6;\n\n\t@Option(name = \"-initers\", usage = \"Specify number of initial sampling iterations\")\n\tpublic int initers = 2000;\n\n\t@Option(name = \"-niters\", usage = \"Specify number of EM-style sampling iterations\")\n\tpublic int niters = 200;\n\n\t@Option(name = \"-twords\", usage = \"Specify number of top topical words\")\n\tpublic int twords = 20;\n\n\t@Option(name = \"-name\", usage = \"Specify a name to a topic modeling experiment\")\n\tpublic String expModelName = \"model\";\n\n\t@Option(name = \"-initFile\")\n\tpublic String initTopicAssgns = \"\";\n\n\t@Option(name = \"-sstep\")\n\tpublic int savestep = 0;\n\n\t@Option(name = \"-dir\")\n\tpublic String dir = \"\";\n\n\t@Option(name = \"-label\")\n\tpublic String labelFile = \"\";\n\n\t@Option(name = \"-prob\")\n\tpublic String prob = \"\";\n\n\t@Option(name = \"-paras\", usage = \"Specify path to hyper-parameter file\")\n\tpublic String paras = \"\";\n\n}\n"
  },
  {
    "path": "src/utility/FuncUtils.java",
    "content": "package utility;\n\nimport java.util.Collections;\nimport java.util.Comparator;\nimport java.util.LinkedHashMap;\nimport java.util.LinkedList;\nimport java.util.List;\nimport java.util.Map;\n\npublic class FuncUtils\n{\n    public static <K, V extends Comparable<? super V>> Map<K, V> sortByValueDescending(Map<K, V> map)\n    {\n        List<Map.Entry<K, V>> list = new LinkedList<Map.Entry<K, V>>(map.entrySet());\n        Collections.sort(list, new Comparator<Map.Entry<K, V>>()\n        {\n            @Override\n            public int compare(Map.Entry<K, V> o1, Map.Entry<K, V> o2)\n            {\n                int compare = (o1.getValue()).compareTo(o2.getValue());\n                return -compare;\n            }\n        });\n\n        Map<K, V> result = new LinkedHashMap<K, V>();\n        for (Map.Entry<K, V> entry : list) {\n            result.put(entry.getKey(), entry.getValue());\n        }\n        return result;\n    }\n\n    public static <K, V extends Comparable<? 
super V>> Map<K, V> sortByValueAscending(Map<K, V> map)\n    {\n        List<Map.Entry<K, V>> list = new LinkedList<Map.Entry<K, V>>(map.entrySet());\n        Collections.sort(list, new Comparator<Map.Entry<K, V>>()\n        {\n            @Override\n            public int compare(Map.Entry<K, V> o1, Map.Entry<K, V> o2)\n            {\n                int compare = (o1.getValue()).compareTo(o2.getValue());\n                return compare;\n            }\n        });\n\n        Map<K, V> result = new LinkedHashMap<K, V>();\n        for (Map.Entry<K, V> entry : list) {\n            result.put(entry.getKey(), entry.getValue());\n        }\n        return result;\n    }\n\n    /**\n     * Sample a value from a double array\n     * \n     * @param probs\n     * @return\n     */\n    public static int nextDiscrete(double[] probs)\n    {\n        double sum = 0.0;\n        for (int i = 0; i < probs.length; i++)\n            sum += probs[i];\n\n        double r = MTRandom.nextDouble() * sum;\n\n        sum = 0.0;\n        for (int i = 0; i < probs.length; i++) {\n            sum += probs[i];\n            if (sum > r)\n                return i;\n        }\n        return probs.length - 1;\n    }\n\n    public static double mean(double[] m)\n    {\n        double sum = 0;\n        for (int i = 0; i < m.length; i++)\n            sum += m[i];\n        return sum / m.length;\n    }\n\n    public static double stddev(double[] m)\n    {\n        double mean = mean(m);\n        double s = 0;\n        for (int i = 0; i < m.length; i++)\n            s += (m[i] - mean) * (m[i] - mean);\n        return Math.sqrt(s / m.length);\n    }\n}\n"
  },
  {
    "path": "src/utility/LBFGS.java",
    "content": "/* Copyright (C) 2002 Univ. of Massachusetts Amherst, Computer Science Dept.\n   This file is part of \"MALLET\" (MAchine Learning for LanguagE Toolkit).\n   http://www.cs.umass.edu/~mccallum/mallet\n   This software is provided under the terms of the Common Public License,\n   version 1.0, as published by http://www.opensource.org.  For further\n   information, see the file `LICENSE' included with this distribution. */\n\n/** \n @author Aron Culotta <a href=\"mailto:culotta@cs.umass.edu\">culotta@cs.umass.edu</a>\n */\n\n/**\n Limited Memory BFGS, as described in Byrd, Nocedal, and Schnabel,\n \"Representations of Quasi-Newton Matrices and Their Use in Limited\n Memory Methods\"\n */\npackage utility;\n\nimport java.util.LinkedList;\nimport java.util.logging.Logger;\n\nimport cc.mallet.optimize.BackTrackLineSearch;\nimport cc.mallet.optimize.InvalidOptimizableException;\nimport cc.mallet.optimize.LineOptimizer;\nimport cc.mallet.optimize.Optimizable;\nimport cc.mallet.optimize.Optimizer;\nimport cc.mallet.optimize.OptimizerEvaluator;\nimport cc.mallet.types.MatrixOps;\nimport cc.mallet.util.MalletLogger;\n\npublic class LBFGS\n    implements Optimizer\n{\n    private static Logger logger = MalletLogger\n            .getLogger(\"edu.umass.cs.mallet.base.ml.maximize.LimitedMemoryBFGS\");\n\n    boolean converged = false;\n    Optimizable.ByGradientValue optimizable;\n    final int maxIterations = 5000;\n    // xxx need a more principled stopping point\n    // final double tolerance = .0001;\n    // private double tolerance = 1.0e-10;\n    // final double gradientTolerance = 1.0e-10;\n\n    double tolerance;// = 1.0e-3;\n    final double gradientTolerance = 1.0e-5;\n    final double eps = 1.0e-10;\n\n    // The number of corrections used in BFGS update\n    // ideally 3 <= m <= 7. 
Larger m means more cpu time, memory.\n    final int m = 10;\n\n    // Line search function\n    private LineOptimizer.ByGradient lineMaximizer;\n\n    public LBFGS(Optimizable.ByGradientValue function, double inTolerance)\n    {\n        tolerance = inTolerance;\n        this.optimizable = function;\n        lineMaximizer = new BackTrackLineSearch(function);\n    }\n\n    @Override\n    public Optimizable getOptimizable()\n    {\n        return this.optimizable;\n    }\n\n    @Override\n    public boolean isConverged()\n    {\n        return converged;\n    }\n\n    /**\n     * Sets the LineOptimizer.ByGradient to use in L-BFGS optimization.\n     * \n     * @param lineOpt\n     *            line optimizer for L-BFGS\n     */\n    public void setLineOptimizer(LineOptimizer.ByGradient lineOpt)\n    {\n        lineMaximizer = lineOpt;\n    }\n\n    // State of search\n    // g = gradient\n    // s = list of m previous \"parameters\" values\n    // y = list of m previous \"g\" values\n    // rho = intermediate calculation\n    double[] g, oldg, direction, parameters, oldParameters;\n    LinkedList s = new LinkedList();\n    LinkedList y = new LinkedList();\n    LinkedList rho = new LinkedList();\n    double[] alpha;\n    static double step = 1.0;\n    int iterations;\n\n    private OptimizerEvaluator.ByGradient eval = null;\n\n    // CPAL - added this\n    public void setTolerance(double newtol)\n    {\n        this.tolerance = newtol;\n    }\n\n    public void setEvaluator(OptimizerEvaluator.ByGradient eval)\n    {\n        this.eval = eval;\n    }\n\n    public int getIteration()\n    {\n        return iterations;\n    }\n\n    @Override\n    public boolean optimize()\n    {\n        return optimize(Integer.MAX_VALUE);\n    }\n\n    @Override\n    public boolean optimize(int numIterations)\n    {\n\n        double initialValue = optimizable.getValue();\n        logger.fine(\"Entering L-BFGS.optimize(). 
Initial Value=\" + initialValue);\n\n        if (g == null) { // first time through\n            logger.fine(\"First time through L-BFGS\");\n            iterations = 0;\n            s = new LinkedList();\n            y = new LinkedList();\n            rho = new LinkedList();\n            alpha = new double[m];\n            for (int i = 0; i < m; i++)\n                alpha[i] = 0.0;\n\n            parameters = new double[optimizable.getNumParameters()];\n            oldParameters = new double[optimizable.getNumParameters()];\n            g = new double[optimizable.getNumParameters()];\n            oldg = new double[optimizable.getNumParameters()];\n            direction = new double[optimizable.getNumParameters()];\n\n            optimizable.getParameters(parameters);\n            System.arraycopy(parameters, 0, oldParameters, 0, parameters.length);\n\n            optimizable.getValueGradient(g);\n            System.arraycopy(g, 0, oldg, 0, g.length);\n            System.arraycopy(g, 0, direction, 0, g.length);\n\n            if (MatrixOps.absNormalize(direction) == 0) {\n                logger.info(\"L-BFGS initial gradient is zero; saying converged\");\n                g = null;\n                converged = true;\n                return true;\n            }\n            logger.fine(\"direction.2norm: \" + MatrixOps.twoNorm(direction));\n            MatrixOps.timesEquals(direction, 1.0 / MatrixOps.twoNorm(direction));\n            // make initial jump\n            logger.fine(\"before initial jump: \\ndirection.2norm: \" + MatrixOps.twoNorm(direction)\n                    + \" \\ngradient.2norm: \" + MatrixOps.twoNorm(g) + \"\\nparameters.2norm: \"\n                    + MatrixOps.twoNorm(parameters));\n\n            // TestMaximizable.testValueAndGradientInDirection (maxable,\n            // direction);\n            step = lineMaximizer.optimize(direction, step);\n            if (step == 0.0) {// could not step in this direction.\n                // // give up 
and say converged.\n                // g = null; // reset search\n                // step = 1.0;\n                // throw new OptimizationException(\n                // \"Line search could not step in the current direction. \"\n                // +\n                // \"(This is not necessarily cause for alarm. Sometimes this happens close to the maximum,\"\n                // + \" where the function may be very flat.)\");\n\n                return false;\n            }\n            optimizable.getParameters(parameters);\n            optimizable.getValueGradient(g);\n            logger.fine(\"after initial jump: \\ndirection.2norm: \" + MatrixOps.twoNorm(direction)\n                    + \" \\ngradient.2norm: \" + MatrixOps.twoNorm(g));\n        }\n\n        double value = optimizable.getValue();\n\n        for (int iterationCount = 0; iterationCount < numIterations; iterationCount++) {\n\n            logger.fine(\"L-BFGS iteration=\" + iterationCount + \", value=\" + value + \" g.twoNorm: \"\n                    + MatrixOps.twoNorm(g) + \" oldg.twoNorm: \" + MatrixOps.twoNorm(oldg));\n\n            // if (iterationCount % 10 == 0)\n            // System.out.println(\"\\t\\tL-BFGS iteration=\" + iterationCount\n            // + \", value=\" + value + \" g.twoNorm: \"\n            // + MatrixOps.twoNorm(g) + \" oldg.twoNorm: \"\n            // + MatrixOps.twoNorm(oldg));\n\n            // get difference between previous 2 gradients and parameters\n            double sy = 0.0;\n            double yy = 0.0;\n            for (int i = 0; i < oldParameters.length; i++) {\n                // -inf - (-inf) = 0; inf - inf = 0\n                if (Double.isInfinite(parameters[i]) && Double.isInfinite(oldParameters[i])\n                        && (parameters[i] * oldParameters[i] > 0))\n                    oldParameters[i] = 0.0;\n                else\n                    oldParameters[i] = parameters[i] - oldParameters[i];\n                if (Double.isInfinite(g[i]) && 
Double.isInfinite(oldg[i]) && (g[i] * oldg[i] > 0))\n                    oldg[i] = 0.0;\n                else\n                    oldg[i] = g[i] - oldg[i];\n                sy += oldParameters[i] * oldg[i]; // si * yi\n                yy += oldg[i] * oldg[i];\n                direction[i] = g[i];\n            }\n\n            if (sy > 0) {\n                throw new InvalidOptimizableException(\"sy = \" + sy + \" > 0\");\n            }\n\n            double gamma = sy / yy; // scaling factor\n            if (gamma > 0)\n                throw new InvalidOptimizableException(\"gamma = \" + gamma + \" > 0\");\n\n            push(rho, 1.0 / sy);\n            push(s, oldParameters);\n            push(y, oldg);\n            // calculate new direction\n            assert (s.size() == y.size()) : \"s.size: \" + s.size() + \" y.size: \" + y.size();\n            for (int i = s.size() - 1; i >= 0; i--) {\n                alpha[i] = ((Double) rho.get(i)).doubleValue()\n                        * MatrixOps.dotProduct((double[]) s.get(i), direction);\n                MatrixOps.plusEquals(direction, (double[]) y.get(i), -1.0 * alpha[i]);\n            }\n            MatrixOps.timesEquals(direction, gamma);\n            for (int i = 0; i < y.size(); i++) {\n                double beta = (((Double) rho.get(i)).doubleValue())\n                        * MatrixOps.dotProduct((double[]) y.get(i), direction);\n                MatrixOps.plusEquals(direction, (double[]) s.get(i), alpha[i] - beta);\n            }\n\n            for (int i = 0; i < oldg.length; i++) {\n                oldParameters[i] = parameters[i];\n                oldg[i] = g[i];\n                direction[i] *= -1.0;\n            }\n            logger.fine(\"before linesearch: direction.gradient.dotprod: \"\n                    + MatrixOps.dotProduct(direction, g) + \"\\ndirection.2norm: \"\n                    + MatrixOps.twoNorm(direction) + \"\\nparameters.2norm: \"\n                    + 
MatrixOps.twoNorm(parameters));\n            // TestMaximizable.testValueAndGradientInDirection (maxable,\n            // direction);\n            step = lineMaximizer.optimize(direction, step);\n            if (step == 0.0) { // could not step in this direction.\n                g = null; // reset search\n                step = 1.0;\n                // xxx Temporary test; passed OK\n                // TestMaximizable.testValueAndGradientInDirection (maxable,\n                // direction);\n                // System.out\n                // .println(\"\\t\\tLine search could not step in the current direction.\");\n                // throw new OptimizationException(\n                // \"Line search could not step in the current direction. \"\n                // +\n                // \"(This is not necessarily cause for alarm. Sometimes this happens close to the maximum,\"\n                // + \" where the function may be very flat.)\");\n                return false;\n            }\n            optimizable.getParameters(parameters);\n            optimizable.getValueGradient(g);\n            logger.fine(\"after linesearch: direction.2norm: \" + MatrixOps.twoNorm(direction));\n            double newValue = optimizable.getValue();\n\n            // Test for terminations\n            // if(2.0*Math.abs(newValue-value) <= tolerance*\n            // (Math.abs(newValue)+Math.abs(value) + eps)){\n            if (Math.abs(newValue - value) <= tolerance) {\n                // System.out.println(\"\\t\\tNumber of iterations: \"\n                // + iterationCount);\n                // System.out\n                // .println(\"\\t\\tExiting L-BFGS on termination #1:\\n\\t\\tvalue difference below \"\n                // + tolerance\n                // + \" (oldValue: \"\n                // + value\n                // + \" newValue: \"\n                // + newValue\n                // + \" gradient.twoNorm: \"\n                // + MatrixOps.twoNorm(g) + \")\");\n             
   converged = true;\n                return true;\n            }\n\n            value = newValue;\n\n            double gg = MatrixOps.twoNorm(g);\n            if (gg < gradientTolerance) {\n                logger.fine(\"Exiting L-BFGS on termination #2: \\ngradient=\" + gg + \" < \"\n                        + gradientTolerance);\n                converged = true;\n                return true;\n            }\n            if (gg == 0.0) {\n                logger.fine(\"Exiting L-BFGS on termination #3: \\ngradient==0.0\");\n                converged = true;\n                return true;\n            }\n            logger.fine(\"Gradient = \" + gg);\n            iterations++;\n            if (iterations > maxIterations) {\n                System.err\n                        .println(\"Too many iterations in L-BFGS.java. Continuing with current parameters.\");\n                converged = true;\n                return true;\n                // throw new IllegalStateException (\"Too many iterations.\");\n            }\n\n            // end of iteration. call evaluator\n            if (eval != null && !eval.evaluate(optimizable, iterationCount)) {\n                logger.fine(\"Exiting L-BFGS on termination #4: evaluator returned false.\");\n                converged = true;\n                return false;\n            }\n        }\n        return false;\n    }\n\n    /**\n     * Resets the previous gradients and values that are used to approximate the Hessian. 
NOTE - If\n     * the {@link Optimizable} object is modified externally, this method should be called to avoid\n     * IllegalStateExceptions.\n     */\n    public void reset()\n    {\n        g = null;\n    }\n\n    /**\n     * Pushes a new object onto the queue l\n     * \n     * @param l\n     *            linked list queue of Matrix obj's\n     * @param toadd\n     *            matrix to push onto queue\n     */\n    private void push(LinkedList l, double[] toadd)\n    {\n        assert (l.size() <= m);\n        if (l.size() == m) {\n            // remove oldest matrix and add newest to end of list.\n            // to make this more efficient, actually overwrite\n            // memory of oldest matrix\n\n            // this overwrites the oldest matrix\n            double[] last = (double[]) l.get(0);\n            System.arraycopy(toadd, 0, last, 0, toadd.length);\n            Object ptr = last;\n            // this readjusts the pointers in the list\n            for (int i = 0; i < l.size() - 1; i++)\n                l.set(i, l.get(i + 1));\n            l.set(m - 1, ptr);\n        }\n        else {\n            double[] newArray = new double[toadd.length];\n            System.arraycopy(toadd, 0, newArray, 0, toadd.length);\n            l.addLast(newArray);\n        }\n    }\n\n    /**\n     * Pushes a new object onto the queue l\n     * \n     * @param l\n     *            linked list queue of Double obj's\n     * @param toadd\n     *            double value to push onto queue\n     */\n    private void push(LinkedList l, double toadd)\n    {\n        assert (l.size() <= m);\n        if (l.size() == m) { // pop old double and add new\n            l.removeFirst();\n            l.addLast(Double.valueOf(toadd));\n        }\n        else\n            l.addLast(Double.valueOf(toadd));\n    }\n\n}\n"
  },
  {
    "path": "src/utility/MTRandom.java",
    "content": "package utility;\n\npublic class MTRandom\n{\n\n    private static MersenneTwister rand = new MersenneTwister();\n\n    public static void setSeed(long seed)\n    {\n        rand.setSeed(seed);\n    }\n\n    public static double nextDouble()\n    {\n        return rand.nextDouble();\n    }\n\n    public static int nextInt(int n)\n    {\n        return rand.nextInt(n);\n    }\n\n    public static boolean nextBoolean()\n    {\n        return rand.nextBoolean();\n    }\n}\n"
  },
  {
    "path": "src/utility/MersenneTwister.java",
    "content": "package utility;\n\nimport java.io.DataInputStream;\nimport java.io.DataOutputStream;\nimport java.io.IOException;\nimport java.io.ObjectInputStream;\nimport java.io.ObjectOutputStream;\nimport java.io.Serializable;\n\n/**\n * <h3>MersenneTwister and MersenneTwisterFast</h3>\n * <p>\n * <b>Version 20</b>, based on version MT199937(99/10/29) of the Mersenne Twister algorithm found at\n * <a href=\"http://www.math.keio.ac.jp/matumoto/emt.html\"> The Mersenne Twister Home Page</a>, with\n * the initialization improved using the new 2002/1/26 initialization algorithm By Sean Luke,\n * October 2004.\n * \n * <p>\n * <b>MersenneTwister</b> is a drop-in subclass replacement for java.util.Random. It is properly\n * synchronized and can be used in a multithreaded environment. On modern VMs such as HotSpot, it is\n * approximately 1/3 slower than java.util.Random.\n *\n * <p>\n * <b>MersenneTwisterFast</b> is not a subclass of java.util.Random. It has the same public methods\n * as Random does, however, and it is algorithmically identical to MersenneTwister.\n * MersenneTwisterFast has hard-code inlined all of its methods directly, and made all of them final\n * (well, the ones of consequence anyway). Further, these methods are <i>not</i> synchronized, so\n * the same MersenneTwisterFast instance cannot be shared by multiple threads. But all this helps\n * MersenneTwisterFast achieve well over twice the speed of MersenneTwister. java.util.Random is\n * about 1/3 slower than MersenneTwisterFast.\n *\n * <h3>About the Mersenne Twister</h3>\n * <p>\n * This is a Java version of the C-program for MT19937: Integer version. The MT19937 algorithm was\n * created by Makoto Matsumoto and Takuji Nishimura, who ask: \"When you use this, send an email to:\n * matumoto@math.keio.ac.jp with an appropriate reference to your work\". Indicate that this is a\n * translation of their algorithm into Java.\n *\n * <p>\n * <b>Reference. 
</b> Makoto Matsumoto and Takuji Nishimura, \"Mersenne Twister: A 623-Dimensionally\n * Equidistributed Uniform Pseudo-Random Number Generator\", <i>ACM Transactions on Modeling and\n * Computer Simulation,</i> Vol. 8, No. 1, January 1998, pp 3--30.\n *\n * <h3>About this Version</h3>\n *\n * <p>\n * <b>Changes since V19:</b> nextFloat(boolean, boolean) now returns float, not double.\n *\n * <p>\n * <b>Changes since V18:</b> Removed old final declarations, which used to potentially speed up the\n * code, but no longer.\n *\n * <p>\n * <b>Changes since V17:</b> Removed vestigial references to &= 0xffffffff which stemmed from the\n * original C code. The C code could not guarantee that ints were 32 bit, hence the masks. The\n * vestigial references in the Java code were likely optimized out anyway.\n *\n * <p>\n * <b>Changes since V16:</b> Added nextDouble(includeZero, includeOne) and nextFloat(includeZero,\n * includeOne) to allow for half-open, fully-closed, and fully-open intervals.\n *\n * <p>\n * <b>Changes Since V15:</b> Added serialVersionUID to quiet compiler warnings from Sun's overly\n * verbose compilers as of JDK 1.5.\n *\n * <p>\n * <b>Changes Since V14:</b> made strictfp, with StrictMath.log and StrictMath.sqrt in nextGaussian\n * instead of Math.log and Math.sqrt. This is largely just to be safe, as it presently makes no\n * difference in the speed, correctness, or results of the algorithm.\n *\n * <p>\n * <b>Changes Since V13:</b> clone() method CloneNotSupportedException removed.\n *\n * <p>\n * <b>Changes Since V12:</b> clone() method added.\n *\n * <p>\n * <b>Changes Since V11:</b> stateEquals(...) method added. MersenneTwisterFast is equal to other\n * MersenneTwisterFasts with identical state; likewise MersenneTwister is equal to other\n * MersenneTwister with identical state. This isn't equals(...) 
because that requires a contract of\n * immutability to compare by value.\n *\n * <p>\n * <b>Changes Since V10:</b> A documentation error suggested that setSeed(int[]) required an int[]\n * array 624 long. In fact, the array can be any non-zero length. The new version also checks for\n * this fact.\n *\n * <p>\n * <b>Changes Since V9:</b> readState(stream) and writeState(stream) provided.\n *\n * <p>\n * <b>Changes Since V8:</b> setSeed(int) was only using the first 28 bits of the seed; it should\n * have been 32 bits. For small-number seeds the behavior is identical.\n *\n * <p>\n * <b>Changes Since V7:</b> A documentation error in MersenneTwisterFast (but not MersenneTwister)\n * stated that nextDouble selects uniformly from the fully-closed interval [0,1]. It does not.\n * nextDouble's contract is identical across MersenneTwisterFast, MersenneTwister, and\n * java.util.Random, namely, selection in the half-open interval [0,1). That is, 1.0 should not be\n * returned. A similar contract exists in nextFloat.\n *\n * <p>\n * <b>Changes Since V6:</b> License has changed from LGPL to BSD. New timing information to compare\n * against java.util.Random. Recent versions of HotSpot have helped Random increase in speed to the\n * point where it is faster than MersenneTwister but slower than MersenneTwisterFast (which should\n * be the case, as it's a less complex algorithm but is synchronized).\n * \n * <p>\n * <b>Changes Since V5:</b> New empty constructor made to work the same as java.util.Random --\n * namely, it seeds based on the current time in milliseconds.\n *\n * <p>\n * <b>Changes Since V4:</b> New initialization algorithms. See <a\n * href=\"http://www.math.keio.ac.jp/matumoto/MT2002/emt19937ar.html\">\n * http://www.math.keio.ac.jp/matumoto/MT2002/emt19937ar.html</a>\n *\n * <p>\n * The MersenneTwister code is based on standard MT19937 C/C++ code by Takuji Nishimura, with\n * suggestions from Topher Cooper and Marc Rieffel, July 1997. 
The code was originally translated\n * into Java by Michael Lecuyer, January 1999, and the original code is Copyright (c) 1999 by\n * Michael Lecuyer.\n *\n * <h3>Java notes</h3>\n * \n * <p>\n * This implementation implements the bug fixes made in Java 1.2's version of Random, which means it\n * can be used with earlier versions of Java. See <a\n * href=\"http://www.javasoft.com/products/jdk/1.2/docs/api/java/util/Random.html\"> the JDK 1.2\n * java.util.Random documentation</a> for further documentation on the random-number generation\n * contracts made. Additionally, there's an undocumented bug in the JDK java.util.Random.nextBytes()\n * method, which this code fixes.\n *\n * <p>\n * Just like java.util.Random, this generator accepts a long seed but doesn't use all of it.\n * java.util.Random uses 48 bits. The Mersenne Twister instead uses 32 bits (int size). So it's best\n * if your seed does not exceed the int range.\n *\n * <p>\n * MersenneTwister can be used reliably on JDK version 1.1.5 or above. Earlier Java versions have\n * serious bugs in java.util.Random; only MersenneTwisterFast (and not MersenneTwister nor\n * java.util.Random) should be used with them.\n *\n * <h3>License</h3>\n *\n * Copyright (c) 2003 by Sean Luke. <br>\n * Portions copyright (c) 1993 by Michael Lecuyer. <br>\n * All rights reserved. 
<br>\n *\n * <p>\n * Redistribution and use in source and binary forms, with or without modification, are permitted\n * provided that the following conditions are met:\n * <ul>\n * <li>Redistributions of source code must retain the above copyright notice, this list of\n * conditions and the following disclaimer.\n * <li>Redistributions in binary form must reproduce the above copyright notice, this list of\n * conditions and the following disclaimer in the documentation and/or other materials provided with\n * the distribution.\n * <li>Neither the name of the copyright owners, their employers, nor the names of its contributors\n * may be used to endorse or promote products derived from this software without specific prior\n * written permission.\n * </ul>\n * <p>\n * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR\n * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND\n * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNERS OR\n * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\n * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,\n * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,\n * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY\n * WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n *\n * @version 20\n */\n\npublic strictfp class MersenneTwister\n    extends java.util.Random\n    implements Serializable, Cloneable\n{\n    // Serialization\n    private static final long serialVersionUID = -4035832775130174188L; // locked as of Version 15\n\n    // Period parameters\n    private static final int N = 624;\n    private static final int M = 397;\n    private static final int MATRIX_A = 0x9908b0df; // private static final * constant vector a\n    private static final int UPPER_MASK = 0x80000000; // most significant w-r bits\n    private static final int LOWER_MASK = 0x7fffffff; // least significant r bits\n\n    // Tempering parameters\n    private static final int TEMPERING_MASK_B = 0x9d2c5680;\n    private static final int TEMPERING_MASK_C = 0xefc60000;\n\n    private int mt[]; // the array for the state vector\n    private int mti; // mti==N+1 means mt[N] is not initialized\n    private int mag01[];\n\n    // a good initial seed (of int size, though stored in a long)\n    // private static final long GOOD_SEED = 4357;\n\n    /*\n     * implemented here because there's a bug in Random's implementation of the Gaussian code\n     * (divide by zero, and log(0), ugh!), yet its gaussian variables are private so we can't access\n     * them here. 
:-(\n     */\n\n    private double __nextNextGaussian;\n    private boolean __haveNextNextGaussian;\n\n    /* We're overriding all internal data, to my knowledge, so this should be okay */\n    public Object clone()\n    {\n        try {\n            MersenneTwister f = (MersenneTwister) (super.clone());\n            f.mt = (int[]) (mt.clone());\n            f.mag01 = (int[]) (mag01.clone());\n            return f;\n        }\n        catch (CloneNotSupportedException e) {\n            throw new InternalError();\n        } // should never happen\n    }\n\n    public boolean stateEquals(Object o)\n    {\n        if (o == this)\n            return true;\n        if (o == null || !(o instanceof MersenneTwister))\n            return false;\n        MersenneTwister other = (MersenneTwister) o;\n        if (mti != other.mti)\n            return false;\n        for (int x = 0; x < mag01.length; x++)\n            if (mag01[x] != other.mag01[x])\n                return false;\n        for (int x = 0; x < mt.length; x++)\n            if (mt[x] != other.mt[x])\n                return false;\n        return true;\n    }\n\n    /** Reads the entire state of the MersenneTwister RNG from the stream */\n    public void readState(DataInputStream stream)\n        throws IOException\n    {\n        int len = mt.length;\n        for (int x = 0; x < len; x++)\n            mt[x] = stream.readInt();\n\n        len = mag01.length;\n        for (int x = 0; x < len; x++)\n            mag01[x] = stream.readInt();\n\n        mti = stream.readInt();\n        __nextNextGaussian = stream.readDouble();\n        __haveNextNextGaussian = stream.readBoolean();\n    }\n\n    /** Writes the entire state of the MersenneTwister RNG to the stream */\n    public void writeState(DataOutputStream stream)\n        throws IOException\n    {\n        int len = mt.length;\n        for (int x = 0; x < len; x++)\n            stream.writeInt(mt[x]);\n\n        len = mag01.length;\n        for (int x = 0; x < len; 
x++)\n            stream.writeInt(mag01[x]);\n\n        stream.writeInt(mti);\n        stream.writeDouble(__nextNextGaussian);\n        stream.writeBoolean(__haveNextNextGaussian);\n    }\n\n    /**\n     * Constructor using the default seed.\n     */\n    public MersenneTwister()\n    {\n        this(System.currentTimeMillis());\n    }\n\n    /**\n     * Constructor using a given seed. Though you pass this seed in as a long, it's best to make\n     * sure it's actually an integer.\n     */\n    public MersenneTwister(long seed)\n    {\n        super(seed); /* just in case */\n        setSeed(seed);\n    }\n\n    /**\n     * Constructor using an array of integers as seed. Your array must have a non-zero length. Only\n     * the first 624 integers in the array are used; if the array is shorter than this then integers\n     * are repeatedly used in a wrap-around fashion.\n     */\n    public MersenneTwister(int[] array)\n    {\n        super(System.currentTimeMillis()); /* pick something at random just in case */\n        setSeed(array);\n    }\n\n    /**\n     * Initialize the pseudo random number generator. Don't pass in a long that's bigger than an int\n     * (Mersenne Twister only uses the first 32 bits for its seed).\n     */\n\n    synchronized public void setSeed(long seed)\n    {\n        // it's always good style to call super\n        super.setSeed(seed);\n\n        // Due to a bug in java.util.Random clear up to 1.2, we're\n        // doing our own Gaussian variable.\n        __haveNextNextGaussian = false;\n\n        mt = new int[N];\n\n        mag01 = new int[2];\n        mag01[0] = 0x0;\n        mag01[1] = MATRIX_A;\n\n        mt[0] = (int) seed; // per V17, no 0xffffffff mask needed: Java ints are 32 bits\n        for (mti = 1; mti < N; mti++) {\n            mt[mti] = (1812433253 * (mt[mti - 1] ^ (mt[mti - 1] >>> 30)) + mti);\n            /* See Knuth TAOCP Vol2. 3rd Ed. P.106 for multiplier. 
*/\n            /* In the previous versions, MSBs of the seed affect */\n            /* only MSBs of the array mt[]. */\n            /* 2002/01/09 modified by Makoto Matsumoto */\n            // mt[mti] &= 0xffffffff;\n            /* for >32 bit machines */\n        }\n    }\n\n    /**\n     * Sets the seed of the MersenneTwister using an array of integers. Your array must have a\n     * non-zero length. Only the first 624 integers in the array are used; if the array is shorter\n     * than this then integers are repeatedly used in a wrap-around fashion.\n     */\n\n    synchronized public void setSeed(int[] array)\n    {\n        if (array.length == 0)\n            throw new IllegalArgumentException(\"Array length must be greater than zero\");\n        int i, j, k;\n        setSeed(19650218);\n        i = 1;\n        j = 0;\n        k = (N > array.length ? N : array.length);\n        for (; k != 0; k--) {\n            mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >>> 30)) * 1664525)) + array[j] + j; /*\n                                                                                            * non\n                                                                                            * linear\n                                                                                            */\n            // mt[i] &= 0xffffffff; /* for WORDSIZE > 32 machines */\n            i++;\n            j++;\n            if (i >= N) {\n                mt[0] = mt[N - 1];\n                i = 1;\n            }\n            if (j >= array.length)\n                j = 0;\n        }\n        for (k = N - 1; k != 0; k--) {\n            mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >>> 30)) * 1566083941)) - i; /* non linear */\n            // mt[i] &= 0xffffffff; /* for WORDSIZE > 32 machines */\n            i++;\n            if (i >= N) {\n                mt[0] = mt[N - 1];\n                i = 1;\n            }\n        }\n        mt[0] = 0x80000000; /* MSB is 1; assuring non-zero 
initial array */\n    }\n\n    /**\n     * Returns an integer with <i>bits</i> bits filled with a random number.\n     */\n    synchronized protected int next(int bits)\n    {\n        int y;\n\n        if (mti >= N) // generate N words at one time\n        {\n            int kk;\n            final int[] mt = this.mt; // locals are slightly faster\n            final int[] mag01 = this.mag01; // locals are slightly faster\n\n            for (kk = 0; kk < N - M; kk++) {\n                y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK);\n                mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1];\n            }\n            for (; kk < N - 1; kk++) {\n                y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK);\n                mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1];\n            }\n            y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK);\n            mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1];\n\n            mti = 0;\n        }\n\n        y = mt[mti++];\n        y ^= y >>> 11; // TEMPERING_SHIFT_U(y)\n        y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y)\n        y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y)\n        y ^= (y >>> 18); // TEMPERING_SHIFT_L(y)\n\n        return y >>> (32 - bits); // hope that's right!\n    }\n\n    /*\n     * If you've got a truly old version of Java, you can omit these two next methods.\n     */\n\n    private synchronized void writeObject(ObjectOutputStream out)\n        throws IOException\n    {\n        // just so we're synchronized.\n        out.defaultWriteObject();\n    }\n\n    private synchronized void readObject(ObjectInputStream in)\n        throws IOException, ClassNotFoundException\n    {\n        // just so we're synchronized.\n        in.defaultReadObject();\n    }\n\n    /**\n     * This method is missing from jdk 1.0.x and below. 
JDK 1.1 includes this for us, but what the\n     * heck.\n     */\n    public boolean nextBoolean()\n    {\n        return next(1) != 0;\n    }\n\n    /**\n     * This generates a coin flip with a probability <tt>probability</tt> of returning true, else\n     * returning false. <tt>probability</tt> must be between 0.0 and 1.0, inclusive. Not as precise\n     * a random real event as nextBoolean(double), but twice as fast. To explicitly use this,\n     * remember you may need to cast to float first.\n     */\n\n    public boolean nextBoolean(float probability)\n    {\n        if (probability < 0.0f || probability > 1.0f)\n            throw new IllegalArgumentException(\"probability must be between 0.0 and 1.0 inclusive.\");\n        if (probability == 0.0f)\n            return false; // fix half-open issues\n        else if (probability == 1.0f)\n            return true; // fix half-open issues\n        return nextFloat() < probability;\n    }\n\n    /**\n     * This generates a coin flip with a probability <tt>probability</tt> of returning true, else\n     * returning false. <tt>probability</tt> must be between 0.0 and 1.0, inclusive.\n     */\n\n    public boolean nextBoolean(double probability)\n    {\n        if (probability < 0.0 || probability > 1.0)\n            throw new IllegalArgumentException(\"probability must be between 0.0 and 1.0 inclusive.\");\n        if (probability == 0.0)\n            return false; // fix half-open issues\n        else if (probability == 1.0)\n            return true; // fix half-open issues\n        return nextDouble() < probability;\n    }\n\n    /**\n     * This method is missing from JDK 1.1 and below. 
JDK 1.2 includes this for us, but what the\n * heck.\n     */\n\n    public int nextInt(int n)\n    {\n        if (n <= 0)\n            throw new IllegalArgumentException(\"n must be positive, got: \" + n);\n\n        if ((n & -n) == n)\n            return (int) ((n * (long) next(31)) >> 31);\n\n        int bits, val;\n        do {\n            bits = next(31);\n            val = bits % n;\n        }\n        while (bits - val + (n - 1) < 0);\n        return val;\n    }\n\n    /**\n     * This method is for completeness' sake. Returns a long drawn uniformly from 0 to n-1. Suffice\n     * it to say, n must be > 0, or an IllegalArgumentException is raised.\n     */\n\n    public long nextLong(long n)\n    {\n        if (n <= 0)\n            throw new IllegalArgumentException(\"n must be positive, got: \" + n);\n\n        long bits, val;\n        do {\n            bits = (nextLong() >>> 1);\n            val = bits % n;\n        }\n        while (bits - val + (n - 1) < 0);\n        return val;\n    }\n\n    /**\n     * A bug fix for versions of JDK 1.1 and below. 
JDK 1.2 fixes this for us, but what the heck.\n     */\n    public double nextDouble()\n    {\n        return (((long) next(26) << 27) + next(27)) / (double) (1L << 53);\n    }\n\n    /**\n     * Returns a double in the range from 0.0 to 1.0, possibly inclusive of 0.0 and 1.0 themselves.\n     * Thus:\n     * \n     * <p>\n     * <table border=0>\n     * <th>\n     * <td>Expression\n     * <td>Interval\n     * <tr>\n     * <td>nextDouble(false, false)\n     * <td>(0.0, 1.0)\n     * <tr>\n     * <td>nextDouble(true, false)\n     * <td>[0.0, 1.0)\n     * <tr>\n     * <td>nextDouble(false, true)\n     * <td>(0.0, 1.0]\n     * <tr>\n     * <td>nextDouble(true, true)\n     * <td>[0.0, 1.0]\n     * </table>\n     * \n     * <p>\n     * This version preserves all possible random values in the double range.\n     */\n    public double nextDouble(boolean includeZero, boolean includeOne)\n    {\n        double d = 0.0;\n        do {\n            d = nextDouble(); // grab a value, initially from half-open [0.0, 1.0)\n            if (includeOne && nextBoolean())\n                d += 1.0; // if includeOne, with 1/2 probability, push to [1.0, 2.0)\n        }\n        while ((d > 1.0) || // everything above 1.0 is always invalid\n                (!includeZero && d == 0.0)); // if we're not including zero, 0.0 is invalid\n        return d;\n    }\n\n    /**\n     * A bug fix for versions of JDK 1.1 and below. JDK 1.2 fixes this for us, but what the heck.\n     */\n\n    public float nextFloat()\n    {\n        return next(24) / ((float) (1 << 24));\n    }\n\n    /**\n     * Returns a float in the range from 0.0f to 1.0f, possibly inclusive of 0.0f and 1.0f\n     * themselves. 
Thus:\n     * \n     * <p>\n     * <table border=0>\n     * <th>\n     * <td>Expression\n     * <td>Interval\n     * <tr>\n     * <td>nextFloat(false, false)\n     * <td>(0.0f, 1.0f)\n     * <tr>\n     * <td>nextFloat(true, false)\n     * <td>[0.0f, 1.0f)\n     * <tr>\n     * <td>nextFloat(false, true)\n     * <td>(0.0f, 1.0f]\n     * <tr>\n     * <td>nextFloat(true, true)\n     * <td>[0.0f, 1.0f]\n     * </table>\n     * \n     * <p>\n     * This version preserves all possible random values in the float range.\n     */\n    public float nextFloat(boolean includeZero, boolean includeOne)\n    {\n        float d = 0.0f;\n        do {\n            d = nextFloat(); // grab a value, initially from half-open [0.0f, 1.0f)\n            if (includeOne && nextBoolean())\n                d += 1.0f; // if includeOne, with 1/2 probability, push to [1.0f, 2.0f)\n        }\n        while ((d > 1.0f) || // everything above 1.0f is always invalid\n                (!includeZero && d == 0.0f)); // if we're not including zero, 0.0f is invalid\n        return d;\n    }\n\n    /**\n     * A bug fix for all versions of the JDK. The JDK appears to use all four bytes in an integer as\n     * independent byte values! Totally wrong. I've submitted a bug report.\n     */\n\n    public void nextBytes(byte[] bytes)\n    {\n        for (int x = 0; x < bytes.length; x++)\n            bytes[x] = (byte) next(8);\n    }\n\n    /** For completeness' sake, though it's not in java.util.Random. */\n\n    public char nextChar()\n    {\n        // chars are 16-bit UniCode values\n        return (char) (next(16));\n    }\n\n    /** For completeness' sake, though it's not in java.util.Random. */\n\n    public short nextShort()\n    {\n        return (short) (next(16));\n    }\n\n    /** For completeness' sake, though it's not in java.util.Random. */\n\n    public byte nextByte()\n    {\n        return (byte) (next(8));\n    }\n\n    /**\n     * A bug fix for all JDK code including 1.2. 
nextGaussian can theoretically ask for the log of 0\n     * and divide it by 0! See Java bug <a\n     * href=\"http://developer.java.sun.com/developer/bugParade/bugs/4254501.html\">\n     * http://developer.java.sun.com/developer/bugParade/bugs/4254501.html</a>\n     */\n\n    synchronized public double nextGaussian()\n    {\n        if (__haveNextNextGaussian) {\n            __haveNextNextGaussian = false;\n            return __nextNextGaussian;\n        }\n        else {\n            double v1, v2, s;\n            do {\n                v1 = 2 * nextDouble() - 1; // between -1.0 and 1.0\n                v2 = 2 * nextDouble() - 1; // between -1.0 and 1.0\n                s = v1 * v1 + v2 * v2;\n            }\n            while (s >= 1 || s == 0);\n            double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s) / s);\n            __nextNextGaussian = v2 * multiplier;\n            __haveNextNextGaussian = true;\n            return v1 * multiplier;\n        }\n    }\n\n    /**\n     * Tests the code.\n     */\n    public static void main(String args[])\n    {\n        int j;\n\n        MersenneTwister r;\n\n        // CORRECTNESS TEST\n        // COMPARE WITH http://www.math.keio.ac.jp/matumoto/CODES/MT2002/mt19937ar.out\n\n        r = new MersenneTwister(new int[] { 0x123, 0x234, 0x345, 0x456 });\n        System.out.println(\"Output of MersenneTwister with new (2002/1/26) seeding mechanism\");\n        for (j = 0; j < 1000; j++) {\n            // first, convert the int from signed to \"unsigned\"\n            long l = (long) r.nextInt();\n            if (l < 0)\n                l += 4294967296L; // max int value\n            String s = String.valueOf(l);\n            while (s.length() < 10)\n                s = \" \" + s; // buffer\n            System.out.print(s + \" \");\n            if (j % 5 == 4)\n                System.out.println();\n        }\n\n        // SPEED TEST\n\n        final long SEED = 4357;\n\n        int xx;\n        long ms;\n        
System.out.println(\"\\nTime to test grabbing 100000000 ints\");\n\n        r = new MersenneTwister(SEED);\n        ms = System.currentTimeMillis();\n        xx = 0;\n        for (j = 0; j < 100000000; j++)\n            xx += r.nextInt();\n        System.out.println(\"Mersenne Twister: \" + (System.currentTimeMillis() - ms)\n                + \"          Ignore this: \" + xx);\n\n        System.out\n                .println(\"To compare this with java.util.Random, run this same test on MersenneTwisterFast.\");\n        System.out\n                .println(\"The comparison with Random is removed from MersenneTwister because it is a proper\");\n        System.out\n                .println(\"subclass of Random and this unfairly makes some of Random's methods un-inlinable,\");\n        System.out.println(\"so it would make Random look worse than it is.\");\n\n        // TEST TO COMPARE TYPE CONVERSION BETWEEN\n        // MersenneTwisterFast.java AND MersenneTwister.java\n\n        System.out.println(\"\\nGrab the first 1000 booleans\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextBoolean() + \" \");\n            if (j % 8 == 7)\n                System.out.println();\n        }\n        if (!(j % 8 == 7))\n            System.out.println();\n\n        System.out\n                .println(\"\\nGrab 1000 booleans of increasing probability using nextBoolean(double)\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextBoolean((double) (j / 999.0)) + \" \");\n            if (j % 8 == 7)\n                System.out.println();\n        }\n        if (!(j % 8 == 7))\n            System.out.println();\n\n        System.out\n                .println(\"\\nGrab 1000 booleans of increasing probability using nextBoolean(float)\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            
System.out.print(r.nextBoolean((float) (j / 999.0f)) + \" \");\n            if (j % 8 == 7)\n                System.out.println();\n        }\n        if (!(j % 8 == 7))\n            System.out.println();\n\n        byte[] bytes = new byte[1000];\n        System.out.println(\"\\nGrab the first 1000 bytes using nextBytes\");\n        r = new MersenneTwister(SEED);\n        r.nextBytes(bytes);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(bytes[j] + \" \");\n            if (j % 16 == 15)\n                System.out.println();\n        }\n        if (!(j % 16 == 15))\n            System.out.println();\n\n        byte b;\n        System.out.println(\"\\nGrab the first 1000 bytes -- must be same as nextBytes\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print((b = r.nextByte()) + \" \");\n            if (b != bytes[j])\n                System.out.print(\"BAD \");\n            if (j % 16 == 15)\n                System.out.println();\n        }\n        if (!(j % 16 == 15))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 shorts\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextShort() + \" \");\n            if (j % 8 == 7)\n                System.out.println();\n        }\n        if (!(j % 8 == 7))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 ints\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextInt() + \" \");\n            if (j % 4 == 3)\n                System.out.println();\n        }\n        if (!(j % 4 == 3))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 ints of different sizes\");\n        r = new MersenneTwister(SEED);\n        int max = 1;\n        for (j = 0; j < 1000; j++) {\n            
System.out.print(r.nextInt(max) + \" \");\n            max *= 2;\n            if (max <= 0)\n                max = 1;\n            if (j % 4 == 3)\n                System.out.println();\n        }\n        if (!(j % 4 == 3))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 longs\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextLong() + \" \");\n            if (j % 3 == 2)\n                System.out.println();\n        }\n        if (!(j % 3 == 2))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 longs of different sizes\");\n        r = new MersenneTwister(SEED);\n        long max2 = 1;\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextLong(max2) + \" \");\n            max2 *= 2;\n            if (max2 <= 0)\n                max2 = 1;\n            if (j % 4 == 3)\n                System.out.println();\n        }\n        if (!(j % 4 == 3))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 floats\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextFloat() + \" \");\n            if (j % 4 == 3)\n                System.out.println();\n        }\n        if (!(j % 4 == 3))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 doubles\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextDouble() + \" \");\n            if (j % 3 == 2)\n                System.out.println();\n        }\n        if (!(j % 3 == 2))\n            System.out.println();\n\n        System.out.println(\"\\nGrab the first 1000 gaussian doubles\");\n        r = new MersenneTwister(SEED);\n        for (j = 0; j < 1000; j++) {\n            System.out.print(r.nextGaussian() + \" \");\n            if (j % 3 == 2)\n               
 System.out.println();\n        }\n        if (!(j % 3 == 2))\n            System.out.println();\n\n    }\n\n}\n"
  },
  {
    "path": "src/utility/Parallel.java",
    "content": "package utility;\n\nimport java.util.Collection;\nimport java.util.concurrent.ConcurrentHashMap;\nimport java.util.concurrent.ForkJoinPool;\nimport java.util.concurrent.RecursiveAction;\nimport java.util.concurrent.RecursiveTask;\n\n/**\n * Utilities for parallel computing in loops over independent tasks. This class\n * provides convenient methods for parallel processing of tasks that involve\n * loops over indices, in which computations for different indices are\n * independent.\n * <p>\n * As a simple example, consider the following function that squares floats in\n * one array and stores the results in a second array.\n * \n * <pre>\n * <code>\n * static void sqr(float[] a, float[] b) {\n *   int n = a.length;\n *   for (int i=0; i&lt;n; ++i)\n *     b[i] = a[i]*a[i];\n * }\n * </code>\n * </pre>\n * \n * A serial version of a similar function for 2D arrays is:\n * \n * <pre>\n * <code>\n * static void sqrSerial(float[][] a, float[][] b) \n * {\n *   int n = a.length;\n *   for (int i=0; i&lt;n; ++i) {\n *     sqr(a[i],b[i]);\n * }\n * </code>\n * </pre>\n * \n * Using this class, the parallel version for 2D arrays is:\n * \n * <pre>\n * <code>\n * static void sqrParallel(final float[][] a, final float[][] b) {\n *   int n = a.length;\n *   Parallel.loop(n,new Parallel.LoopInt() {\n *     public void compute(int i) {\n *       sqr(a[i],b[i]);\n *     }\n *   });\n * }\n * </code>\n * </pre>\n * \n * In the parallel version, the method {@code compute} defined by the interface\n * {@code LoopInt} will be called n times for different indices i in the range\n * [0,n-1]. The order of indices is both indeterminant and irrelevant because\n * the computation for each index i is independent. 
The arrays a and b are\n * declared final as required for use in the implementation of {@code LoopInt}.\n * <p>\n * Note: because the method {@code loop} and interface {@code LoopInt} are\n * static members of this class, we can omit the class name prefix\n * {@code Parallel} if we first import these names with\n * \n * <pre>\n * <code>\n * import static edu.mines.jtk.util.Parallel.*;\n * </code>\n * </pre>\n * \n * A similar method facilitates tasks that reduce a sequence of indexed values\n * to one or more values. For example, given the following method:\n * \n * <pre>\n * <code>\n * static float sum(float[] a) {\n *   int n = a.length;\n *   float s = 0.0f;\n *   for (int i=0; i&lt;n; ++i)\n *     s += a[i];\n *   return s;\n * }\n * </code>\n * </pre>\n * \n * serial and parallel versions for 2D arrays may be written as:\n * \n * <pre>\n * <code>\n * static float sumSerial(float[][] a) {\n *   int n = a.length;\n *   float s = 0.0f;\n *   for (int i=0; i&lt;n; ++i)\n *     s += sum(a[i]);\n *   return s;\n * }\n * </code>\n * </pre>\n * \n * and\n * \n * <pre>\n * <code>\n * static float sumParallel(final float[][] a) {\n *   int n = a.length;\n *   return Parallel.reduce(n,new Parallel.ReduceInt&lt;Float&gt;() {\n *     public Float compute(int i) {\n *       return sum(a[i]);\n *     }\n *     public Float combine(Float s1, Float s2) {\n *       return s1+s2;\n *     }\n *   });\n * }\n * </code>\n * </pre>\n * \n * In the parallel version, we implement the interface {@code ReduceInt} with\n * two methods, one to {@code compute} sums of array elements and another to\n * {@code combine} two such sums together. The same pattern works for other\n * reduce operations. 
For example, with similar functions we could compute\n * minimum and maximum values (in a single reduce) for any indexed sequence of\n * values.\n * <p>\n * More general loops are supported, and are equivalent to the following serial\n * code:\n * \n * <pre>\n * <code>\n * for (int i=begin; i&lt;end; i+=step)\n *   // some computation that depends on i\n * </code>\n * </pre>\n * \n * The methods loop and reduce require that begin is less than end and that step\n * is positive. The requirement that begin is less than end ensures that reduce\n * is always well-defined. The requirement that step is positive ensures that\n * the loop terminates.\n * <p>\n * Static methods loop and reduce submit tasks to a fork-join framework that\n * maintains a pool of threads shared by all users of these methods. These\n * methods recursively split tasks so that disjoint sets of indices are\n * processed in parallel by different threads.\n * <p>\n * In addition to the three loop parameters begin, end, and step, a fourth\n * parameter chunk may be specified. This chunk parameter is a threshold for\n * splitting tasks so that they can be performed in parallel. If a range of\n * indices to be processed is smaller than the chunk size, or if too many tasks\n * have already been queued for processing, then the indices are processed\n * serially. Otherwise, the range is split into two parts for processing by new\n * tasks. If specified, the chunk size is a lower bound; the number of indices\n * processed serially will never be lower, but may be higher, than a specified\n * chunk size. The default chunk size is one.\n * <p>\n * The default chunk size is often sufficient, because the test for an excess\n * number of queued tasks prevents tasks from being split needlessly. 
This test\n * is especially useful when parallel loops are nested, as when looping over\n * elements of multi-dimensional arrays.\n * <p>\n * For example, an implementation of the method {@code sqrParallel} for 3D\n * arrays could simply call the 2D version listed above. Tasks will naturally\n * tend to be split for outer loops, but not inner loops, thereby reducing\n * overhead, time spent splitting and queueing tasks.\n * <p>\n * Reference: A Java Fork/Join Framework, by Doug Lea, describes the framework\n * used to implement this class. This framework will be part of JDK 7.\n * \n * @author Dave Hale, Colorado School of Mines\n * @version 2010.11.23\n */\npublic class Parallel\n{\n\n\t/** A loop body that computes something for an int index. */\n\tpublic interface LoopInt\n\t{\n\n\t\t/**\n\t\t * Computes for the specified loop index.\n\t\t * \n\t\t * @param i\n\t\t *            loop index.\n\t\t */\n\t\tpublic void compute(int i);\n\t}\n\n\t/** A loop body that computes and returns a value for an int index. */\n\tpublic interface ReduceInt<V>\n\t{\n\n\t\t/**\n\t\t * Returns a value computed for the specified loop index.\n\t\t * \n\t\t * @param i\n\t\t *            loop index.\n\t\t * @return the computed value.\n\t\t */\n\t\tpublic V compute(int i);\n\n\t\t/**\n\t\t * Returns the combination of two specified values.\n\t\t * \n\t\t * @param v1\n\t\t *            a value.\n\t\t * @param v2\n\t\t *            a value.\n\t\t * @return the combined value.\n\t\t */\n\t\tpublic V combine(V v1, V v2);\n\t}\n\n\t/**\n\t * A wrapper for objects that are not thread-safe. Such objects have methods\n\t * that cannot safely be executed concurrently in multiple threads. To use\n\t * an unsafe object within a parallel computation, first construct an\n\t * instance of this wrapper. Then, within the compute method, get the unsafe\n\t * object; if null, construct and set a new unsafe object in this wrapper,\n\t * before using the unsafe object to perform the computation. 
This pattern\n\t * ensures that each thread computes using a distinct unsafe object. For\n\t * example,\n\t * \n\t * <pre>\n\t * <code>\n\t * final Parallel.Unsafe&lt;Worker&gt; nts = new Parallel.Unsafe&lt;Worker&gt;();\n\t * Parallel.loop(count,new Parallel.LoopInt() {\n\t *   public void compute(int i) {\n\t *     Worker w = nts.get(); // get worker for the current thread\n\t *     if (w==null) nts.set(w=new Worker()); // if null, make one\n\t *     w.work(); // the method work need not be thread-safe\n\t *   }\n\t * });\n\t * </code>\n\t * </pre>\n\t * \n\t * This wrapper is most useful when (1) the cost of constructing an unsafe\n\t * object is high, relative to the cost of each call to compute, and (2) the\n\t * number of threads calling compute is significantly lower than the total\n\t * number of such calls. Otherwise, if either of these conditions is false,\n\t * then simply construct a new unsafe object within the compute method.\n\t * <p>\n\t * This wrapper works much like the Java standard class ThreadLocal, except\n\t * that an object within this wrapper can be garbage-collected before its\n\t * thread dies. 
This difference is important because fork-join worker\n\t * threads are pooled and will typically die only when a program ends.\n\t */\n\tpublic static class Unsafe<T>\n\t{\n\n\t\t/**\n\t\t * Constructs a wrapper for objects that are not thread-safe.\n\t\t */\n\t\tpublic Unsafe()\n\t\t{\n\t\t\tint initialCapacity = 16; // the default initial capacity\n\t\t\tfloat loadFactor = 0.5f; // huge numbers of threads are unlikely\n\t\t\tint concurrencyLevel = 2 * _pool.getParallelism();\n\t\t\t_map = new ConcurrentHashMap<Thread, T>(initialCapacity,\n\t\t\t\tloadFactor, concurrencyLevel);\n\t\t}\n\n\t\t/**\n\t\t * Gets the object in this wrapper for the current thread.\n\t\t * \n\t\t * @return the object; null, if not yet set for the current thread.\n\t\t */\n\t\tpublic T get()\n\t\t{\n\t\t\treturn _map.get(Thread.currentThread());\n\t\t}\n\n\t\t/**\n\t\t * Sets the object in this wrapper for the current thread.\n\t\t * \n\t\t * @param object\n\t\t *            the object.\n\t\t */\n\t\tpublic void set(T object)\n\t\t{\n\t\t\t_map.put(Thread.currentThread(), object);\n\t\t}\n\n\t\t/**\n\t\t * Returns a collection of all unsafe objects in this wrapper. 
This\n\t\t * method is useful only after parallel loops have ended.\n\t\t * \n\t\t * @return the collection of unsafe objects.\n\t\t */\n\t\tpublic Collection<T> getAll()\n\t\t{\n\t\t\treturn _map.values();\n\t\t}\n\n\t\tprivate final ConcurrentHashMap<Thread, T> _map;\n\t}\n\n\t/**\n\t * Performs a loop <code>for (int i=0; i&lt;end; ++i)</code>.\n\t * \n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param body\n\t *            the loop body.\n\t */\n\tpublic static void loop(int end, LoopInt body)\n\t{\n\t\tloop(0, end, 1, 1, body);\n\t}\n\n\t/**\n\t * Performs a loop <code>for (int i=begin; i&lt;end; ++i)</code>.\n\t * \n\t * @param begin\n\t *            the begin index for the loop; must be less than end.\n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param body\n\t *            the loop body.\n\t */\n\tpublic static void loop(int begin, int end, LoopInt body)\n\t{\n\t\tloop(begin, end, 1, 1, body);\n\t}\n\n\t/**\n\t * Performs a loop <code>for (int i=begin; i&lt;end; i+=step)</code>.\n\t * \n\t * @param begin\n\t *            the begin index for the loop; must be less than end.\n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param step\n\t *            the index increment; must be positive.\n\t * @param body\n\t *            the loop body.\n\t */\n\tpublic static void loop(int begin, int end, int step, LoopInt body)\n\t{\n\t\tloop(begin, end, step, 1, body);\n\t}\n\n\t/**\n\t * Performs a loop <code>for (int i=begin; i&lt;end; i+=step)</code>.\n\t * \n\t * @param begin\n\t *            the begin index for the loop; must be less than end.\n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param step\n\t *            the index increment; must be positive.\n\t * @param chunk\n\t *            the chunk size; must be positive.\n\t * @param body\n\t *            the loop body.\n\t */\n\tpublic static void loop(int begin, int 
end, int step, int chunk,\n\t\tLoopInt body)\n\t{\n\t\tcheckArgs(begin, end, step, chunk);\n\t\tif (_serial || end <= begin + chunk * step) {\n\t\t\tfor (int i = begin; i < end; i += step) {\n\t\t\t\tbody.compute(i);\n\t\t\t}\n\t\t}\n\t\telse {\n\t\t\tLoopIntAction task = new LoopIntAction(begin, end, step, chunk,\n\t\t\t\tbody);\n\t\t\tif (LoopIntAction.inForkJoinPool()) {\n\t\t\t\ttask.invoke();\n\t\t\t}\n\t\t\telse {\n\t\t\t\t_pool.invoke(task);\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Performs a reduce <code>for (int i=0; i&lt;end; ++i)</code>.\n\t * \n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param body\n\t *            the loop body.\n\t * @return the computed value.\n\t */\n\tpublic static <V> V reduce(int end, ReduceInt<V> body)\n\t{\n\t\treturn reduce(0, end, 1, 1, body);\n\t}\n\n\t/**\n\t * Performs a reduce <code>for (int i=begin; i&lt;end; ++i)</code>.\n\t * \n\t * @param begin\n\t *            the begin index for the loop; must be less than end.\n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param body\n\t *            the loop body.\n\t * @return the computed value.\n\t */\n\tpublic static <V> V reduce(int begin, int end, ReduceInt<V> body)\n\t{\n\t\treturn reduce(begin, end, 1, 1, body);\n\t}\n\n\t/**\n\t * Performs a reduce <code>for (int i=begin; i&lt;end; i+=step)</code>.\n\t * \n\t * @param begin\n\t *            the begin index for the loop; must be less than end.\n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param step\n\t *            the index increment; must be positive.\n\t * @param body\n\t *            the loop body.\n\t * @return the computed value.\n\t */\n\tpublic static <V> V reduce(int begin, int end, int step, ReduceInt<V> body)\n\t{\n\t\treturn reduce(begin, end, step, 1, body);\n\t}\n\n\t/**\n\t * Performs a reduce <code>for (int i=begin; i&lt;end; i+=step)</code>.\n\t * \n\t * @param begin\n\t *            the begin 
index for the loop; must be less than end.\n\t * @param end\n\t *            the end index (not included) for the loop.\n\t * @param step\n\t *            the index increment; must be positive.\n\t * @param chunk\n\t *            the chunk size; must be positive.\n\t * @param body\n\t *            the loop body.\n\t * @return the computed value.\n\t */\n\tpublic static <V> V reduce(int begin, int end, int step, int chunk,\n\t\tReduceInt<V> body)\n\t{\n\t\tcheckArgs(begin, end, step, chunk);\n\t\tif (_serial || end <= begin + chunk * step) {\n\t\t\tV v = body.compute(begin);\n\t\t\tfor (int i = begin + step; i < end; i += step) {\n\t\t\t\tV vi = body.compute(i);\n\t\t\t\tv = body.combine(v, vi);\n\t\t\t}\n\t\t\treturn v;\n\t\t}\n\t\telse {\n\t\t\tReduceIntTask<V> task = new ReduceIntTask<V>(begin, end, step,\n\t\t\t\tchunk, body);\n\t\t\tif (ReduceIntTask.inForkJoinPool()) {\n\t\t\t\treturn task.invoke();\n\t\t\t}\n\t\t\telse {\n\t\t\t\treturn _pool.invoke(task);\n\t\t\t}\n\t\t}\n\t}\n\n\t/**\n\t * Enables or disables parallel processing by all methods of this class. By\n\t * default, parallel processing is enabled. If disabled, all tasks will be\n\t * executed on the current thread.\n\t * <p>\n\t * <em>Setting this flag to false disables parallel processing for all\n\t * users of this class.</em> This method should therefore be used for\n\t * testing and benchmarking only.\n\t * \n\t * @param parallel\n\t *            true, for parallel processing; false, otherwise.\n\t */\n\tpublic static void setParallel(boolean parallel)\n\t{\n\t\t_serial = !parallel;\n\t}\n\n\t// /////////////////////////////////////////////////////////////////////////\n\t// private\n\n\t// Implementation notes:\n\t// Each fork-join task below has a range of indices to be processed.\n\t// If the range is less than or equal to the chunk size, or if the\n\t// queue for the current thread holds too many tasks already, then\n\t// simply process the range on the current thread. 
Otherwise, split\n\t// the range into two parts that are approximately equal, ensuring\n\t// that the left part is at least as large as the right part. If the\n\t// right part is not empty, fork a new task. Then compute the left\n\t// part in the current thread, and, if necessary, join the right part.\n\n\t// Threshold for number of surplus queued tasks. Used below to\n\t// determine whether or not to split a task into two subtasks.\n\tprivate static final int NSQT = 6;\n\n\t// The pool shared by all fork-join tasks created through this class.\n\tprivate static ForkJoinPool _pool = new ForkJoinPool();\n\n\t// Serial flag; true for no parallel processing.\n\tprivate static boolean _serial = false;\n\n\t/**\n\t * Checks loop arguments.\n\t */\n\tprivate static void checkArgs(int begin, int end, int step, int chunk)\n\t{\n\t\targument(begin < end, \"begin<end\");\n\t\targument(step > 0, \"step>0\");\n\t\targument(chunk > 0, \"chunk>0\");\n\t}\n\n\t/**\n\t * Throws an IllegalArgumentException if the specified condition is false.\n\t * \n\t * @param condition\n\t *            the condition required to be true.\n\t * @param message\n\t *            a description of the required condition.\n\t */\n\tpublic static void argument(boolean condition, String message)\n\t{\n\t\tif (!condition)\n\t\t\tthrow new IllegalArgumentException(\"required condition: \" + message);\n\t}\n\n\t/**\n\t * Splits range [begin:end) into [begin:middle) and [middle:end). 
The\n\t * returned middle index equals begin plus an integer multiple of step.\n\t */\n\tprivate static int middle(int begin, int end, int step)\n\t{\n\t\treturn begin + step + ((end - begin - 1) / 2) / step * step;\n\t}\n\n\t/**\n\t * Fork-join task for parallel loop.\n\t */\n\tprivate static class LoopIntAction\n\t\textends RecursiveAction\n\t{\n\t\tLoopIntAction(int begin, int end, int step, int chunk, LoopInt body)\n\t\t{\n\t\t\tassert begin < end : \"begin < end\";\n\t\t\t_begin = begin;\n\t\t\t_end = end;\n\t\t\t_step = step;\n\t\t\t_chunk = chunk;\n\t\t\t_body = body;\n\t\t}\n\n\t\t@Override\n\t\tprotected void compute()\n\t\t{\n\t\t\tif (_end <= _begin + _chunk * _step\n\t\t\t\t|| getSurplusQueuedTaskCount() > NSQT) {\n\t\t\t\tfor (int i = _begin; i < _end; i += _step) {\n\t\t\t\t\t_body.compute(i);\n\t\t\t\t}\n\t\t\t}\n\t\t\telse {\n\t\t\t\tint middle = middle(_begin, _end, _step);\n\t\t\t\tLoopIntAction l = new LoopIntAction(_begin, middle, _step,\n\t\t\t\t\t_chunk, _body);\n\t\t\t\tLoopIntAction r = (middle < _end) ? 
new LoopIntAction(middle,\n\t\t\t\t\t_end, _step, _chunk, _body) : null;\n\t\t\t\tif (r != null)\n\t\t\t\t\tr.fork();\n\t\t\t\tl.compute();\n\t\t\t\tif (r != null)\n\t\t\t\t\tr.join();\n\t\t\t}\n\t\t}\n\n\t\tprivate final int _begin, _end, _step, _chunk;\n\t\tprivate final LoopInt _body;\n\t}\n\n\t/**\n\t * Fork-join task for parallel reduce.\n\t */\n\tprivate static class ReduceIntTask<V>\n\t\textends RecursiveTask<V>\n\t{\n\t\tReduceIntTask(int begin, int end, int step, int chunk, ReduceInt<V> body)\n\t\t{\n\t\t\tassert begin < end : \"begin < end\";\n\t\t\t_begin = begin;\n\t\t\t_end = end;\n\t\t\t_step = step;\n\t\t\t_chunk = chunk;\n\t\t\t_body = body;\n\t\t}\n\n\t\t@Override\n\t\tprotected V compute()\n\t\t{\n\t\t\tif (_end <= _begin + _chunk * _step\n\t\t\t\t|| getSurplusQueuedTaskCount() > NSQT) {\n\t\t\t\tV v = _body.compute(_begin);\n\t\t\t\tfor (int i = _begin + _step; i < _end; i += _step) {\n\t\t\t\t\tV vi = _body.compute(i);\n\t\t\t\t\tv = _body.combine(v, vi);\n\t\t\t\t}\n\t\t\t\treturn v;\n\t\t\t}\n\t\t\telse {\n\t\t\t\tint middle = middle(_begin, _end, _step);\n\t\t\t\tReduceIntTask<V> l = new ReduceIntTask<V>(_begin, middle,\n\t\t\t\t\t_step, _chunk, _body);\n\t\t\t\tReduceIntTask<V> r = (middle < _end) ? new ReduceIntTask<V>(\n\t\t\t\t\tmiddle, _end, _step, _chunk, _body) : null;\n\t\t\t\tif (r != null)\n\t\t\t\t\tr.fork();\n\t\t\t\tV v = l.compute();\n\t\t\t\tif (r != null)\n\t\t\t\t\tv = _body.combine(v, r.join());\n\t\t\t\treturn v;\n\t\t\t}\n\t\t}\n\n\t\tprivate final int _begin, _end, _step, _chunk;\n\t\tprivate final ReduceInt<V> _body;\n\t}\n}"
  },
  {
    "path": "test/corpus.LABEL",
    "content": "apple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\napple\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\ngoogle\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\n
microsoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\nmicrosoft\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\ntwitter\n"
  },
  {
    "path": "test/corpus.txt",
    "content": "iphone crack iphone \nadding support iphone announced \nyoutube video guy siri pretty love \nrim made easy switch iphone yeah \nrealized ios \ncurrent blackberry user bit disappointed move android iphone \nthings siri sooo glad gave siri sense humor \ngreat personal event tonight store \ncompanies experience customer service \napply job hope call lol \nlmao siri find hide body \nregistered developer appreciated \nwow great deals ipad gen offers great deals gen ipads \nlearning trip hong kong gotta hand iphones apps \ndark side hey send free iphone publicly burn blackberry \nfind mac air \nmacbook keyboard lunch break today warranty \nipads replace \nsiri amazing \namazing ios feature \nreply featured education apps website today sweet \nreply useless days \niphone yesterday awesome amount info \nquestion brother iphone \npeople iphone phone happy \nceo points ios \nbus iphone \number appstore itunes mobile devices talking desktop application \nbring ipad ipad set red red ipad \nsells million iphone weekend steve jobs lives iphone \napologize \ndownloads ios users \nlmfao argument siri \nincredible million iphone screenshot days iphone iphone \nfixed ios battery drain problem replacement iphone working \nbrand macbook professional macbook years miss time \nsiri dad mom brother girlfriend \nstore amazing call waiting music \nsweet replaced \nbad sells million iphones debut weekend smartphone \nloving technology iphone mac air icloud technology \nloving ios update \nmention store great customer service store \ntime iphone forward man longer paying texts \ngirlfriend iphone great \nicloud set works cloud \nmommy totally email company great service store \nloving ios upgrade iphone \nios ipad \nmaking switch android iphone iphone smartphone store \nincredible people offering water macbook professional wow \nmacbook sick \nplay man loving camera iphone facebook \nyeah ios changed life \nreader worldwide web \nlove service case hand case \nyears jobs 
iphone iphone iphone \nblackberry years lost service moving iphone \nsells million iphones days \nweekend iphone \nmacbook professional year time selling android \npost card \nputting kind glad hear alive \ngod youtube bad ass system loving \ndays iphone nice gave \nios email lock screen opening unlocking \nword wow iphone weekend sales top million \nlove ios easter eggs pull middle top bottom pulls awesome feature ios \nlove ios easter eggs pull middle top bottom pulls awesome feature \nrun beautiful morning man love ios iphone \nsimply \nmade happy text lol text \nday great customer service received today phone phone \nloving ipod update \nupgraded iphone siri worth upgrade forward siri \ngreat world missed \nloving iphone ios \nlove \niphone great genius \ncards application card arrived local post office today \niphone siri \nmeet siri iphone click link \nwork feel worst \nios upgrade good luck blackberry \nloving ios awesome \niphone addicted club \nguy playing facetime watching game bar \nblackberry boo powered technology work \nipod time iphone good job guys \nsiri year lead lost \nios sweet notifications phone search covers mail wifi sync icloud backup integrated \ngreat james story today times retail success \nworld due ios guys \nimpressive service genius bar metro center power replaced free screen replacement free \nnice guy store replaced phone showed crack screen \niphone battery longer day happened edge iphone nice job \nminutes write blackberry showing \neye phone impressed \niphone space amazing products people things \nmaking ipad feel ios \nnexus good feel bit guess android users android \nnice game helps search \nnice game helps search facebook \nbuild website website free \nandroid ics pretty good worth \nandroid ice cream sandwich nexus android nexus \nexciting day ice cream sandwich day android \nwow nexus beautiful totally gonna market share smart phone market \nintegrated data usage manager brilliant design watching lol \nice cream sandwich 
android works htc desire \nice cream sandwich sounds android ice cream sandwich \namazing imo android missing \nforget phone nice feature android nexus \nfinally unveiled android ice cream sandwich good \nfinally searches logged users \nrim strategy released hours release ics \nman love galaxy nexus samsung android \ndoubt \nshare winning war \ndear galaxy nexus send email technology \ntelegraph reports biggest threat facebook power users \nsamsung made bad android king \nfacebook power users telegraph socialmedia \nimpressed android update good font design \nvideo wallet wow \ntweet remember spell straight \nandroid samsung nexus \nefficient fun releases infinite digital bookcase \npass social seo facebook \nice cream sandwich stop carriers bullying smartphone users android \nagree freaking awesome \nicecream great \nhelps \nsamsung galaxy nexus iphone \nice cream sandwich delicious iphone launches android aka \nloving \nsamsung push mobile experience forward \nfinally power volume screenshot ics \nnexus press conference slick \nhigh school appreciated \nscream scream scream android job major game mobile space \nthinking ahead \nventurebeat virtual bookcase sharing \nandroid phone keeping iphone \nandroid ice cream sandwich feature closer roboto type face read \nwork samsung android ics impressive \nadd profile webgl project add addthis \nwork company work \ninvention \nwait ice cream sandwich android \nstop nexus \nphone \nandroid device updated galaxy nexus \nandroid introducing ice cream sandwich delicious version android ics \nexcited android features android ics \nwait nexus play \ncheck video introducing galaxy nexus simple beautiful smart youtube android nexus \ncream ice cream phone job \ngreat small businesses platform features thoughts \nloves presentations tool docs adding video \nbrilliant webgl bookcase \nsearches things \nandroid ice cream introducing galaxy nexus simple beautiful smart \nnexus prime android \ninteresting bookcase venturebeat 
releases infinite digital bookcase \ngood finally focus user experience android \nics awesome phone android motorola \niphone ice cream sandwich android \nnexus line smart move \nandroid beam alright made team team android \nandroid reply font good start ics \nice cream sandwich face unlock works \nready ice cream sandwich ics nexus android android \nice cream sandwich android \ntaste ice cream sandwich bite \nsamsung event live blog gadget haven android \nandroid ice cream sandwich make smartphone operating systems \nphoto sharing people application ice cream sandwich imo ics \nandroid nexus phone makes iphone cheap store android \nsweet ice cream sandwich android ice cream sandwich officially ics \nraise hand android powered phone samsung \nsiri android device replace iphone \nnexus page live nexus android \nexcited android beam face unlock android ics \nlinkedin tools company page contact \nsamsung ice cream sandwich samsung \nintroducing galaxy nexus simple beautiful smart android ics samsung \nglad design android shows waiting \nthoughts android ics excited play features android \nregister galaxy nexus android \nwow webgl infinite bookcase \nics awesome wait face unlock android \ngotta pretty android chrome android \nnovember direct purchase samsung \nnexus wanna awesome \nevent time change android samsung \nios user ics awesome great job \nyeah great job ics \nliterally mind blown samsung \nmotorola verizon perfect \nopens door spanish entrepreneurs project \nintel ibm \nwindows phone mango update process ahead schedule mango \nback smartphone rich \nword works computer \nfree gen stores \nwatch codename data explorer ctp coming \nlunch today vslive \nwatch codename data explorer ctp coming month \ndetails search improvements windows start screen \nmango shows taste smartphone success mango \nawesome moving dev finally local \nstores offer free windows phone devices \nstores offer free windows phone devices neowin \nstore spend hard vslive \nfree west check 
\nhey parents free tools kids online live family \ncloud offers students free access improve tcn \nawesome bit \ndetails windows search improvements \nyeah taking metro yeah good android \nlove kids tech \nexplains improvements windows start screen search tech \nsearch idea search great \nbing king search search \npowerpoint users power create service bye solutions \nfuture information innovators nov info \ncurate personal history project greenwich month \nbeam research project \ngreat sql server session \nworks days \nballmer thinks computer scientist android tech agree \ngreat time \nwin server works fine vmware \nwow tech turns body touchscreen psfk \nlove love feeling building vslive bringing conference \nresearch shows awesome step closer bit kinect \nresearch shows science science fact cool sound \nresearch shows science science fact \nzune music canada music news \nkinect makes learning playful education \nmango \ncheck change world \ngood world wait \nwatching windows pretty impressive finally mac interesting battle store \nxbox share \ngod \nblog post cool tool mouse tools \nforget siri beating speech commands mango siri \ntests proves appsense enterprise capability users personalization database enterprise \nsoftware good points sap dynamics \ngood dev \nsecure anti \nimpressed creating images \nmac blown marketing \nyahoo sale years back bought glad deal year \nomnitouch impressive technology \ngood bing paying \nipads windows tablets study \nhome day great time \nmango shows taste smartphone success \npicture services cloud love \nwindows net dev \nnice talk community \nomg sharepoint working \ninnovation sad sad \noffice love genius \nlove gates foundation \ngood \nskype family amazing things \nabsolutely loving mouse \nfan cool video turn surface touchscreen \nwow android ics lots talk mango launch people public speaking \nupdated computer windows \nics android kill mango nokia \npeople names mail week \noutlook mac sucks hate \nxbox accounts hack 
reports \nupdate net \nwindows media center fail \neclipsed \nword upgrade doc doc word won open doc suck \nu.s. antitrust leaving business played dumb \nlync crash issue mac fixed \nbroke played engages racketeering calls respect \nnokia chief executive mole \nfrozen xbox live xbl accounts online games report hacked \ngave windows dev preview good waiting beta windows \npowerpoint fix powerpoint presentations \neclipsed guardian \nkind search \ngreat time family advertising \nwindows forget past antitrust issues \npaying make racketeering \nday talking talk tomorrow waiting \nreader compares albatross neck agree join \nlot word freeze minutes \nlol perfect simple hate windows phones \nmonths months lose \nreader compares albatross neck agree join discussion \nmake sleep plan \nfeel world put facebook blackberry helps \nmiss boo \neverytime leave back back telling lol \napplication ass theme \nsleep sleep \nstarting sending hashtags emails taking lives \nshit lol hell \ntoday introduced social media love \nfacebook \nyeah shows glad \npretty facebook \ngotta love shit round world speed \nbed gonna minute bed \ndear fucking missed today internet \ntweet keeping busy school \ngood thing people left social \nsocial media \nguess addicted university exam questions \ngood thing people left social side \napples facebook content \nbed favorite application facebook \nfacebook change makes excited privacy \nimpressive numbers smm socialmedia \nfuck facebook bullshit bitch \ncool love \nfuck facebook follow \nhaven shit man haven fun \nfind song end television show watched \nliterally back facebook text email technology good \nisn pretty damn amazing hope year fast \ndear missed promise touch \nbored sad mad happy true friend \nfacebook sucks amp \nshit funny haven shit day \nvoice people real life lol \nyeah time bug \nscience hashtags facebook \nfeeling real world \nbiggest \nfacebook messed make add reliable \nfreaking kidding wth \ntomorrow blue ass bird continued \ndead 
\nemails telling \nsucks follow \npeople reporting retweets working technical problem \nback lol \nretweets broken haven tuesday \ntomorrow blue ass bird ass \nain showing current mentions tweets \ngonna problems fixed asap \nretweets \nman boring \napplication show touch tweet \ntrouble application updating application \nmessed everytime text message \nshow fucking retweets bitch \nsooo trash \nshowing retweets shit \nmom argument pretty \naddicted care \nappreciated start working computer \nretweets section account working hours problem \ngood send bloody tweets \nfeel \nmake account \nfucking late damn \ndear fix shit retweets mentions \ndead fuck \npoint \ngiving tweets tweeted past days lol \nmessed followers numbers \ntimeline mentions shit \ngarbage \nhell television man \nstupid fucking give damn mentions ugh \nfucking \nfacebook television wanna study \nshow retweets ill back facebook \nreply opinions \nforget day time haven \nblogs tumblr \ntalk step game \nreminder fail \njoin follow \nways competition \npeople facebook day life \ndrop follow show love \ntelling reply \nsleep time \nemotions \ncall night \nwork break time yeah \ntumblr love \nage year days hours minutes seconds find \nwanna aye shit living \nshout favorite people happy girls \nfollow back \nsleep good people trip \n"
  },
  {
    "path": "test/corpus_test.txt",
    "content": "making ipad feel ios \nnexus good feel bit guess android users android \nnice game helps search \nnice game helps search facebook \nbuild website website free \nandroid ics pretty good worth \nandroid ice cream sandwich nexus android nexus \nexciting day ice cream sandwich day android \nwow nexus beautiful totally gonna market share smart phone market \nintegrated data usage manager brilliant design watching lol \nice cream sandwich android works htc desire \nice cream sandwich sounds android ice cream sandwich \namazing imo android missing \nforget phone nice feature android nexus \nfinally unveiled android ice cream sandwich good \nfinally searches logged users \nrim strategy released hours release ics \nman love galaxy nexus samsung android \ndoubt \nshare winning war \ndear galaxy nexus send email technology \ntelegraph reports biggest threat facebook power users \nsamsung made bad android king \nfacebook power users telegraph socialmedia \nimpressed android update good font design \nvideo wallet wow \ntweet remember spell straight \nandroid samsung nexus \nefficient fun releases infinite digital bookcase \npass social seo facebook \nice cream sandwich stop carriers bullying smartphone users android \nagree freaking awesome \nicecream great \nhelps \nsamsung galaxy nexus iphone \nice cream sandwich delicious iphone launches android aka \nloving \nsamsung push mobile experience forward \nfinally power volume screenshot ics \nnexus press conference slick \nhigh school appreciated \nscream scream scream android job major game mobile space \nthinking ahead \nventurebeat virtual bookcase sharing \nandroid phone keeping iphone \nandroid ice cream sandwich feature closer roboto type face read \nwork samsung android ics impressive \nadd profile webgl project add addthis \nwork company work \ninvention \nwait ice cream sandwich android \nstop nexus \nphone \nandroid device updated galaxy nexus \nandroid introducing ice cream sandwich delicious version 
android ics \nexcited android features android ics \nwait nexus play \ncheck video introducing galaxy nexus simple beautiful smart youtube android nexus \ncream ice cream phone job \ngreat small businesses platform features thoughts \nloves presentations tool docs adding video \nbrilliant webgl bookcase \nsearches things \nandroid ice cream introducing galaxy nexus simple beautiful smart \nnexus prime android \ninteresting bookcase venturebeat releases infinite digital bookcase \ngood finally focus user experience android \nics awesome phone android motorola \niphone ice cream sandwich android \nnexus line smart move \nandroid beam alright made team team android \nandroid reply font good start ics \nice cream sandwich face unlock works \nready ice cream sandwich ics nexus android android \nice cream sandwich android \ntaste ice cream sandwich bite \nsamsung event live blog gadget haven android \nandroid ice cream sandwich make smartphone operating systems \nphoto sharing people application ice cream sandwich imo ics \nandroid nexus phone makes iphone cheap store android \nsweet ice cream sandwich android ice cream sandwich officially ics \nraise hand android powered phone samsung \nsiri android device replace iphone \nnexus page live nexus android \nexcited android beam face unlock android ics \nlinkedin tools company page contact \nsamsung ice cream sandwich samsung \nintroducing galaxy nexus simple beautiful smart android ics samsung \nglad design android shows waiting \nthoughts android ics excited play features android \nregister galaxy nexus android \nwow webgl infinite bookcase \nics awesome wait face unlock android \ngotta pretty android chrome android \nnovember direct purchase samsung \nnexus wanna awesome \nevent time change android samsung \nios user ics awesome great job \nyeah great job ics \nliterally mind blown samsung \nmotorola verizon perfect \nopens door spanish entrepreneurs project \nintel ibm \nwindows phone mango update process ahead schedule 
mango \nback smartphone rich \nword works computer \nfree gen stores \nwatch codename data explorer ctp coming \nlunch today vslive \nwatch codename data explorer ctp coming month \ndetails search improvements windows start screen \nmango shows taste smartphone success mango \nawesome moving dev finally local \nstores offer free windows phone devices \nstores offer free windows phone devices neowin \nstore spend hard vslive \nfree west check \nhey parents free tools kids online live family \ncloud offers students free access improve tcn \nawesome bit \ndetails windows search improvements \nyeah taking metro yeah good android \nlove kids tech \nexplains improvements windows start screen search tech \nsearch idea search great \nbing king search search \npowerpoint users power create service bye solutions \nfuture information innovators nov info \ncurate personal history project greenwich month \nbeam research project \ngreat sql server session \nworks days \nballmer thinks computer scientist android tech agree \ngreat time \nwin server works fine vmware \nwow tech turns body touchscreen psfk \nlove love feeling building vslive bringing conference \nresearch shows awesome step closer bit kinect \nresearch shows science science fact cool sound \nresearch shows science science fact \nzune music canada music news \nkinect makes learning playful education \nmango \ncheck change world \ngood world wait \nwatching windows pretty impressive finally mac interesting battle store \nxbox share \ngod \nblog post cool tool mouse tools \nforget siri beating speech commands mango siri \ntests proves appsense enterprise capability users personalization database enterprise \nsoftware good points sap dynamics \ngood dev \nsecure anti \nimpressed creating images \nmac blown marketing \nyahoo sale years back bought glad deal year \nomnitouch impressive technology \ngood bing paying \nipads windows tablets study \nhome day great time \nmango shows taste smartphone success \npicture 
services cloud love \nwindows net dev \nnice talk community \nomg sharepoint working \ninnovation sad sad \noffice love genius \nlove gates foundation \ngood \nskype family amazing things \nabsolutely loving mouse \nfan cool video turn surface touchscreen \nwow android ics lots talk mango launch people public speaking \nupdated computer windows \nics android kill mango nokia \npeople names mail week \noutlook mac sucks hate \nxbox accounts hack reports \nupdate net \nwindows media center fail \neclipsed \nword upgrade doc doc word won open doc suck \nu.s. antitrust leaving business played dumb \nlync crash issue mac fixed \nbroke played engages racketeering calls respect \nnokia chief executive mole \nfrozen xbox live xbl accounts online games report hacked \ngave windows dev preview good waiting beta windows \npowerpoint fix powerpoint presentations \neclipsed guardian \nkind search \ngreat time family advertising \nwindows forget past antitrust issues \npaying make racketeering \nday talking talk tomorrow waiting \nreader compares albatross neck agree join \nlot word freeze minutes \nlol perfect simple hate windows phones \nmonths months lose \n"
  }
]