[
  {
    "path": ".gitignore",
    "content": "*.class\n*.log\n*.swp\n*.swo\n\n# sbt specific\ndist/*\ntarget/\nlib_managed/\nsrc_managed/\nproject/boot/\nproject/plugins/project/\n\n# Scala-IDE specific\n.scala_dependencies\n\n#java\n\n*.class\n\n# Package Files #\n*.jar\n*.war\n*.ear\n\n*~\n*\\#\n.history\n.idea\n"
  },
  {
    "path": ".travis.yml",
    "content": "sudo: false\nlanguage: scala\nscript:\n    - sbt +test"
  },
  {
    "path": "LICENSE.md",
    "content": "The MIT License\n===============\n\nCopyright (c) 2009 Anton Grigoryev\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE."
  },
  {
    "path": "README.md",
    "content": "# Conjecture [![Build Status](https://travis-ci.org/etsy/Conjecture.svg?branch=master)](https://travis-ci.org/etsy/Conjecture)\n\nConjecture is a framework for building machine learning models in Hadoop using the Scalding DSL.\nThe goal of this project is to enable the development of statistical models as viable components\nin a wide range of product settings. Applications include classification and categorization,\nrecommender systems, ranking, filtering, and regression (predicting real-valued numbers).\nConjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs.\nIntegration with Hadoop and scalding enable seamless handling of extremely large data volumes,\nand integration with established ETL processes. Predicted labels can either be consumed directly\nby the web stack using the dataset loader, or models can be deployed and consumed by live web code.\nCurrently, binary classification (assigning one of two possible labels to input data points)\nis the most mature component of the Conjecture package.\n\n# Tutorial\nThere are a few stages involved in training a machine learning model using Conjecture.\n\n## Create Training Data\nWe represent the training data as \"feature vectors\" which are just mappings of feature names to real values.\nIn this case we represent them as a java map of strings to doubles\n(although we have a class StringKeyedVector which provides convenience methods for feature vector construction).\nWe also need the true label of each instance, which we represent as 0 and 1\n(the mapping of these binary labels to e.g., \"male\" and \"female\" is up to the user).\nWe construct BinaryLabeledInstances, which are just wrappers for a feature vector and a label.\n\n    val bl = new BinaryLabeledInstance(0.0)\n    bl.addTerm(\"bias\", 1.0)\n    bl.addTerm(\"some_feature\", 0.5)\n\n## Training a Classifier\nClassifiers are essentially trained by presenting the labeled instances to them.  
There are several kinds \nof linear classifiers we implement, among them:\n\n* Logistic regression,\n* Perceptron,\n* MIRA (a large margin perceptron model),\n* Passive aggressive.\n\nThese models all have several options, such as learning rate, regularization parameters and so on.  We supply\nreasonable defaults for these parameters although they can be changed readily.  To train a linear model\nsimply call the update function with the labeled instance:\n\n    val p = new LogisticRegression()\n    p.update(bl)\n\nIn order to make this procedure tractable for large datasets, we provide Scalding wrappers for the training.\nThese operate by training several small models on mappers, then aggregating them into a final complete model\non the reducers.  This wrapper is called like so:\n\n    new BinaryModelTrainer(args)\n      .train(instances, 'instance, 'model)\n      .write(SequenceFile(\"model\"))\n      .map('model -> 'model){ x : UpdateableBinaryModel => new com.google.gson.Gson().toJson(x) }\n      .write(Tsv(\"model_json\"))\n\nThis code segment will train a model using a pipe called instances which has a field called instance which contains\nthe BinaryLabeledInstance objects.  It produces a pipe with a single field containing the completed model, which can\nthen be written to disk.\n\nThis class uses the command line args object from Scalding, in order to let you set some options on the command line.\nSome useful options are:\n\n| Argument                            | Possible values                               | Default            | Meaning                                          |\n|-------------------------------------|-----------------------------------------------|--------------------|--------------------------------------------------|\n| --model                             | mira, logistic_regression, passive_aggressive | passive_aggressive | The type of model to use.                        |\n| --iters                             | 1, 2, 3...            
                        | 1                  | The number of iterations of training to perform. |\n| --zero_class_prob, --one_class_prob | [0, 1]                                        | 1                  |                                                  |\n\nTo see all the command line options, see the BinaryModelTrainer class.\n\n## Evaluating a Classifier\nIt is important to get a sense of the performance you can expect out of your classifier on unseen data.\nTo do this, we recommend using cross validation.\nIn essence, your input set of instances is split up into testing and training portions (multiple different ways),\nthen a classifier is trained on each training portion, and evaluated (against the true labels which are present)\nusing the testing portion.\nThis is all wrapped up in a class called BinaryCrossValidator, which is used like so:\n\n    new BinaryCrossValidator(args, 5)\n      .crossValidate(instances, 'instance)\n      .write(Tsv(\"model_xval\"))\n\nThis class also takes the command line arguments, which it passes to a model trainer for each fold.\nThis allows the specification of options to the cross validated models on the command line.\nThe output contains statistics about the performance of the model as well as the confusion matrices\nfor each fold.\n\nA script is included which cross validates a logistic regression model on the iris dataset.\n\n\n\n"
  },
  {
    "path": "bin/demo.sh",
    "content": "#!/bin/bash\n\n# - make monolithic conjecture jar.\nsbt clean assembly\n# - make the instances.\njava -cp target/conjecture-assembly-*.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.IrisDataToMulticlassLabeledInstances --input_file data/iris.tsv --output_file iris_model/instances --local\n# - construct the classifier.\njava -cp target/conjecture-assembly-*.jar com.twitter.scalding.Tool com.etsy.conjecture.demo.LearnMulticlassClassifier --input iris_model/instances --output iris_model --class_names Iris-versicolor,Iris-virginica,Iris-setosa --iters 5 --folds 3 --local\n"
  },
  {
    "path": "bin/model_diff.py",
    "content": "import json\nimport sys\nimport math\n\nif __name__ == '__main__':\n  if len(sys.argv) != 3:\n    sys.exit(\"Usage: python \" +  sys.argv[0] + \" [model file] [model file]\")\n  a = json.load(open(sys.argv[1]))['param']['vector']\n  b = json.load(open(sys.argv[2]))['param']['vector']\n  features = set(a.keys()) | set(b.keys())\n  diff = []\n  for f in features:\n    dv = a.get(f, 0.0) - b.get(f, 0.0)\n    if math.fabs(dv) > 0.01:\n      diff.append((f, dv, a.get(f), b.get(f)))\n  diff.sort( key = lambda tup: -math.fabs(tup[1]))\n  for t in diff:\n    print t[0] + \"\\t\" + str(t[2]) + \"\\t\" + str(t[3]) + \"\\t(\" + str(t[1]) + \")\"\n\n"
  },
  {
    "path": "bin/model_param.py",
    "content": "import json\nimport sys\nimport math\n\nif __name__ == '__main__':\n  if len(sys.argv) != 2:\n    sys.exit(\"Usage: python \" +  sys.argv[0] + \" [model file]\")\n  vec = json.load(open(sys.argv[1]))['param']['vector'].items()\n  vec.sort(key = lambda tup: -math.fabs(tup[1]))\n  for v in vec:\n    print v[0] + \"\\t\" + str(v[1])\n"
  },
  {
    "path": "bin/prediction_inspection.py",
    "content": "import json\nimport sys\nfrom optparse import OptionParser\nfrom math import floor\n\ncolors = [\"FF0000\", \"FF1000\", \"FF2000\", \"FF3000\", \"FF4000\", \"FF5000\", \"FF6000\",\n          \"FF7000\", \"FF8000\", \"FF9000\", \"FFA000\", \"FFB000\", \"FFC000\", \"FFD000\",\n          \"FFE000\", \"FFF000\", \"FFFF00\", \"F0FF00\", \"E0FF00\", \"D0FF00\", \"C0FF00\",\n          \"B0FF00\", \"A0FF00\", \"90FF00\", \"80FF00\", \"70FF00\", \"60FF00\", \"50FF00\",\n          \"40FF00\", \"30FF00\", \"20FF00\", \"10FF00\"]\nbins = len(colors)\n\n\nparser = OptionParser(usage=\"\"\"builds a simple web page providing introspection on predictions made by conjecture models.\nDepends on the supporting data provided in the instance itself, currently only supporting binary\nclassification problems\nUsage: %prog [options]\n\"\"\")\n\nparser.add_option('-o', '--out', dest='out', default=False, action='store',\n                  help=\"[optional] destination of the generated html. Defaults to standard out\")\nparser.add_option('-f', '--file', dest='file', default=False, action='store',\n                  help=\"[optional] file storing input predictions and instances. Defaults to standard in\")\nparser.add_option('-l', '--label', dest='label', default=False, action='store',\n                  help=\"[optional] only keep examples with this label\")\nparser.add_option('-L', '--limit', dest='limit', default=1000, action='store',\n                  help=\"maximum number of prediction examples to display. 
Default: 1000\")\n\n\n(options, args) = parser.parse_args()\n\noutput = open(options.out, 'w') if (options.out) else sys.stdout\ninput = open(options.file, 'r') if(options.file) else sys.stdin\n\nlimit = int(options.limit)\n\noutput.write(\"<html>\")\nct = 0\n\nfor line in input:\n    parts = line.strip().split(\"\\t\")\n    content = json.loads(parts[0])\n    label = int(content['label']['value'])\n    pred = float(parts[2])\n\n    if (options.label and str(label) != options.label):\n        continue\n\n    error = min(1.0, abs(pred-label))\n    bin = bins - int(floor(error*bins)) - 1\n\n    color = \"#\" + colors[bin]\n    out = \"\"\n\n    support = json.loads(content['supporting_data'])\n\n    for key in support.keys():\n        out = out + \"<b>\" + key + \"</b></br>\" + support[key] + \"<br/>\"\n\n    if (len(out) < 10000 and ct < limit):\n        try:\n            output.write(\"<div style='background-color: \"  + color + \"; width: 700px;'>\");\n            output.write(\"%d (%f)<br/>\" %( label, pred))\n            output.write(out)\n            output.write(\"</div><p>\")\n            ct = ct + 1\n        except:\n            pass\n\n    if (ct >= limit):\n        break\n\noutput.write(\"</html>\");\noutput.flush()\noutput.close()\n"
  },
  {
    "path": "build.sbt",
    "content": "import sbt._\n\nname := \"conjecture\"\n\nversion := \"0.3.1-SNAPSHOT\"\n\norganization := \"com.etsy\"\n\nscalaVersion := \"2.11.11\"\ncrossScalaVersions := Seq(\"2.11.11\", \"2.12.4\")\n\nscalacOptions ++= Seq(\"-unchecked\", \"-deprecation\")\n\n//Because some of our (legal!) java code confuses scaladoc, we must skip it for 2.12\n//See: https://github.com/scala/bug/issues/10723\nscalacOptions in (Compile, doc) += {if(scalaBinaryVersion.value == \"2.12\") \"-no-java-comments\" else \"\"}\n\njavacOptions ++= Seq(\"-Xlint:none\", \"-source\", \"1.7\", \"-target\", \"1.7\")\n\ncompileOrder := CompileOrder.JavaThenScala\n\nresolvers ++= {\n  Seq(\n    \"Concurrent Maven Repo\" at \"http://conjars.org/repo\"\n  )\n}\n\nlibraryDependencies ++= Seq(\n  \"cascading\" % \"cascading-core\" % \"2.6.1\",\n  \"cascading\" % \"cascading-local\" % \"2.6.1\" exclude(\"com.google.guava\", \"guava\"),\n  \"cascading\" % \"cascading-hadoop\" % \"2.6.1\",\n  \"com.google.code.gson\" % \"gson\" % \"2.2.2\",\n  \"com.twitter\" %% \"algebird-core\" % \"0.13.0\" excludeAll ExclusionRule(organization=\"org.scala-lang\", name=\"scala-library\"),\n  \"com.twitter\" %% \"scalding-core\" % \"0.17.4\" excludeAll ExclusionRule(organization=\"org.scala-lang\", name=\"scala-library\"),\n  \"commons-lang\" % \"commons-lang\" % \"2.4\",\n  \"com.joestelmach\" % \"natty\" % \"0.7\",\n  \"io.spray\" %% \"spray-json\" % \"1.3.2\" excludeAll ExclusionRule(organization=\"org.scala-lang\", name=\"scala-library\"),\n  \"com.google.guava\" % \"guava\" % \"13.0.1\",\n  \"org.apache.commons\" % \"commons-math3\" % \"3.2\",\n  \"org.apache.hadoop\" % \"hadoop-common\" % \"2.5.0\" excludeAll(\n    ExclusionRule(organization=\"commons-daemon\", name=\"commons-daemon\"),\n    ExclusionRule(organization=\"com.google.guava\", name=\"guava\")\n    ),\n  \"org.apache.hadoop\" % \"hadoop-hdfs\" % \"2.5.0\" excludeAll(\n    ExclusionRule(organization=\"commons-daemon\", name=\"commons-daemon\"),\n    
ExclusionRule(organization=\"com.google.guava\", name=\"guava\")\n    ),\n  \"org.scala-lang\" % \"scala-reflect\" % scalaVersion.value,\n  \"net.sf.trove4j\" % \"trove4j\" % \"3.0.3\",\n  \"com.novocode\" % \"junit-interface\" % \"0.10\" % \"test\"\n)\n\nparallelExecution in Test := false\n\npublishArtifact in Test := false\n\nxerial.sbt.Sonatype.sonatypeSettings\n\npublishTo := {\n  if (System.getProperty(\"release\") != null) {\n    publishTo.value\n  } else {\n    val v = version.value\n    val archivaURL = \"http://ivy.etsycorp.com/repository\"\n    if (v.trim.endsWith(\"SNAPSHOT\")) {\n      Some(\"publish-snapshots\" at (archivaURL + \"/snapshots\"))\n    } else {\n      Some(\"publish-releases\"  at (archivaURL + \"/internal\"))\n    }\n  }\n}\n\npublishMavenStyle := true\n\noverridePublishBothSettings\n\npomIncludeRepository := { x => false }\n\npomExtra := <url>https://github.com/etsy/Conjecture</url>\n  <licenses>\n    <license>\n      <name>MIT License</name>\n      <url>http://opensource.org/licenses/MIT</url>\n      <distribution>repo</distribution>\n    </license>\n  </licenses>\n  <scm>\n    <url>git@github.com:etsy/Conjecture.git</url>\n    <connection>scm:git:git@github.com:etsy/Conjecture.git</connection>\n  </scm>\n  <developers>\n    <developer>\n      <id>jattenberg</id>\n      <name>Josh Attenberg</name>\n      <url>github.com/jattenberg</url>\n    </developer>\n    <developer>\n      <id>rjhall</id>\n      <name>Rob Hall</name>\n      <url>github.com/rjhall</url>\n    </developer>\n  </developers>\n\n\npomIncludeRepository := { _ => false }\n\n// Uncomment if you don't want to run all the tests before building assembly\n// test in assembly := {}\n\n// Janino includes a broken signature, and is not needed:\nassemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>\n  val excludes = Set(\"jsp-api-2.1-6.1.14.jar\", \"jsp-2.1-6.1.14.jar\",\n    \"jasper-compiler-5.5.12.jar\", \"janino-2.5.16.jar\")\n  cp filter { jar => 
excludes(jar.data.getName)}\n}\n\n// Some of these files have duplicates, let's ignore:\nassemblyMergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>\n{\n  case s if s.endsWith(\".class\") => MergeStrategy.last\n  case s if s.endsWith(\"project.clj\") => MergeStrategy.concat\n  case s if s.endsWith(\".html\") => MergeStrategy.last\n  case s if s.contains(\"servlet\") => MergeStrategy.last\n  case x => old(x)\n}\n}\n"
  },
  {
    "path": "clients/phplib/Conjecture/BinaryClassifier.php",
    "content": "<?php\n\nclass Conjecture_BinaryClassifier {\n    private $param = null;\n\n    function __construct($param_vec) {\n        $this->param = $param_vec;\n    }\n\n    public function dot($instance_vec) {\n        return $this->param->dot($instance_vec);\n    }\n\n    public function predict($instance_vec) {\n        $dot = $this->dot($instance_vec);\n        $exd = exp($dot);\n        return $exd / (1.0 + $exd);\n    }\n\n    public function getParams() {\n        return $this->param->getParams();\n    }\n\n    public function explain($instance_vec, $n = 10) {\n        $keys = array_intersect_key($this->param->getParams(), $instance_vec->getParams());\n        $keys = array_map('abs', $keys);\n        arsort($keys);\n        $res = array_slice($keys, 0, (count($keys) < $n ? count($keys) : $n));\n        foreach ($res as $k => $v) {\n            $res[$k] = \"$k(\" . round($this->param->getParam($k), 2) . \")\";\n        }\n        return implode(\", \", $res);\n    }\n}\n"
  },
  {
    "path": "clients/phplib/Conjecture/Config.php",
    "content": "<?php\n\ninterface Conjecture_Config {\n\n    public function useDummyConjectureModel();\n    public function getConjectureModelPath();\n    public function getMaxFileSize();\n}"
  },
  {
    "path": "clients/phplib/Conjecture/ConjectureException.php",
    "content": "<?php\n\nclass Conjecture_ConjectureException extends Exception{}"
  },
  {
    "path": "clients/phplib/Conjecture/Finder.php",
    "content": "<?php\n\nclass Conjecture_Finder {\n\n    private $config = null;\n\n    public function __construct(Conjecture_Config $config) {\n        $this->config = $config;\n    }\n\n\n    /**\n     * Loads a model local to a user's vm.\n     */\n    public function getLocalModel($local_file_path) {\n        $model = json_decode($this->parseFile($local_file_path));\n        $cv = new Conjecture_Vector($model->param->vector);\n        $binary_classifier = new Conjecture_BinaryClassifier($cv);\n        return $binary_classifier;\n    }\n\n    /**\n     * Decode model json at a given filepath.\n     */\n    private function parseFile($fp) {\n        if (filesize($fp) > $this->config->getMaxFileSize()) {\n            throw new Conjecture_ConjectureException(\"model too big: \" . $fp . \" is \" . filesize($fp) . \"bytes\");\n        }\n\n        $res = file($fp);\n        if ($res) {\n            $res = implode(\"\", $res);\n            $res = stripslashes($res);\n            return $res;\n        } else {\n            throw new Conjecture_ConjectureException(\"model file not found: $fp\");\n        }\n    }\n\n    private function getLatestModelJsonForProblem($file_name) {\n        if ($this->config->useDummyConjectureModel()) {\n            return self::getDummyModel();\n        }\n\n        $fp = $this->config->getConjectureModelPath() . \"/\" . 
$file_name;\n        return $this->parseFile($fp);\n    }\n\n    public function getLatestModelForProblem($file_name) {\n        $json = $this->getLatestModelJsonForProblem($file_name);\n        return json_decode($json);\n    }\n\n    public function getLatestBinaryClassificationVectorForProblem($file_name) {\n        $model = $this->getLatestModelForProblem($file_name);\n        return new Conjecture_Vector($model->param->vector);\n    }\n\n    public function getLatestBinaryClassifierForProblem($file_name) {\n        return new Conjecture_BinaryClassifier($this->getLatestBinaryClassificationVectorForProblem($file_name));\n    }\n\n    public function getOneVsAllClassifier($file_name) {\n        $model_array = $this->getLatestModelForProblem($file_name);\n        // initialize so an empty model does not leave the variable undefined\n        $category_params = [];\n\n        foreach ($model_array as $cat => $params) {\n            $category_params[$cat] = new Conjecture_BinaryClassifier(new Conjecture_Vector($params));\n        }\n\n        return new Conjecture_MulticlassOneVsAllClassifier($category_params);\n    }\n\n\n    public function getMulticlassClassifier($file_name) {\n        $model_array = $this->getLatestModelForProblem($file_name);\n        $model_type = $model_array->modelType;\n        $category_params = [];\n\n        foreach ($model_array->param as $cat => $category_model) {\n            $category_params[$cat] = new Conjecture_Vector($category_model->vector);\n        }\n\n        switch ($model_type) {\n            case \"multiclass_logistic_regression\":\n                return new Conjecture_MulticlassLogisticRegressionClassifier($category_params);\n            default:\n                return new Conjecture_MulticlassClassifier($category_params);\n        }\n    }\n\n\n    static function build(Conjecture_Config $config) {\n        return new Conjecture_Finder($config);\n    }\n\n    /**\n     * Creates and returns a JSON dummy model with no vectors\n     * used for development settings where \"real\" JSON models\n     * may not be present\n     */\n  
  private static function getDummyModel() {\n\n        $dummy_model = array(\"param\" => array(\n            \"vector\" => array(),\n            \"modelType\" => \"dummy\",\n            \"regularizationWeights\" => array(),\n            \"epoch\" => 1,\n            \"period\" => 1,\n            \"truncationUpdate\" => 0,\n            \"truncationThreshold\" => 0,\n            \"initialLearningRate\" => .1,\n            \"useExponentialLearningRate\" => false,\n            \"exponentialLearningRate\" => 1.0,\n            \"examplesPerEpoch\" => 1,\n        ));\n        return json_encode($dummy_model);\n    }\n}\n"
  },
  {
    "path": "clients/phplib/Conjecture/Instance.php",
    "content": "<?php\n\n  /**\n   * container class representing instances that are considered\n   * as input to predictive models in Conjecture. Has a rich set\n   * of adders and setters that mirrors the API of the java code,\n   * https://github.etsycorp.com/Engineering/Conjecture\n   */\nclass Conjecture_Instance extends Conjecture_Vector{\n\n    private static $NAMESPACE_SEP = \"___\";\n\n    private $id = null;\n    private $label = null;\n\n\n    public function __construct(array $vector = array()) {\n        parent::__construct($vector);\n    }\n\n    public function getId() {\n        return $this->id;\n    }\n\n    public function setId($id) {\n        $this->id = $id;\n        return $this;\n    }\n\n    public function put($key, $value = 1.0) {\n        $this->vector[$key] = $value;\n    }\n\n    public function update($key, $value = 1.0) {\n        if (array_key_exists($key, $this->vector)) {\n            $this->vector[$key] = $this->vector[$key] + $value;\n        } else {\n            $this->vector[$key] = $value;\n        }\n        return $this;\n    }\n\n    //some methods to mirror java maps that this class mirrors\n\n    public function putAll(array $vector) {\n        foreach ($vector as $key => $value) {\n            $this->put($key, $value);\n        }\n    }\n\n    public function containsKey($key) {\n        return array_key_exists($key, $this->vector);\n    }\n\n    public function containsValue($key) {\n        return in_array($key, $this->vector);\n    }\n\n\n    public function keySet() {\n        return array_keys($this->vector);\n    }\n\n    public function values() {\n        return array_values($this->vector);\n    }\n\n    public function size() {\n        return count($this->vector);\n    }\n\n    public function isEmpty() {\n        return empty($this->vector);\n    }\n\n    public function remove($key) {\n        unset($this->vector[$key]);\n    }\n\n    public function toString() {\n        return 
json_encode($this->vector);\n    }\n\n    public function addTerm($term, $featureWeight = 1.0, $namespace = \"\") {\n        $key = $namespace == \"\" ? $term : $namespace . self::$NAMESPACE_SEP . $term;\n        $this->update($key, $featureWeight);\n        return $this;\n    }\n\n    public function addTerms(array $terms, $featureWeight = 1.0, $namespace = \"\") {\n        foreach ($terms as $term) {\n            $this->addTerm($term, $featureWeight, $namespace);\n        }\n        return $this;\n    }\n\n    public function addNumericArray(array $numberValues, $namespace = \"\") {\n        for ($i = 0; $i < count($numberValues); $i++ ) {\n            $this->addTerm((string)$i, $numberValues[$i], $namespace);\n        }\n        return $this;\n    }\n\n}\n"
  },
  {
    "path": "clients/phplib/Conjecture/MulticlassClassifier.php",
    "content": "<?php\n\n\nclass Conjecture_MulticlassClassifier {\n\n    private $param = null;\n\n    /**\n     * each param is a Conjecture_Vector\n     */\n    function __construct($param) {\n        $this->param = $param;\n    }\n\n    public function predict($instance_vec) {\n        $category_results = [];\n        $total = 0;\n\n        foreach ($this->param as $category => $classifier) {\n            $prediction = $classifier->dot($instance_vec);\n            $category_results[$category] = $prediction;\n            $total += $prediction;\n        }\n\n        return array_map( function($prob) use ($total) {\n                return $prob / $total;\n        }, $category_results);\n    }\n\n    public function getParams() {\n        return $this->param;\n    }\n\n    public function explain($instance_vec, $n = 10) {\n        $explains = [];\n\n        foreach ($this->param as $category => $category_model) {\n            $explains[$category] = $this->categoryExplain($instance_vec, $category_model, $n);\n        }\n\n        return implode(\", \", $explains);\n    }\n\n\n    private function categoryExplain($instance_vec, $category_model, $n = 10) {\n\n        $keys = array_intersect_key($category_model->getParams(), $instance_vec->getParams());\n        $keys = array_map('abs', $keys);\n        arsort($keys);\n        $res = array_slice($keys, 0, (count($keys) < $n ? count($keys) : $n));\n\n        foreach ($res as $k => $v) {\n            $res[$k] = \"$k(\" . round($category_model->getParams($k), 2) . \")\";\n        }\n\n        return implode(\", \", $res);\n    }\n}"
  },
  {
    "path": "clients/phplib/Conjecture/MulticlassLogisticRegressionClassifier.php",
    "content": "<?php\n\n\nclass Conjecture_MulticlassLogisticRegressionClassifier extends Conjecture_MulticlassClassifier {\n\n    private $param = null;\n\n    public function predict($instance_vec) {\n        $category_results = [];\n        $total = 0;\n\n        foreach ($this->param as $category => $classifier) {\n            $prediction = exp($classifier->dot($instance_vec));\n            $category_results[$category] = $prediction;\n            $total += $prediction;\n        }\n\n        return array_map( function($prob) use ($total) {\n                return $prob / $total;\n        }, $category_results);\n    }\n\n}"
  },
  {
    "path": "clients/phplib/Conjecture/MulticlassOneVsAllClassifier.php",
    "content": "<?php\n\n\nclass Conjecture_MulticlassOneVsAllClassifier {\n\n    private $param = null;\n\n    /**\n     * $param is an array that maps category to a Conjecture_BinaryClassifier\n     * that represents that class\n     */\n    function __construct($param) {\n        $this->param = $param;\n    }\n\n    public function predict($instance_vec) {\n        $category_results = [];\n        $total = 0;\n\n        foreach ($this->param as $category => $classifier) {\n            $prediction = $classifier->predict($instance_vec);\n            $category_results[$category] = $prediction;\n            $total += $prediction;\n        }\n\n        return array_map( function($prob) use ($total) {\n                return $prob / $total;\n        }, $category_results);\n    }\n\n    public function getParams() {\n        $out_params = [];\n\n        foreach ($this->param as $category => $classifier) {\n            $out_params[$category] = $classifier->getParams();\n        }\n\n        return $out_params;\n    }\n\n    public function explain($instance_vec, $n = 10) {\n        $explains = [];\n\n        foreach ($this->param as $category => $classifier) {\n            $explains[$category] = $classifier->explain($instance_vec, $n);\n        }\n\n        return implode(\", \", $explains);\n    }\n\n}"
  },
  {
    "path": "clients/phplib/Conjecture/Text.php",
    "content": "<?php\n\n// This is bascially an exact replica of com.etsy.conjecture.text.Text\nclass Conjecture_Text {\n\n    private $input = null;\n\n    static function build($text) {\n        return new Conjecture_Text($text);\n    }\n\n    function __construct($text) {\n        $this->input = $text;\n    }\n\n    function toString() {\n        return $this->input;\n    }\n\n    function replaceNumbers($replacement = \"_num_\") {\n        $text = preg_replace(\"/[0-9]+/\", $replacement, $this->input);\n        return new Conjecture_Text(preg_replace(\"/\".$replacement.\"\\\\s+\".$replacement.\"/\", $replacement, $text));\n    }\n\n    function replaceHTMLEscapes($replacement = \" \") {\n        return new Conjecture_Text(preg_replace(\"/&[^;]+;/\", $replacement, $this->input));\n    }\n\n    function removeHTMLTags() {\n        return new Conjecture_Text(preg_replace(\"/<.*?>/\", \" \", $this->input));\n    }\n\n    function replaceHTMLTags($replacement = \" \") {\n        return new Conjecture_Text(preg_replace(\"/<[^>]+>/\", \" \", $this->input));\n    }\n\n    function replaceNonAlphaNumeric($replacement = \" \") {\n        return new Conjecture_Text(preg_replace(\"/[^a-zA-Z0-9\\\\.\\\\s\\\\-]+/\", $replacement, $this->input));\n    }\n\n    function replaceNonAlphaNumericUnderscore($replacement = \" \") {\n        return new Conjecture_Text(preg_replace(\"/[^a-zA-Z0-9\\\\.\\\\s\\\\-_]+/\", $replacement, $this->input));\n    }\n\n    function replaceNonAlpha($replacement = \" \") {\n        return new Conjecture_Text(preg_replace(\"/[^a-zA-Z]+/\", $replacement, $this->input));\n    }\n\n    function collapseHyphens() {\n        return new Conjecture_Text(preg_replace(\"/--+/\", \"--\", $this->input));\n    }\n\n    function collapseUnderscores() {\n        return new Conjecture_Text(preg_replace(\"/__+/\", \"__\", $this->input));\n    }\n\n    function collapsePeriods() {\n        return new Conjecture_Text(preg_replace(\"/\\.\\.+/\", \"..\", 
$this->input));\n    }\n\n    function stripPunctuation() {\n        // PCRE patterns need delimiters; without them preg_replace fails\n        $temp = preg_replace(\"/^[^A-Za-z0-9]/\", \"\", $this->input);\n        return new Conjecture_Text(preg_replace(\"/[^A-Za-z0-9]$/\", \"\", $temp));\n    }\n\n    // compact any white space\n    function collapse() {\n        return new Conjecture_Text(preg_replace(\"/\\\\s+/\", \" \", $this->input));\n    }\n\n    // remove any whitespace from the right of a string\n    function rstrip() {\n        return new Conjecture_Text(preg_replace(\"/\\\\s+$/\", \"\", $this->input));\n    }\n\n    // remove any whitespace from the left of a string\n    function lstrip() {\n        return new Conjecture_Text(preg_replace(\"/^\\\\s+/\", \"\", $this->input));\n    }\n\n    // remove any leading or trailing whitespace\n    function strip() {\n        return $this->rstrip()->lstrip();\n    }\n\n    // clean up any whitespace\n    function wsclean() {\n        return $this->strip()->collapse();\n    }\n\n    // remove any unprintable non-ASCII characters\n    function removeUnprintables() {\n        return new Conjecture_Text(preg_replace(\"/[^\\\\x20-\\\\x7E]/\", \"\", $this->input));\n    }\n\n    function collapseWhitespaceAndPunc() {\n        $text = $this->collapse()->collapseHyphens();\n        return new Conjecture_Text(preg_replace(\"/\\\\.\\\\.+/\", \".\", $text->toString()));\n    }\n\n    function toLowerCase() {\n        return new Conjecture_Text(strtolower($this->input));\n    }\n\n    function standardTextFilter() {\n        return $this->removeHTMLTags()\n                    ->replaceHTMLEscapes()\n                    ->replaceNumbers()\n                    ->replaceNonAlphaNumericUnderscore()\n                    ->collapseHyphens()\n                    ->collapseUnderscores()\n                    ->wsclean();\n    }\n\n    function toArrayFromShingles($n) {\n        $shingles = array();\n\n        $chars = str_split($this->input);\n        for ($i = 0; $i < count($chars) - $n + 1; $i++) {\n          
  $shingle = array_slice($chars, $i, $n);\n            $shingles[] = implode(\"\", $shingle);\n        }\n\n        return $shingles;\n    }\n\n    function toSequenceFromShingles($n) {\n        return new Conjecture_TextSequence($this->toArrayFromShingles($n));\n    }\n}\n"
  },
  {
    "path": "clients/phplib/Conjecture/TextSequence.php",
    "content": "<?php\n\n// This is bascially an exact replica of com.etsy.conjecture.text.TextSequence\nclass Conjecture_TextSequence {\n\n    private $tokens = null;\n\n    function __construct(array $tokens) {\n        $this->tokens = $tokens;\n    }\n\n    /**\n     * concatenates two TextSequences into an additional text sequence\n     */\n    function concat($other) {\n        return new Conjecture_TextSequence(array_merge($this->tokens, $other->tokens));\n    }\n\n    function mkString($glue = \" \") {\n        return implode($glue, $this->tokens);\n    }\n\n    function toString() {\n        return $this->mkString(\" \");\n    }\n\n    function getTokens() {\n        return $this->tokens;\n    }\n\n    function filterBlank() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return $x !== \"\";\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function filterStopwords() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return !in_array($x, self::$stopwordList);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function stopwords() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return in_array($x, self::$stopwordList);\n                                                        }\n                                                       )\n           
                               );\n    }\n\n\n    function filterBadwords() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return !in_array($x, self::$badwordList);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function badwords() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return in_array($x, self::$badwordList);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function filterAllCaps() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return !preg_match('/^[A-Z]+$/', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function AllCaps() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return preg_match('/^[A-Z]+$/', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function filterCapitalized() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        
function($x) {\n                                                            return !preg_match('/^[A-Z][^A-Z]+$/', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function capitalized() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return preg_match('/^[A-Z][^A-Z]+$/', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function filterLowercase() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return !preg_match('/^[a-z]+$/', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function allLowercase() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return preg_match('/^[a-z]+$/', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function filterURLs() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return !preg_match('#^https?://.+#', $x);\n                                                        }\n                            
                           )\n                                          );\n    }\n\n    function allURLs() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return preg_match('#^https?://.+#', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function filterListings() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return !preg_match('#^https?://.+etsy.+/listing/[0-9]+.*#', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function allListings() {\n        return new Conjecture_TextSequence(array_filter($this->tokens,\n                                                        function($x) {\n                                                            return preg_match('#^https?://.+etsy.+/listing/[0-9]+.*#', $x);\n                                                        }\n                                                       )\n                                          );\n    }\n\n    function size() {\n        return count($this->tokens);\n    }\n\n    function stopWordCount() {\n        return $this->stopwords()->size();\n    }\n\n    function stopWordFraq($bins = 10.0) {\n        return floor(round($bins*$this->stopWordCount()/$this->size())/$bins);\n    }\n\n    function badWordCount() {\n        return $this->badwords()->size();\n    }\n\n    function badWordFraq($bins = 10.0) {\n        return floor(round($bins*$this->badWordCount()/$this->size())/$bins);\n    }\n\n    function capsCount() 
{\n        return $this->allCaps()->size();\n    }\n\n    function capFraq($bins = 10.0) {\n        return floor(round($bins*$this->capsCount()/$this->size())/$bins);\n    }\n\n    function urlCount() {\n        return $this->allURLs()->size();\n    }\n\n    function urlFraq($bins = 10.0) {\n        return floor(round($bins*$this->urlCount()/$this->size())/$bins);\n    }\n\n    function listingsCount() {\n        return $this->allListings()->size();\n    }\n\n    function listingsFraq($bins = 10.0) {\n        return floor(round($bins*$this->listingsCount()/$this->size())/$bins);\n    }\n\n    function sizeBin() {\n        return floor(log($this->size()));\n    }\n\n    // filtering methods (TODO)\n\n    function replaceNumbers($replacement = \"_num_\") {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($replacement) {\n                                                   $text = preg_replace(\"/[0-9]+/\", $replacement, $x);\n                                                   return preg_replace(\"/\".$replacement.\"\\\\s+\".$replacement.\"/\", $replacement, $text);\n                                               }, $this->tokens));\n    }\n\n\n    function replaceHTMLEscapes($replacement = \" \") {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($replacement) {\n                                                   return preg_replace(\"/&[^;]+;/\", $replacement, $x);\n                                               }, $this->tokens));\n    }\n\n    function removeHTMLTags() {\n        return $this->replaceHTMLTags(\" \");\n    }\n\n    function replaceHTMLTags($replacement = \" \") {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($replacement) {\n                                                   return preg_replace(\"/<[^>]+>/\", $replacement, $x);\n          
                                     }, $this->tokens));\n    }\n\n    function replaceNonAlphaNumeric($replacement = \" \") {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($replacement) {\n                                                   return preg_replace(\"/[^a-zA-Z0-9\\\\.\\\\s\\\\-]+/\", $replacement, $x);\n                                               }, $this->tokens));\n    }\n\n    function replaceNonAlphaNumericUnderscore($replacement = \" \") {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($replacement) {\n                                                   return preg_replace(\"/[^a-zA-Z0-9\\\\.\\\\s\\\\-_]+/\", $replacement, $x);\n                                               }, $this->tokens));\n    }\n\n    function replaceNonAlpha($replacement = \" \") {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($replacement) {\n                                                   return preg_replace(\"/[^a-zA-Z\\\\.\\\\s\\\\-_]+/\", $replacement, $x);\n                                               }, $this->tokens));\n    }\n\n    function collapseHyphens() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/--+/\", \"--\", $x);\n                                               }, $this->tokens));\n    }\n\n    function collapseUnderscores() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/__+/\", \"__\", $x);\n                                               }, $this->tokens));\n    }\n\n    function collapsePeriods() {\n        return new 
Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/\\.\\.+/\", \"..\", $x);\n                                               }, $this->tokens));\n    }\n\n    // remove one leading and one trailing non-alphanumeric character from each token\n    function stripPunctuation() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   $temp = preg_replace(\"/^[^A-Za-z0-9]/\", \"\", $x);\n                                                   return preg_replace(\"/[^A-Za-z0-9]$/\", \"\", $temp);\n                                               }, $this->tokens));\n    }\n\n    // compact any white space\n    function collapse() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/\\\\s+/\", \" \", $x);\n                                               }, $this->tokens));\n    }\n\n    // remove any whitespace from the right of each token\n    function rstrip() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/\\\\s+$/\", \"\", $x);\n                                               }, $this->tokens));\n    }\n\n    // remove any whitespace from the left of each token\n    function lstrip() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/^\\\\s+/\", \"\", $x);\n                                               }, $this->tokens));\n    }\n\n    // remove any leading or trailing whitespace\n    function strip() {\n        return $this->rstrip()->lstrip();\n    }\n\n    // clean up any whitespace\n    function wsclean() {\n        return $this->strip()->collapse();\n    }\n\n    // remove any unprintable non-ASCII characters\n    
function removeUnprintables() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   return preg_replace(\"/[^\\\\x20-\\\\x7E]/\", \"\", $x);\n                                               }, $this->tokens));\n\n    }\n\n    function collapseWhitespaceAndPunc() {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) {\n                                                   $ws = preg_replace(\"/\\\\s+/\", \" \", $x);\n                                                   $dh = preg_replace(\"/[\\\\-]+/\", \"-\", $ws);\n                                                   return preg_replace(\"/[\\\\.]+/\", \".\", $dh);\n                                               }, $this->tokens));\n    }\n\n    function prependNameSpace($namespace) {\n        return new Conjecture_TextSequence(array_map(\n                                               function($x) use ($namespace) {\n                                                   return $namespace . 
$x;\n                                               }, $this->tokens));\n    }\n\n    function toList() {\n        return $this->tokens;\n    }\n\n    function shingles($n, $whitespace = \"_\") {\n        $str = implode($whitespace, $this->tokens);\n        // split into single characters; explode() rejects an empty delimiter\n        $arr = str_split($str);\n\n        $shingles = array();\n        for ($i = 0; $i < count($arr) - $n + 1; $i++) {\n            $shingles[] = implode('', array_slice($arr, $i, $n));\n        }\n\n        return new Conjecture_TextSequence($shingles);\n    }\n\n    function ngrams($n, $glue = \" \") {\n        $grams = array();\n        for ($i = 0; $i < count($this->tokens) - $n+1; $i++) {\n            $grams[] = implode($glue, array_slice($this->tokens, $i, $n));\n        }\n\n        return new Conjecture_TextSequence($grams);\n    }\n\n    function unigramsAndBigrams($glue = \" \") {\n      return $this->ngrams(1)->concat($this->ngrams(2, $glue));\n    }\n\n    function toInstance() {\n        $instance = new Conjecture_Instance();\n\n        foreach ($this->tokens as $token) {\n            $instance->addTerm($token);\n        }\n\n        return $instance;\n    }\n\n    static $stopwordList = 
array(\"a\",\"as\",\"able\",\"about\",\"above\",\"according\",\"accordingly\",\"across\",\"actually\",\"after\",\"afterwards\",\"again\",\"against\",\"aint\",\"all\",\"allow\",\"allows\",\"almost\",\"alone\",\"along\",\"already\",\"also\",\"although\",\"always\",\"am\",\"among\",\"amongst\",\"amoungst\",\"amount\",\"an\",\"and\",\"another\",\"any\",\"anybody\",\"anyhow\",\"anyone\",\"anything\",\"anyway\",\"anyways\",\"anywhere\",\"apart\",\"appear\",\"appreciate\",\"appropriate\",\"are\",\"arent\",\"around\",\"as\",\"aside\",\"ask\",\"asking\",\"associated\",\"at\",\"available\",\"away\",\"awfully\",\"b\",\"back\",\"be\",\"became\",\"because\",\"become\",\"becomes\",\"becoming\",\"been\",\"before\",\"beforehand\",\"behind\",\"being\",\"believe\",\"below\",\"beside\",\"besides\",\"best\",\"better\",\"between\",\"beyond\",\"bill\",\"both\",\"bottom\",\"brief\",\"but\",\"by\",\"c\",\"cmon\",\"cs\",\"call\",\"came\",\"can\",\"cant\",\"cannot\",\"cant\",\"cause\",\"causes\",\"certain\",\"certainly\",\"changes\",\"clearly\",\"co\",\"com\",\"come\",\"comes\",\"con\",\"concerning\",\"consequently\",\"consider\",\"considering\",\"contain\",\"containing\",\"contains\",\"corresponding\",\"could\",\"couldnt\",\"couldnt\",\"course\",\"cry\",\"currently\",\"d\",\"de\",\"definitely\",\"describe\",\"described\",\"despite\",\"detail\",\"did\",\"didnt\",\"different\",\"do\",\"does\",\"doesnt\",\"doing\",\"dont\",\"done\",\"down\",\"downwards\",\"due\",\"during\",\"e\",\"each\",\"edu\",\"eg\",\"eight\",\"either\",\"eleven\",\"else\",\"elsewhere\",\"empty\",\"enough\",\"entirely\",\"especially\",\"et\",\"etc\",\"even\",\"ever\",\"every\",\"everybody\",\"everyone\",\"everything\",\"everywhere\",\"ex\",\"exactly\",\"example\",\"except\",\"f\",\"far\",\"few\",\"fifteen\",\"fifth\",\"fify\",\"fill\",\"find\",\"fire\",\"first\",\"five\",\"followed\",\"following\",\"follows\",\"for\",\"former\",\"formerly\",\"forth\",\"forty\",\"found\",\"four\",\"from\",\"front\",\"full\",\"further\",\"fur
thermore\",\"g\",\"get\",\"gets\",\"getting\",\"give\",\"given\",\"gives\",\"go\",\"goes\",\"going\",\"gone\",\"got\",\"gotten\",\"greetings\",\"h\",\"had\",\"hadnt\",\"happens\",\"hardly\",\"has\",\"hasnt\",\"hasnt\",\"have\",\"havent\",\"having\",\"he\",\"hes\",\"hello\",\"help\",\"hence\",\"her\",\"here\",\"heres\",\"hereafter\",\"hereby\",\"herein\",\"hereupon\",\"hers\",\"herself\",\"hi\",\"him\",\"himself\",\"his\",\"hither\",\"hopefully\",\"how\",\"howbeit\",\"however\",\"hundred\",\"i\",\"id\",\"ill\",\"im\",\"ive\",\"ie\",\"if\",\"ignored\",\"immediate\",\"in\",\"inasmuch\",\"inc\",\"indeed\",\"indicate\",\"indicated\",\"indicates\",\"inner\",\"insofar\",\"instead\",\"interest\",\"into\",\"inward\",\"is\",\"isnt\",\"it\",\"itd\",\"itll\",\"its\",\"its\",\"itself\",\"j\",\"just\",\"k\",\"keep\",\"keeps\",\"kept\",\"know\",\"known\",\"knows\",\"l\",\"last\",\"lately\",\"later\",\"latter\",\"latterly\",\"least\",\"less\",\"lest\",\"let\",\"lets\",\"like\",\"liked\",\"likely\",\"little\",\"look\",\"looking\",\"looks\",\"ltd\",\"m\",\"made\",\"mainly\",\"many\",\"may\",\"maybe\",\"me\",\"mean\",\"meanwhile\",\"merely\",\"might\",\"mill\",\"mine\",\"more\",\"moreover\",\"most\",\"mostly\",\"move\",\"much\",\"must\",\"my\",\"myself\",\"n\",\"name\",\"namely\",\"nd\",\"near\",\"nearly\",\"necessary\",\"need\",\"needs\",\"neither\",\"never\",\"nevertheless\",\"new\",\"next\",\"nine\",\"no\",\"nobody\",\"non\",\"none\",\"noone\",\"nor\",\"normally\",\"not\",\"nothing\",\"novel\",\"now\",\"nowhere\",\"o\",\"obviously\",\"of\",\"off\",\"often\",\"oh\",\"ok\",\"okay\",\"old\",\"on\",\"once\",\"one\",\"ones\",\"only\",\"onto\",\"or\",\"other\",\"others\",\"otherwise\",\"ought\",\"our\",\"ours\",\"ourselves\",\"out\",\"outside\",\"over\",\"overall\",\"own\",\"p\",\"part\",\"particular\",\"particularly\",\"per\",\"perhaps\",\"placed\",\"please\",\"plus\",\"possible\",\"presumably\",\"probably\",\"provides\",\"put\",\"q\",\"que\",\"quite\",\"qv\",\"r\",\"rather\",\"rd\",\"r
e\",\"really\",\"reasonably\",\"regarding\",\"regardless\",\"regards\",\"relatively\",\"respectively\",\"right\",\"s\",\"said\",\"same\",\"saw\",\"say\",\"saying\",\"says\",\"second\",\"secondly\",\"see\",\"seeing\",\"seem\",\"seemed\",\"seeming\",\"seems\",\"seen\",\"self\",\"selves\",\"sensible\",\"sent\",\"serious\",\"seriously\",\"seven\",\"several\",\"shall\",\"she\",\"should\",\"shouldnt\",\"show\",\"side\",\"since\",\"sincere\",\"six\",\"sixty\",\"so\",\"some\",\"somebody\",\"somehow\",\"someone\",\"something\",\"sometime\",\"sometimes\",\"somewhat\",\"somewhere\",\"soon\",\"sorry\",\"specified\",\"specify\",\"specifying\",\"still\",\"sub\",\"such\",\"sup\",\"sure\",\"system\",\"t\",\"ts\",\"take\",\"taken\",\"tell\",\"ten\",\"tends\",\"th\",\"than\",\"thank\",\"thanks\",\"thanx\",\"that\",\"thats\",\"thats\",\"the\",\"thea\",\"their\",\"theirs\",\"them\",\"themselves\",\"then\",\"thence\",\"there\",\"theres\",\"thereafter\",\"thereby\",\"therefore\",\"therein\",\"theres\",\"thereupon\",\"these\",\"they\",\"theyd\",\"theyll\",\"theyre\",\"theyve\",\"thickv\",\"thin\",\"think\",\"third\",\"this\",\"thorough\",\"thoroughly\",\"those\",\"though\",\"three\",\"through\",\"throughout\",\"thru\",\"thus\",\"to\",\"together\",\"too\",\"took\",\"top\",\"toward\",\"towards\",\"tried\",\"tries\",\"truly\",\"try\",\"trying\",\"twelve\",\"twenty\",\"twice\",\"two\",\"u\",\"un\",\"under\",\"unfortunately\",\"unless\",\"unlikely\",\"until\",\"unto\",\"up\",\"re\",\"werent\",\"what\",\"whats\",\"whatever\",\"when\",\"whence\",\"whenever\",\"where\",\"wheres\",\"whereafter\",\"whereas\",\"whereby\",\"wherein\",\"whereupon\",\"wherever\",\"whether\",\"which\",\"while\",\"whither\",\"who\",\"whos\",\"whoever\",\"whole\",\"whom\",\"whose\",\"why\",\"will\",\"willing\",\"wish\",\"with\",\"within\",\"without\",\"wont\",\"wonder\",\"would\",\"wouldnt\",\"x\",\"y\",\"yes\",\"yet\",\"you\",\"youd\",\"youll\",\"youre\",\"youve\",\"your\",\"yours\",\"yourself\",\"yourselves\",\"z\",\"ze
ro\");\n\n    static $badwordList = array(\"ahole\", \"arse\", \"ass\", \"asshole\", \"asswipe\", \"bastard\", \"batty\", \"bender\", \"bitch\", \"bloody\", \"bollocks\", \"boner\", \"bumboy\", \"bugger\", \"coon\", \"cock\", \"cocksucker\", \"cracker\", \"crap\", \"cumsucker\", \"cunt\", \"damn\", \"dick\", \"dildo\", \"douchebag\", \"faggot\", \"fistfucker\", \"fuck\", \"fucker\", \"fuckwit\", \"fucktwat\", \"gaylord\", \"ho\", \"honky\", \"jackass\", \"jism\", \"joey\", \"knobcheese\", \"minge\", \"minger\", \"mong\", \"motherfucker\", \"munter\", \"pickle\", \"piss\", \"piss\", \"prick\", \"pussy\", \"rimmer\", \"schmuck\", \"shit\", \"slut\", \"spakka\", \"spaz\", \"skank\", \"taint\", \"tit\", \"tool\", \"tosser\", \"twat\", \"whore\", \"wanker\");\n}\n"
  },
  {
    "path": "clients/phplib/Conjecture/Vector.php",
    "content": "<?php \n\nclass Conjecture_Vector {\n\n    protected $vector = null;\n\n    function __construct($array = array()) {\n        $this->vector = (array)$array;\n    }\n\n    public function dot($rhs) {\n        $keys = array_intersect_key($this->vector, $rhs->vector);\n        $res = 0.0;\n\n        foreach ($keys as $key => $val) {\n            $res += $this->vector[$key] * $rhs->vector[$key];\n        }\n\n        return $res;\n    }\n\n    public function getParams() {\n        return $this->vector;\n    }\n\n    public function getParam($k) {\n        if (array_key_exists($k, $this->vector)) {\n            return $this->vector[$k];\n        } else {\n            return 0.0;\n        }\n    }\n}\n"
  },
  {
    "path": "data/iris.tsv",
    "content": "7.0\t3.2\t4.7\t1.4\tIris-versicolor\n5.6\t3.0\t4.1\t1.3\tIris-versicolor\n5.4\t3.4\t1.7\t0.2\tIris-setosa\n5.0\t3.0\t1.6\t0.2\tIris-setosa\n6.9\t3.2\t5.7\t2.3\tIris-virginica\n4.9\t3.0\t1.4\t0.2\tIris-setosa\n5.0\t2.3\t3.3\t1.0\tIris-versicolor\n5.2\t2.7\t3.9\t1.4\tIris-versicolor\n5.1\t3.8\t1.9\t0.4\tIris-setosa\n7.2\t3.6\t6.1\t2.5\tIris-virginica\n4.8\t3.4\t1.6\t0.2\tIris-setosa\n6.0\t2.9\t4.5\t1.5\tIris-versicolor\n5.8\t2.6\t4.0\t1.2\tIris-versicolor\n5.7\t2.6\t3.5\t1.0\tIris-versicolor\n5.9\t3.0\t4.2\t1.5\tIris-versicolor\n5.5\t2.3\t4.0\t1.3\tIris-versicolor\n4.6\t3.2\t1.4\t0.2\tIris-setosa\n6.3\t2.8\t5.1\t1.5\tIris-virginica\n6.3\t3.3\t6.0\t2.5\tIris-virginica\n6.9\t3.1\t4.9\t1.5\tIris-versicolor\n6.7\t3.3\t5.7\t2.5\tIris-virginica\n5.1\t3.7\t1.5\t0.4\tIris-setosa\n6.7\t3.3\t5.7\t2.1\tIris-virginica\n5.8\t2.8\t5.1\t2.4\tIris-virginica\n6.0\t3.4\t4.5\t1.6\tIris-versicolor\n5.4\t3.0\t4.5\t1.5\tIris-versicolor\n5.5\t3.5\t1.3\t0.2\tIris-setosa\n5.0\t3.3\t1.4\t0.2\tIris-setosa\n5.7\t4.4\t1.5\t0.4\tIris-setosa\n5.3\t3.7\t1.5\t0.2\tIris-setosa\n5.2\t3.5\t1.5\t0.2\tIris-setosa\n6.5\t2.8\t4.6\t1.5\tIris-versicolor\n7.4\t2.8\t6.1\t1.9\tIris-virginica\n4.9\t3.1\t1.5\t0.2\tIris-setosa\n5.0\t3.2\t1.2\t0.2\tIris-setosa\n7.7\t2.8\t6.7\t2.0\tIris-virginica\n4.8\t3.4\t1.9\t0.2\tIris-setosa\n6.5\t3.0\t5.2\t2.0\tIris-virginica\n6.3\t2.5\t5.0\t1.9\tIris-virginica\n6.4\t3.1\t5.5\t1.8\tIris-virginica\n5.8\t2.7\t5.1\t1.9\tIris-virginica\n7.1\t3.0\t5.9\t2.1\tIris-virginica\n5.7\t2.5\t5.0\t2.0\tIris-virginica\n6.4\t2.8\t5.6\t2.2\tIris-virginica\n6.4\t3.2\t4.5\t1.5\tIris-versicolor\n6.1\t2.6\t5.6\t1.4\tIris-virginica\n4.8\t3.0\t1.4\t0.1\tIris-setosa\n5.6\t2.8\t4.9\t2.0\tIris-virginica\n6.0\t2.2\t5.0\t1.5\tIris-virginica\n5.0\t3.5\t1.3\t0.3\tIris-setosa\n5.5\t2.6\t4.4\t1.2\tIris-versicolor\n5.0\t3.6\t1.4\t0.2\tIris-setosa\n5.0\t3.4\t1.6\t0.4\tIris-setosa\n6.3\t2.7\t4.9\t1.8\tIris-virginica\n6.7\t3.1\t4.7\t1.5\tIris-versicolor\n6.3\t2.5\t4.9\t1.5\tIris-versicolor\n4.5\t2.3
\t1.3\t0.3\tIris-setosa\n6.8\t3.2\t5.9\t2.3\tIris-virginica\n7.2\t3.2\t6.0\t1.8\tIris-virginica\n5.5\t2.4\t3.8\t1.1\tIris-versicolor\n5.8\t2.7\t5.1\t1.9\tIris-virginica\n6.1\t2.8\t4.0\t1.3\tIris-versicolor\n6.3\t2.9\t5.6\t1.8\tIris-virginica\n6.1\t2.9\t4.7\t1.4\tIris-versicolor\n6.3\t2.3\t4.4\t1.3\tIris-versicolor\n4.6\t3.4\t1.4\t0.3\tIris-setosa\n5.5\t4.2\t1.4\t0.2\tIris-setosa\n6.5\t3.0\t5.5\t1.8\tIris-virginica\n6.7\t3.1\t4.4\t1.4\tIris-versicolor\n6.6\t2.9\t4.6\t1.3\tIris-versicolor\n5.9\t3.0\t5.1\t1.8\tIris-virginica\n6.4\t2.7\t5.3\t1.9\tIris-virginica\n5.6\t2.5\t3.9\t1.1\tIris-versicolor\n6.4\t3.2\t5.3\t2.3\tIris-virginica\n5.7\t3.8\t1.7\t0.3\tIris-setosa\n7.2\t3.0\t5.8\t1.6\tIris-virginica\n6.7\t3.0\t5.2\t2.3\tIris-virginica\n4.6\t3.1\t1.5\t0.2\tIris-setosa\n5.6\t2.9\t3.6\t1.3\tIris-versicolor\n6.4\t2.9\t4.3\t1.3\tIris-versicolor\n5.1\t3.5\t1.4\t0.2\tIris-setosa\n7.6\t3.0\t6.6\t2.1\tIris-virginica\n5.7\t2.8\t4.1\t1.3\tIris-versicolor\n5.6\t2.7\t4.2\t1.3\tIris-versicolor\n5.7\t2.9\t4.2\t1.3\tIris-versicolor\n5.4\t3.7\t1.5\t0.2\tIris-setosa\n6.4\t2.8\t5.6\t2.1\tIris-virginica\n4.6\t3.6\t1.0\t0.2\tIris-setosa\n4.4\t2.9\t1.4\t0.2\tIris-setosa\n4.4\t3.2\t1.3\t0.2\tIris-setosa\n6.2\t3.4\t5.4\t2.3\tIris-virginica\n6.3\t3.4\t5.6\t2.4\tIris-virginica\n6.8\t2.8\t4.8\t1.4\tIris-versicolor\n5.1\t3.4\t1.5\t0.2\tIris-setosa\n6.1\t3.0\t4.9\t1.8\tIris-virginica\n5.7\t3.0\t4.2\t1.2\tIris-versicolor\n5.0\t3.4\t1.5\t0.2\tIris-setosa\n5.0\t3.5\t1.6\t0.6\tIris-setosa\n7.7\t3.8\t6.7\t2.2\tIris-virginica\n4.9\t3.1\t1.5\t0.1\tIris-setosa\n6.0\t2.2\t4.0\t1.0\tIris-versicolor\n6.8\t3.0\t5.5\t2.1\tIris-virginica\n5.1\t2.5\t3.0\t1.1\tIris-versicolor\n6.5\t3.2\t5.1\t2.0\tIris-virginica\n4.7\t3.2\t1.3\t0.2\tIris-setosa\n6.6\t3.0\t4.4\t1.4\tIris-versicolor\n6.7\t3.0\t5.0\t1.7\tIris-versicolor\n4.8\t3.0\t1.4\t0.3\tIris-setosa\n5.1\t3.8\t1.5\t0.3\tIris-setosa\n7.7\t2.6\t6.9\t2.3\tIris-virginica\n5.1\t3.8\t1.6\t0.2\tIris-setosa\n5.0\t2.0\t3.5\t1.0\tIris-versicolor\n7.7\t3.0\t6.1\t2.3\tIris-vi
rginica\n6.5\t3.0\t5.8\t2.2\tIris-virginica\n5.8\t4.0\t1.2\t0.2\tIris-setosa\n5.4\t3.4\t1.5\t0.4\tIris-setosa\n6.2\t2.2\t4.5\t1.5\tIris-versicolor\n5.7\t2.8\t4.5\t1.3\tIris-versicolor\n5.5\t2.5\t4.0\t1.3\tIris-versicolor\n7.3\t2.9\t6.3\t1.8\tIris-virginica\n5.6\t3.0\t4.5\t1.5\tIris-versicolor\n6.2\t2.8\t4.8\t1.8\tIris-virginica\n4.3\t3.0\t1.1\t0.1\tIris-setosa\n5.8\t2.7\t3.9\t1.2\tIris-versicolor\n7.9\t3.8\t6.4\t2.0\tIris-virginica\n6.2\t2.9\t4.3\t1.3\tIris-versicolor\n4.9\t2.5\t4.5\t1.7\tIris-virginica\n4.9\t3.6\t1.4\t0.1\tIris-setosa\n5.2\t3.4\t1.4\t0.2\tIris-setosa\n6.0\t2.7\t5.1\t1.6\tIris-versicolor\n6.9\t3.1\t5.4\t2.1\tIris-virginica\n4.8\t3.1\t1.6\t0.2\tIris-setosa\n6.7\t3.1\t5.6\t2.4\tIris-virginica\n6.3\t3.3\t4.7\t1.6\tIris-versicolor\n5.2\t4.1\t1.5\t0.1\tIris-setosa\n5.4\t3.9\t1.3\t0.4\tIris-setosa\n4.9\t2.4\t3.3\t1.0\tIris-versicolor\n5.5\t2.4\t3.7\t1.0\tIris-versicolor\n5.1\t3.5\t1.4\t0.3\tIris-setosa\n6.1\t3.0\t4.6\t1.4\tIris-versicolor\n5.1\t3.3\t1.7\t0.5\tIris-setosa\n4.4\t3.0\t1.3\t0.2\tIris-setosa\n5.9\t3.2\t4.8\t1.8\tIris-versicolor\n4.7\t3.2\t1.6\t0.2\tIris-setosa\n6.9\t3.1\t5.1\t2.3\tIris-virginica\n5.4\t3.9\t1.7\t0.4\tIris-setosa\n5.8\t2.7\t4.1\t1.0\tIris-versicolor\n6.1\t2.8\t4.7\t1.2\tIris-versicolor\n6.0\t3.0\t4.8\t1.8\tIris-virginica\n6.7\t2.5\t5.8\t1.8\tIris-virginica\n"
  },
  {
    "path": "project/build.properties",
    "content": "sbt.version=0.13.9\n"
  },
  {
    "path": "project/plugins.sbt",
    "content": "addSbtPlugin(\"com.eed3si9n\" % \"sbt-assembly\" % \"0.13.0\")\n\naddSbtPlugin(\"no.arktekk.sbt\" % \"aether-deploy\" % \"0.14\")\n\naddSbtPlugin(\"org.xerial.sbt\" % \"sbt-sonatype\" % \"0.2.1\")\n\naddSbtPlugin(\"com.typesafe.sbt\" % \"sbt-pgp\" % \"0.8.3\")"
  },
  {
    "path": "sbt",
    "content": "#!/usr/bin/env bash\n#\n# A more capable sbt runner, coincidentally also called sbt.\n# Author: Paul Phillips <paulp@improving.org>\n\n# todo - make this dynamic\ndeclare -r sbt_release_version=\"0.13.8\"\ndeclare -r sbt_unreleased_version=\"0.13.9-M1\"\ndeclare -r buildProps=\"project/build.properties\"\n\ndeclare sbt_jar sbt_dir sbt_create sbt_version\ndeclare scala_version sbt_explicit_version\ndeclare verbose noshare batch trace_level log_level\ndeclare sbt_saved_stty debugUs\n\nechoerr () { echo >&2 \"$@\"; }\nvlog ()    { [[ -n \"$verbose\" ]] && echoerr \"$@\"; }\n\n# spaces are possible, e.g. sbt.version = 0.13.0\nbuild_props_sbt () {\n  [[ -r \"$buildProps\" ]] && \\\n    grep '^sbt\\.version' \"$buildProps\" | tr '=\\r' ' ' | awk '{ print $2; }'\n}\n\nupdate_build_props_sbt () {\n  local ver=\"$1\"\n  local old=\"$(build_props_sbt)\"\n\n  [[ -r \"$buildProps\" ]] && [[ \"$ver\" != \"$old\" ]] && {\n    perl -pi -e \"s/^sbt\\.version\\b.*\\$/sbt.version=${ver}/\" \"$buildProps\"\n    grep -q '^sbt.version[ =]' \"$buildProps\" || printf \"\\nsbt.version=%s\\n\" \"$ver\" >> \"$buildProps\"\n\n    vlog \"!!!\"\n    vlog \"!!! Updated file $buildProps setting sbt.version to: $ver\"\n    vlog \"!!! 
Previous value was: $old\"\n    vlog \"!!!\"\n  }\n}\n\nset_sbt_version () {\n  sbt_version=\"${sbt_explicit_version:-$(build_props_sbt)}\"\n  [[ -n \"$sbt_version\" ]] || sbt_version=$sbt_release_version\n  export sbt_version\n}\n\n# restore stty settings (echo in particular)\nonSbtRunnerExit() {\n  [[ -n \"$sbt_saved_stty\" ]] || return\n  vlog \"\"\n  vlog \"restoring stty: $sbt_saved_stty\"\n  stty \"$sbt_saved_stty\"\n  unset sbt_saved_stty\n}\n\n# save stty and trap exit, to ensure echo is reenabled if we are interrupted.\ntrap onSbtRunnerExit EXIT\nsbt_saved_stty=\"$(stty -g 2>/dev/null)\"\nvlog \"Saved stty: $sbt_saved_stty\"\n\n# this seems to cover the bases on OSX, and someone will\n# have to tell me about the others.\nget_script_path () {\n  local path=\"$1\"\n  [[ -L \"$path\" ]] || { echo \"$path\" ; return; }\n\n  local target=\"$(readlink \"$path\")\"\n  if [[ \"${target:0:1}\" == \"/\" ]]; then\n    echo \"$target\"\n  else\n    echo \"${path%/*}/$target\"\n  fi\n}\n\ndie() {\n  echo \"Aborting: $@\"\n  exit 1\n}\n\nmake_url () {\n  version=\"$1\"\n\n  case \"$version\" in\n        0.7.*) echo \"http://simple-build-tool.googlecode.com/files/sbt-launch-0.7.7.jar\" ;;\n      0.10.* ) echo \"$sbt_launch_repo/org.scala-tools.sbt/sbt-launch/$version/sbt-launch.jar\" ;;\n    0.11.[12]) echo \"$sbt_launch_repo/org.scala-tools.sbt/sbt-launch/$version/sbt-launch.jar\" ;;\n            *) echo \"$sbt_launch_repo/org.scala-sbt/sbt-launch/$version/sbt-launch.jar\" ;;\n  esac\n}\n\ninit_default_option_file () {\n  local overriding_var=\"${!1}\"\n  local default_file=\"$2\"\n  if [[ ! 
-r \"$default_file\" && \"$overriding_var\" =~ ^@(.*)$ ]]; then\n    local envvar_file=\"${BASH_REMATCH[1]}\"\n    if [[ -r \"$envvar_file\" ]]; then\n      default_file=\"$envvar_file\"\n    fi\n  fi\n  echo \"$default_file\"\n}\n\ndeclare -r cms_opts=\"-XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC\"\ndeclare -r jit_opts=\"-XX:ReservedCodeCacheSize=256m -XX:+TieredCompilation\"\ndeclare -r default_jvm_opts_common=\"-Xms512m -Xmx1536m -Xss2m $jit_opts $cms_opts\"\ndeclare -r noshare_opts=\"-Dsbt.global.base=project/.sbtboot -Dsbt.boot.directory=project/.boot -Dsbt.ivy.home=project/.ivy\"\ndeclare -r latest_28=\"2.8.2\"\ndeclare -r latest_29=\"2.9.3\"\ndeclare -r latest_210=\"2.10.5\"\ndeclare -r latest_211=\"2.11.7\"\n\ndeclare -r script_path=\"$(get_script_path \"$BASH_SOURCE\")\"\ndeclare -r script_name=\"${script_path##*/}\"\n\n# some non-read-onlies set with defaults\ndeclare java_cmd=\"java\"\ndeclare sbt_opts_file=\"$(init_default_option_file SBT_OPTS .sbtopts)\"\ndeclare jvm_opts_file=\"$(init_default_option_file JVM_OPTS .jvmopts)\"\ndeclare sbt_launch_repo=\"http://repo.typesafe.com/typesafe/ivy-releases\"\n\n# pull -J and -D options to give to java.\ndeclare -a residual_args\ndeclare -a java_args\ndeclare -a scalac_args\ndeclare -a sbt_commands\n\n# args to jvm/sbt via files or environment variables\ndeclare -a extra_jvm_opts extra_sbt_opts\n\naddJava () {\n  vlog \"[addJava] arg = '$1'\"\n  java_args+=(\"$1\")\n}\naddSbt () {\n  vlog \"[addSbt] arg = '$1'\"\n  sbt_commands+=(\"$1\")\n}\nsetThisBuild () {\n  vlog \"[addBuild] args = '$@'\"\n  local key=\"$1\" && shift\n  addSbt \"set $key in ThisBuild := $@\"\n}\naddScalac () {\n  vlog \"[addScalac] arg = '$1'\"\n  scalac_args+=(\"$1\")\n}\naddResidual () {\n  vlog \"[residual] arg = '$1'\"\n  residual_args+=(\"$1\")\n}\naddResolver () {\n  addSbt \"set resolvers += $1\"\n}\naddDebugger () {\n  addJava \"-Xdebug\"\n  addJava 
\"-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=$1\"\n}\nsetScalaVersion () {\n  [[ \"$1\" == *\"-SNAPSHOT\" ]] && addResolver 'Resolver.sonatypeRepo(\"snapshots\")'\n  addSbt \"++ $1\"\n}\nsetJavaHome () {\n  java_cmd=\"$1/bin/java\"\n  setThisBuild javaHome \"Some(file(\\\"$1\\\"))\"\n  export JAVA_HOME=\"$1\"\n  export JDK_HOME=\"$1\"\n  export PATH=\"$JAVA_HOME/bin:$PATH\"\n}\nsetJavaHomeQuietly () {\n  addSbt warn\n  setJavaHome \"$1\"\n  addSbt info\n}\n\n# if set, use JDK_HOME/JAVA_HOME over java found in path\nif [[ -e \"$JDK_HOME/lib/tools.jar\" ]]; then\n  setJavaHomeQuietly \"$JDK_HOME\"\nelif [[ -e \"$JAVA_HOME/bin/java\" ]]; then\n  setJavaHomeQuietly \"$JAVA_HOME\"\nfi\n\n# directory to store sbt launchers\ndeclare sbt_launch_dir=\"$HOME/.sbt/launchers\"\n[[ -d \"$sbt_launch_dir\" ]] || mkdir -p \"$sbt_launch_dir\"\n[[ -w \"$sbt_launch_dir\" ]] || sbt_launch_dir=\"$(mktemp -d -t sbt_extras_launchers.XXXXXX)\"\n\njava_version () {\n  local version=$(\"$java_cmd\" -version 2>&1 | grep -E -e '(java|openjdk) version' | awk '{ print $3 }' | tr -d \\\")\n  vlog \"Detected Java version: $version\"\n  echo \"${version:2:1}\"\n}\n\n# MaxPermSize critical on pre-8 jvms but incurs noisy warning on 8+\ndefault_jvm_opts () {\n  local v=\"$(java_version)\"\n  if [[ $v -ge 8 ]]; then\n    echo \"$default_jvm_opts_common\"\n  else\n    echo \"-XX:MaxPermSize=384m $default_jvm_opts_common\"\n  fi\n}\n\nbuild_props_scala () {\n  if [[ -r \"$buildProps\" ]]; then\n    versionLine=\"$(grep '^build.scala.versions' \"$buildProps\")\"\n    versionString=\"${versionLine##build.scala.versions=}\"\n    echo \"${versionString%% .*}\"\n  fi\n}\n\nexecRunner () {\n  # print the arguments one to a line, quoting any containing spaces\n  vlog \"# Executing command line:\" && {\n    for arg; do\n      if [[ -n \"$arg\" ]]; then\n        if printf \"%s\\n\" \"$arg\" | grep -q ' '; then\n          printf >&2 \"\\\"%s\\\"\\n\" \"$arg\"\n        else\n          printf >&2 
\"%s\\n\" \"$arg\"\n        fi\n      fi\n    done\n    vlog \"\"\n  }\n\n  [[ -n \"$batch\" ]] && exec </dev/null\n  exec \"$@\"\n}\n\njar_url () {\n  make_url \"$1\"\n}\n\njar_file () {\n  echo \"$sbt_launch_dir/$1/sbt-launch.jar\"\n}\n\ndownload_url () {\n  local url=\"$1\"\n  local jar=\"$2\"\n\n  echoerr \"Downloading sbt launcher for $sbt_version:\"\n  echoerr \"  From  $url\"\n  echoerr \"    To  $jar\"\n\n  mkdir -p \"${jar%/*}\" && {\n    if which curl >/dev/null; then\n      curl --fail --silent --location \"$url\" --output \"$jar\"\n    elif which wget >/dev/null; then\n      wget --quiet -O \"$jar\" \"$url\"\n    fi\n  } && [[ -r \"$jar\" ]]\n}\n\nacquire_sbt_jar () {\n  sbt_url=\"$(jar_url \"$sbt_version\")\"\n  sbt_jar=\"$(jar_file \"$sbt_version\")\"\n\n  [[ -r \"$sbt_jar\" ]] || download_url \"$sbt_url\" \"$sbt_jar\"\n}\n\nusage () {\n  cat <<EOM\nUsage: $script_name [options]\n\nNote that options which are passed along to sbt begin with -- whereas\noptions to this runner use a single dash. Any sbt command can be scheduled\nto run first by prefixing the command with --, so --warn, --error and so on\nare not special.\n\nOutput filtering: if there is a file in the home directory called .sbtignore\nand this is not an interactive sbt session, the file is treated as a list of\nbash regular expressions. 
Output lines which match any regex are not echoed.\nOne can see exactly which lines would have been suppressed by starting this\nrunner with the -x option.\n\n  -h | -help         print this message\n  -v                 verbose operation (this runner is chattier)\n  -d, -w, -q         aliases for --debug, --warn, --error (q means quiet)\n  -x                 debug this script\n  -trace <level>     display stack traces with a max of <level> frames (default: -1, traces suppressed)\n  -debug-inc         enable debugging log for the incremental compiler\n  -no-colors         disable ANSI color codes\n  -sbt-create        start sbt even if current directory contains no sbt project\n  -sbt-dir   <path>  path to global settings/plugins directory (default: ~/.sbt/<version>)\n  -sbt-boot  <path>  path to shared boot directory (default: ~/.sbt/boot in 0.11+)\n  -ivy       <path>  path to local Ivy repository (default: ~/.ivy2)\n  -no-share          use all local caches; no sharing\n  -offline           put sbt in offline mode\n  -jvm-debug <port>  Turn on JVM debugging, open at the given port.\n  -batch             Disable interactive mode\n  -prompt <expr>     Set the sbt prompt; in expr, 's' is the State and 'e' is Extracted\n\n  # sbt version (default: sbt.version from $buildProps if present, otherwise $sbt_release_version)\n  -sbt-force-latest         force the use of the latest release of sbt: $sbt_release_version\n  -sbt-version  <version>   use the specified version of sbt (default: $sbt_release_version)\n  -sbt-dev                  use the latest pre-release version of sbt: $sbt_unreleased_version\n  -sbt-jar      <path>      use the specified jar as the sbt launcher\n  -sbt-launch-dir <path>    directory to hold sbt launchers (default: ~/.sbt/launchers)\n  -sbt-launch-repo <url>    repo url for downloading sbt launcher jar (default: $sbt_launch_repo)\n\n  # scala version (default: as chosen by sbt)\n  -28                       use $latest_28\n  -29                  
     use $latest_29\n  -210                      use $latest_210\n  -211                      use $latest_211\n  -scala-home <path>        use the scala build at the specified directory\n  -scala-version <version>  use the specified version of scala\n  -binary-version <version> use the specified scala version when searching for dependencies\n\n  # java version (default: java from PATH, currently $(java -version 2>&1 | grep version))\n  -java-home <path>         alternate JAVA_HOME\n\n  # passing options to the jvm - note it does NOT use JAVA_OPTS due to pollution\n  # The default set is used if JVM_OPTS is unset and no -jvm-opts file is found\n  <default>        $(default_jvm_opts)\n  JVM_OPTS         environment variable holding either the jvm args directly, or\n                   the reference to a file containing jvm args if given path is prepended by '@' (e.g. '@/etc/jvmopts')\n                   Note: \"@\"-file is overridden by local '.jvmopts' or '-jvm-opts' argument.\n  -jvm-opts <path> file containing jvm args (if not given, .jvmopts in project root is used if present)\n  -Dkey=val        pass -Dkey=val directly to the jvm\n  -J-X             pass option -X directly to the jvm (-J is stripped)\n\n  # passing options to sbt, OR to this runner\n  SBT_OPTS         environment variable holding either the sbt args directly, or\n                   the reference to a file containing sbt args if given path is prepended by '@' (e.g. 
'@/etc/sbtopts')\n                   Note: \"@\"-file is overridden by local '.sbtopts' or '-sbt-opts' argument.\n  -sbt-opts <path> file containing sbt args (if not given, .sbtopts in project root is used if present)\n  -S-X             add -X to sbt's scalacOptions (-S is stripped)\nEOM\n}\n\nprocess_args ()\n{\n  require_arg () {\n    local type=\"$1\"\n    local opt=\"$2\"\n    local arg=\"$3\"\n\n    if [[ -z \"$arg\" ]] || [[ \"${arg:0:1}\" == \"-\" ]]; then\n      die \"$opt requires <$type> argument\"\n    fi\n  }\n  while [[ $# -gt 0 ]]; do\n    case \"$1\" in\n          -h|-help) usage; exit 1 ;;\n                -v) verbose=true && shift ;;\n                -d) addSbt \"--debug\" && addSbt debug && shift ;;\n                -w) addSbt \"--warn\"  && addSbt warn  && shift ;;\n                -q) addSbt \"--error\" && addSbt error && shift ;;\n                -x) debugUs=true && shift ;;\n            -trace) require_arg integer \"$1\" \"$2\" && trace_level=\"$2\" && shift 2 ;;\n              -ivy) require_arg path \"$1\" \"$2\" && addJava \"-Dsbt.ivy.home=$2\" && shift 2 ;;\n        -no-colors) addJava \"-Dsbt.log.noformat=true\" && shift ;;\n         -no-share) noshare=true && shift ;;\n         -sbt-boot) require_arg path \"$1\" \"$2\" && addJava \"-Dsbt.boot.directory=$2\" && shift 2 ;;\n          -sbt-dir) require_arg path \"$1\" \"$2\" && sbt_dir=\"$2\" && shift 2 ;;\n        -debug-inc) addJava \"-Dxsbt.inc.debug=true\" && shift ;;\n          -offline) addSbt \"set offline := true\" && shift ;;\n        -jvm-debug) require_arg port \"$1\" \"$2\" && addDebugger \"$2\" && shift 2 ;;\n            -batch) batch=true && shift ;;\n           -prompt) require_arg \"expr\" \"$1\" \"$2\" && setThisBuild shellPrompt \"(s => { val e = Project.extract(s) ; $2 })\" && shift 2 ;;\n\n       -sbt-create) sbt_create=true && shift ;;\n          -sbt-jar) require_arg path \"$1\" \"$2\" && sbt_jar=\"$2\" && shift 2 ;;\n      -sbt-version) require_arg version \"$1\" 
\"$2\" && sbt_explicit_version=\"$2\" && shift 2 ;;\n -sbt-force-latest) sbt_explicit_version=\"$sbt_release_version\" && shift ;;\n          -sbt-dev) sbt_explicit_version=\"$sbt_unreleased_version\" && shift ;;\n   -sbt-launch-dir) require_arg path \"$1\" \"$2\" && sbt_launch_dir=\"$2\" && shift 2 ;;\n  -sbt-launch-repo) require_arg path \"$1\" \"$2\" && sbt_launch_repo=\"$2\" && shift 2 ;;\n    -scala-version) require_arg version \"$1\" \"$2\" && setScalaVersion \"$2\" && shift 2 ;;\n   -binary-version) require_arg version \"$1\" \"$2\" && setThisBuild scalaBinaryVersion \"\\\"$2\\\"\" && shift 2 ;;\n       -scala-home) require_arg path \"$1\" \"$2\" && setThisBuild scalaHome \"Some(file(\\\"$2\\\"))\" && shift 2 ;;\n        -java-home) require_arg path \"$1\" \"$2\" && setJavaHome \"$2\" && shift 2 ;;\n         -sbt-opts) require_arg path \"$1\" \"$2\" && sbt_opts_file=\"$2\" && shift 2 ;;\n         -jvm-opts) require_arg path \"$1\" \"$2\" && jvm_opts_file=\"$2\" && shift 2 ;;\n\n               -D*) addJava \"$1\" && shift ;;\n               -J*) addJava \"${1:2}\" && shift ;;\n               -S*) addScalac \"${1:2}\" && shift ;;\n               -28) setScalaVersion \"$latest_28\" && shift ;;\n               -29) setScalaVersion \"$latest_29\" && shift ;;\n              -210) setScalaVersion \"$latest_210\" && shift ;;\n              -211) setScalaVersion \"$latest_211\" && shift ;;\n\n           --debug) addSbt debug && addResidual \"$1\" && shift ;;\n            --warn) addSbt warn  && addResidual \"$1\" && shift ;;\n           --error) addSbt error && addResidual \"$1\" && shift ;;\n                 *) addResidual \"$1\" && shift ;;\n    esac\n  done\n}\n\n# process the direct command line arguments\nprocess_args \"$@\"\n\n# skip #-styled comments and blank lines\nreadConfigFile() {\n  while read line; do\n    [[ $line =~ ^# ]] || [[ -z $line ]] || echo \"$line\"\n  done < \"$1\"\n}\n\n# if there are file/environment sbt_opts, process again so we\n# can 
supply args to this runner\nif [[ -r \"$sbt_opts_file\" ]]; then\n  vlog \"Using sbt options defined in file $sbt_opts_file\"\n  while read opt; do extra_sbt_opts+=(\"$opt\"); done < <(readConfigFile \"$sbt_opts_file\")\nelif [[ -n \"$SBT_OPTS\" && ! (\"$SBT_OPTS\" =~ ^@.*) ]]; then\n  vlog \"Using sbt options defined in variable \\$SBT_OPTS\"\n  extra_sbt_opts=( $SBT_OPTS )\nelse\n  vlog \"No extra sbt options have been defined\"\nfi\n\n[[ -n \"${extra_sbt_opts[*]}\" ]] && process_args \"${extra_sbt_opts[@]}\"\n\n# reset \"$@\" to the residual args\nset -- \"${residual_args[@]}\"\nargumentCount=$#\n\n# set sbt version\nset_sbt_version\n\n# only exists in 0.12+\nsetTraceLevel() {\n  case \"$sbt_version\" in\n    \"0.7.\"* | \"0.10.\"* | \"0.11.\"* ) echoerr \"Cannot set trace level in sbt version $sbt_version\" ;;\n                                 *) setThisBuild traceLevel $trace_level ;;\n  esac\n}\n\n# set scalacOptions if we were given any -S opts\n[[ ${#scalac_args[@]} -eq 0 ]] || addSbt \"set scalacOptions in ThisBuild += \\\"${scalac_args[@]}\\\"\"\n\n# Update build.properties on disk to set explicit version - sbt gives us no choice\n[[ -n \"$sbt_explicit_version\" ]] && update_build_props_sbt \"$sbt_explicit_version\"\nvlog \"Detected sbt version $sbt_version\"\n\n[[ -n \"$scala_version\" ]] && vlog \"Overriding scala version to $scala_version\"\n\n# no args - alert them there's stuff in here\n(( argumentCount > 0 )) || {\n  vlog \"Starting $script_name: invoke with -help for other options\"\n  residual_args=( shell )\n}\n\n# verify this is an sbt dir or -create was given\n[[ -r ./build.sbt || -d ./project || -n \"$sbt_create\" ]] || {\n  cat <<EOM\n$(pwd) doesn't appear to be an sbt project.\nIf you want to start sbt anyway, run:\n  $0 -sbt-create\n\nEOM\n  exit 1\n}\n\n# pick up completion if present; todo\n[[ -r .sbt_completion.sh ]] && source .sbt_completion.sh\n\n# no jar? download it.\n[[ -r \"$sbt_jar\" ]] || acquire_sbt_jar || {\n  # still no jar? 
uh-oh.\n  echo \"Download failed. Obtain the jar manually and place it at $sbt_jar\"\n  exit 1\n}\n\nif [[ -n \"$noshare\" ]]; then\n  for opt in ${noshare_opts}; do\n    addJava \"$opt\"\n  done\nelse\n  case \"$sbt_version\" in\n    \"0.7.\"* | \"0.10.\"* | \"0.11.\"* | \"0.12.\"* )\n      [[ -n \"$sbt_dir\" ]] || {\n        sbt_dir=\"$HOME/.sbt/$sbt_version\"\n        vlog \"Using $sbt_dir as sbt dir, -sbt-dir to override.\"\n      }\n    ;;\n  esac\n\n  if [[ -n \"$sbt_dir\" ]]; then\n    addJava \"-Dsbt.global.base=$sbt_dir\"\n  fi\nfi\n\nif [[ -r \"$jvm_opts_file\" ]]; then\n  vlog \"Using jvm options defined in file $jvm_opts_file\"\n  while read opt; do extra_jvm_opts+=(\"$opt\"); done < <(readConfigFile \"$jvm_opts_file\")\nelif [[ -n \"$JVM_OPTS\" && ! (\"$JVM_OPTS\" =~ ^@.*) ]]; then\n  vlog \"Using jvm options defined in \\$JVM_OPTS variable\"\n  extra_jvm_opts=( $JVM_OPTS )\nelse\n  vlog \"Using default jvm options\"\n  extra_jvm_opts=( $(default_jvm_opts) )\nfi\n\n# traceLevel is 0.12+\n[[ -n \"$trace_level\" ]] && setTraceLevel\n\nmain () {\n  execRunner \"$java_cmd\" \\\n    \"${extra_jvm_opts[@]}\" \\\n    \"${java_args[@]}\" \\\n    -jar \"$sbt_jar\" \\\n    \"${sbt_commands[@]}\" \\\n    \"${residual_args[@]}\"\n}\n\n# sbt inserts this string on certain lines when formatting is enabled:\n#   val OverwriteLine = \"\\r\\u001BM\\u001B[2K\"\n# ...in order not to spam the console with a million \"Resolving\" lines.\n# Unfortunately that makes it that much harder to work with when\n# we're not going to print those lines anyway. 
We strip that bit of\n# line noise, but leave the other codes to preserve color.\nmainFiltered () {\n  local ansiOverwrite='\\r\\x1BM\\x1B[2K'\n  local excludeRegex=$(egrep -v '^#|^$' ~/.sbtignore | paste -sd'|' -)\n\n  echoLine () {\n    local line=\"$1\"\n    local line1=\"$(echo \"$line\" | sed -r 's/\\r\\x1BM\\x1B\\[2K//g')\"       # This strips the OverwriteLine code.\n    local line2=\"$(echo \"$line1\" | sed -r 's/\\x1B\\[[0-9;]*[JKmsu]//g')\" # This strips all codes - we test regexes against this.\n\n    if [[ $line2 =~ $excludeRegex ]]; then\n      [[ -n $debugUs ]] && echo \"[X] $line1\"\n    else\n      [[ -n $debugUs ]] && echo \"    $line1\" || echo \"$line1\"\n    fi\n  }\n\n  echoLine \"Starting sbt with output filtering enabled.\"\n  main | while read -r line; do echoLine \"$line\"; done\n}\n\n# Only filter if there's a filter file and we don't see a known interactive command.\n# Obviously this is super ad hoc but I don't know how to improve on it. Testing whether\n# stdin is a terminal is useless because most of my use cases for this filtering are\n# exactly when I'm at a terminal, running sbt non-interactively.\nshouldFilter () { [[ -f ~/.sbtignore ]] && ! egrep -q '\\b(shell|console|consoleProject)\\b' <<<\"${residual_args[@]}\"; }\n\n# run sbt\nif shouldFilter; then mainFiltered; else main; fi\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/GenericPair.java",
    "content": "package com.etsy.conjecture;\n\n/**\n * @author Diane Hu\n */\npublic class GenericPair<F, S> implements java.io.Serializable {\n\n    private static final long serialVersionUID = 123L;\n    public F first;\n    public S second;\n\n    /**\n     * Class constructor specifying the first and second number to create\n     * \n     * @param first\n     *            first number\n     * @param second\n     *            second number\n     */\n\n    public GenericPair(F first, S second) {\n        this.first = first;\n        this.second = second;\n    }\n\n    /**\n     * The method gets first number\n     * \n     * @return first number\n     */\n    public F getFirst() {\n        return first;\n    }\n\n    /**\n     * The method sets first number\n     * \n     * @param fisrt\n     *            first number\n     */\n    public void setFirst(F first) {\n        this.first = first;\n    }\n\n    /**\n     * The method gets second number\n     * \n     * @return second number\n     */\n    public S getSecond() {\n        return second;\n    }\n\n    /**\n     * The method sets second number\n     * \n     * @param second\n     *            second number\n     */\n    public void setSecond(S second) {\n        this.second = second;\n    }\n\n    @Override\n    public String toString() {\n        return first + \",\" + second;\n    }\n\n    @SuppressWarnings(\"unchecked\")\n    public boolean equals(Object o) {\n        if (!(o instanceof GenericPair<?, ?>))\n            return false;\n        GenericPair<F, S> p = (GenericPair<F, S>)o;\n        return (p.first).equals(first) && (p.second).equals(second);\n    }\n\n    public int hashCode() {\n        return 17 + first.hashCode() * 31 + second.hashCode();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/PrimitivePair.java",
    "content": "package com.etsy.conjecture;\n\n/**\n * PrimitivePair is JavaBean\n * \n * @author Josh Attenberg\n */\npublic class PrimitivePair implements java.io.Serializable {\n    private static final long serialVersionUID = 1234L;\n    public double first;\n    public double second;\n\n    /**\n     * Class constructor specifying the first and second number to create\n     * \n     * @param first\n     *            first number\n     * @param second\n     *            second number\n     */\n    public PrimitivePair(double first, double second) {\n        this.first = first;\n        this.second = second;\n    }\n\n    /**\n     * The method gets first number\n     * \n     * @return first number\n     */\n    public double getFirst() {\n        return first;\n    }\n\n    /**\n     * The method sets first number\n     * \n     * @param fisrt\n     *            first number\n     */\n    public void setFirst(double fisrt) {\n        this.first = fisrt;\n    }\n\n    /**\n     * The method gets second number\n     * \n     * @return second number\n     */\n    public double getSecond() {\n        return second;\n    }\n\n    /**\n     * The method sets second number\n     * \n     * @param second\n     *            second number\n     */\n    public void setSecond(double second) {\n        this.second = second;\n    }\n\n    @Override\n    public String toString() {\n        return first + \",\" + second;\n    }\n\n    @Override\n    public boolean equals(Object o) {\n        if (!(o instanceof PrimitivePair))\n            return false;\n        PrimitivePair p = (PrimitivePair)o;\n        return p.first == first && p.second == second;\n    }\n\n    @Override\n    public int hashCode() {\n        return (17 + Utilities.doubleHash(first)) * 31\n                + Utilities.doubleHash(second);\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/Utilities.java",
    "content": "package com.etsy.conjecture;\n\nimport java.util.ArrayList;\nimport java.util.Arrays;\nimport java.util.Collection;\nimport java.util.Collections;\nimport java.util.Comparator;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.StringTokenizer;\n\nimport org.apache.commons.lang.StringUtils;\nimport com.google.common.hash.*;\nimport com.google.common.collect.Lists;\n\n/**\n * class of static data science utility methods\n * \n * @author jattenberg\n * \n */\npublic class Utilities {\n\n    public static final double SMALL = 1e-10;\n    public static final HashFunction HASHER = Hashing.md5();\n    public static final double ROOT2 = Math.sqrt(2d);\n    public static final double LOG2 = Math.log(2.);\n\n    private Utilities() {\n    }\n\n    public static String cleanLine(String line) {\n        StringBuffer buffer = new StringBuffer();\n        for (int i = 0; i < line.length(); i++) {\n            char c = line.charAt(i);\n            if (c < 128 && Character.isLetter(c)) {\n                buffer.append(c);\n            } else {\n                buffer.append(' ');\n            }\n        }\n        return buffer.toString().toLowerCase();\n    }\n\n    public static String cleanLineRobust(String input, String separator,\n            boolean ignoreNumbers) {\n        StringBuilder buff = new StringBuilder();\n        StringTokenizer tokenizer = new StringTokenizer(input,\n                \" +.,~\\\\<>\\\\$?!:;(){}|\" + \"\\b\\t\\n\\f\\r\\\"\\'\\\\\\\\/\\\\=\\\\&\\\\%\\\\_\");\n\n        while (tokenizer.hasMoreTokens()) {\n            String token = tokenizer.nextToken();\n            token = token.replaceAll(\"-{2,}\", \"-\");\n            token = token.replaceAll(\"^-\", \"\");\n            token = token.replaceAll(\"-$\", \"\");\n            if (token.length() < 2\n                    || (ignoreNumbers && StringUtils.containsAny(token,\n                            \"0123456789\")))\n                continue;\n            
buff.append(token + separator);\n        }\n        int index = buff.lastIndexOf(separator);\n        if (index >= 0)\n            buff.delete(index, buff.length());\n        return buff.toString();\n    }\n\n    public static String checkNotBlank(String s) {\n        if (StringUtils.isBlank(s)) {\n            throw new IllegalArgumentException(\"Argument cannot be blank\");\n        }\n        return s;\n    }\n\n    public static List<String> checkNotBlank(List<String> S) {\n        for (String s : S)\n            checkNotBlank(s);\n        return S;\n    }\n\n    public static String[] checkNotBlank(String[] S) {\n        for (String s : S)\n            checkNotBlank(s);\n        return S;\n    }\n\n    public static double stringInnerProduct(Map<String, Double> coefficients,\n            Collection<String> input) {\n        double output = 0;\n        for (String token : input)\n            output += coefficients.containsKey(token) ? coefficients.get(token)\n                    : 0;\n        return output;\n    }\n\n    public static double sigmoid(double operand) {\n        return 1. / (1. + Math.exp(-operand));\n    }\n\n    /**\n     * derivative of the sigmoid function\n     */\n    public static double dsigmoid(double operand) {\n        return Math.exp(operand) / Math.pow(1. 
+ Math.exp(operand), 2.);\n    }\n\n    /**\n     * returns the strings in input in sorted order\n     * \n     * @param input\n     * @return\n     */\n    public static String sortTerms(String input) {\n        return sortTerms(input, \"\\\\s+\");\n    }\n\n    public static String sortTerms(String input, String delim) {\n        String[] terms = input.split(delim);\n        Arrays.sort(terms);\n        return StringUtils.join(terms, delim);\n    }\n\n    public final static String cleanText(String tmp, int maxlen) {\n\n        StringTokenizer tok = new StringTokenizer(tmp,\n                \" +.,~\\\\<>\\\\$?!:;(){}|-0123456789\\b\\t\\n\\f\\r\\\"\\'\\\\\\\\/\\\\=\\\\&\\\\%\\\\_\");\n        StringBuilder buff = new StringBuilder();\n        while (tok.hasMoreTokens()) {\n            String out = tok.nextToken();\n            if (out.length() < 2 || out.length() > maxlen)\n                continue;\n            buff.append(out + \" \");\n        }\n        return buff.toString();\n    }\n\n    public final static List<String> grams(String input, int[] gramSizes,\n            String separator) {\n        List<String> out = Lists.newArrayList();\n        StringBuilder buff = new StringBuilder();\n        String[] tokens = StringUtils.split(input);\n\n        for (int i = 0; i < tokens.length; i++) {\n            String token = tokens[i];\n            for (int len : gramSizes) {\n                if (len > i + 1)\n                    continue;\n                if (len == 1) {\n                    out.add(token);\n                    continue;\n                }\n                buff.setLength(0);\n\n                for (int k = len - 1; k > 0; k--)\n                    buff.append(tokens[i - k] + separator);\n                buff.append(token);\n                out.add(buff.toString());\n            }\n        }\n        return out;\n    }\n\n    public static final boolean floatingPointEquals(double a, double b) {\n        return (a - b < SMALL) && (b - a < 
SMALL);\n    }\n\n    public static int doubleHash(double d) {\n        long t = Double.doubleToLongBits(d);\n        return (int)(t ^ (t >>> 32));\n    }\n\n    public static double logistic(double x) {\n        return 1d / (1 + Math.exp(-x));\n    }\n\n    static class ValueComparator<K, V extends Comparable<? super V>> implements\n            Comparator<Map.Entry<K, V>> {\n        boolean reverse;\n\n        public ValueComparator(boolean reverse) {\n            this.reverse = reverse;\n        }\n\n        public int compare(Map.Entry<K, V> a, Map.Entry<K, V> b) {\n            int res = a.getValue().compareTo(b.getValue());\n            return reverse ? -res : res;\n        }\n    }\n\n    public static <K, V extends Comparable<? super V>> ArrayList<K> orderKeysByValue(\n            Map<K, V> map) {\n        return orderKeysByValue(map, false);\n    }\n\n    public static <K, V extends Comparable<? super V>> ArrayList<K> orderKeysByValue(\n            Map<K, V> map, boolean reverse) {\n        ArrayList<Map.Entry<K, V>> keys = new ArrayList<Map.Entry<K, V>>();\n        keys.addAll(map.entrySet());\n        Collections.sort(keys, new ValueComparator<K, V>(reverse));\n        ArrayList<K> res = new ArrayList<K>();\n        for (int i = 0; i < keys.size(); i++) {\n            res.add(keys.get(i).getKey());\n        }\n        return res;\n    }\n\n    public static <K, V extends Comparable<? super V>> List<K> topKeysByValue(\n            Map<K, V> map, int n) {\n        ArrayList<K> keys = orderKeysByValue(map, true);\n        ArrayList<K> res = new ArrayList<K>(n);\n        for (int i = 0; i < n && i < keys.size(); i++) {\n            res.add(keys.get(i));\n        }\n        return res;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/AbstractInstance.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Collection;\nimport java.util.List;\nimport java.util.Map;\n\npublic abstract class AbstractInstance<T extends AbstractInstance<T>> {\n\n    protected static final String SEP = \"___\";\n    public String id;\n    public String supporting_data;\n    protected double weight;\n\n    StringKeyedVector vector;\n\n    public AbstractInstance() {\n        this(new StringKeyedVector(), 1.0);\n    }\n\n    public AbstractInstance(double weight) {\n        this(new StringKeyedVector(), weight);\n    }\n\n    public AbstractInstance(StringKeyedVector skv) {\n        this(skv, 1.0);\n    }\n\n    public AbstractInstance(StringKeyedVector skv, double weight) {\n        this.vector = skv;\n        this.weight = weight;\n    }\n\n    public AbstractInstance(Map<String, Double> map) {\n        this(map, 1.0);\n    }\n\n    public AbstractInstance(Map<String, Double> map, double weight) {\n        this.vector = new StringKeyedVector(map);\n        this.weight = weight;\n    }\n\n    @SuppressWarnings(\"unchecked\")\n    public T setWeight(double weight) {\n        this.weight = weight;\n        return (T)this;\n    }\n\n    public double getWeight() {\n        return weight;\n    }\n\n    public String getId() {\n        return id;\n    }\n\n    public StringKeyedVector getVector() {\n        return vector;\n    }\n\n    public void setSupportingData(String s) {\n        supporting_data = s;\n    }\n\n    public String getSupportingData() {\n        return supporting_data;\n    }\n\n    @SuppressWarnings(\"unchecked\")\n    public T setCoordinate(String id, double value) {\n        vector.setCoordinate(id, value);\n        return (T)this;\n    }\n\n    @SuppressWarnings(\"unchecked\")\n    public T addToCoordinate(String id, double value) {\n        vector.addToCoordinate(id, value);\n        return (T)this;\n    }\n\n    @SuppressWarnings(\"unchecked\")\n    public T setId(String id) {\n        this.id = id;\n  
      return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addTerm(java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTerm(String term) {\n        addTerm(term, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addTerm(java.lang.String,\n     * double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTerm(String term, double featureWeight) {\n        addToCoordinate(term, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermWithNamespace(java.\n     * lang.String, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermWithNamespace(String term, String namespace) {\n        addTermWithNamespace(term, namespace, 1);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermWithNamespace(java.\n     * lang.String, java.lang.String, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermWithNamespace(String term, String namespace,\n            double featureWeight) {\n        addToCoordinate(namespace + SEP + term, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTerms(java.util.Collection,\n     * double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTerms(Collection<String> terms, double featureWeight) {\n        for (String term : terms) {\n            addToCoordinate(term, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTerms(java.util.Collection)\n     */\n\n    
@SuppressWarnings(\"unchecked\")\n    public T addTerms(Collection<String> terms) {\n        addTerms(terms, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermsWithNamespace(java\n     * .util.Collection, java.lang.String, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermsWithNamespace(Collection<String> terms, String namespace,\n            double featureWeight) {\n        for (String term : terms) {\n            addTermWithNamespace(term, namespace, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermsWithNamespace(java\n     * .util.Collection, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermsWithNamespace(Collection<String> terms, String namespace) {\n        addTermsWithNamespace(terms, namespace, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTerms(java.lang.String[],\n     * double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTerms(String[] terms, double featureWeight) {\n        for (String term : terms) {\n            addToCoordinate(term, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTerms(java.lang.String[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTerms(String[] terms) {\n        addTerms(terms, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermsWithNamespace(java\n     * .lang.String[], java.lang.String, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermsWithNamespace(String[] terms, 
String namespace,\n            double featureWeight) {\n        for (String term : terms) {\n            addTermWithNamespace(term, namespace, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermsWithNamespace(java\n     * .lang.String[], java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermsWithNamespace(String[] terms, String namespace) {\n        addTermsWithNamespace(terms, namespace, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermsWithWeights(java.util\n     * .Map)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermsWithWeights(Map<String, Double> termsWithWeights) {\n        for (String term : termsWithWeights.keySet()) {\n            addTerm(term, termsWithWeights.get(term));\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addTermsWithWeightsWithNamespace\n     * (java.util.Map, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addTermsWithWeightsWithNamespace(\n            Map<String, Double> termsWithWeights, String namespace) {\n        for (String term : termsWithWeights.keySet()) {\n            addTermWithNamespace(term, namespace, termsWithWeights.get(term));\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addNumericArrayWithNamespace\n     * (double[], java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addNumericArrayWithNamespace(double[] array, String namespace) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(namespace + SEP + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    
/*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addNumericArray(double[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addNumericArray(double[] array) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(\"\" + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addNumericArrayWithNamespace\n     * (java.lang.Double[], java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addNumericArrayWithNamespace(Double[] array, String namespace) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(namespace + SEP + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addNumericArray(java.lang.\n     * Double[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addNumericArray(Double[] array) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(\"\" + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addNumericArrayWithNamespace\n     * (java.util.List, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addNumericArrayWithNamespace(List<Double> values, String namespace) {\n        for (int i = 0; i < values.size(); i++) {\n            addToCoordinate(namespace + SEP + i, values.get(i));\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addNumericArray(java.util.\n     * List)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addNumericArray(List<Double> values) {\n        for (int i = 0; i < values.size(); i++) {\n            
addToCoordinate(\"\" + i, values.get(i));\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setNumericArrayWithNamespace\n     * (double[], java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setNumericArrayWithNamespace(double[] array, String namespace) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(namespace + SEP + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setNumericArray(double[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setNumericArray(double[] array) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(\"\" + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setNumericArrayWithNamespace\n     * (java.lang.Double[], java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setNumericArrayWithNamespace(Double[] array, String namespace) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(namespace + SEP + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setNumericArray(java.lang.\n     * Double[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setNumericArray(Double[] array) {\n        for (int i = 0; i < array.length; i++) {\n            addToCoordinate(\"\" + i, array[i]);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setNumericArrayWithNamespace\n     * (java.util.List, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T 
setNumericArrayWithNamespace(List<Double> values, String namespace) {\n        for (int i = 0; i < values.size(); i++) {\n            addToCoordinate(namespace + SEP + i, values.get(i));\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setNumericArray(java.util.\n     * List)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setNumericArray(List<Double> values) {\n        for (int i = 0; i < values.size(); i++) {\n            addToCoordinate(\"\" + i, values.get(i));\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIdField(long, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdField(long id, double featureWeight) {\n        addToCoordinate(\"\" + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIdField(long)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdField(long id) {\n        addIdField(id, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdFieldWithNamespace(long,\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdFieldWithNamespace(long id, double featureWeight,\n            String namespace) {\n        addToCoordinate(namespace + SEP + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdFieldWithNamespace(long,\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdFieldWithNamespace(long id, String namespace) {\n        addIdFieldWithNamespace(id, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * 
(non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIdField(int, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdField(int id, double featureWeight) {\n        addToCoordinate(\"\" + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIdField(int)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdField(int id) {\n        addIdField(id, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdFieldWithNamespace(int,\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdFieldWithNamespace(int id, double featureWeight,\n            String namespace) {\n        addToCoordinate(namespace + SEP + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdFieldWithNamespace(int,\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdFieldWithNamespace(int id, String namespace) {\n        addIdFieldWithNamespace(id, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIds(long[], double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIds(long[] ids, double featureWeight) {\n        for (long id : ids) {\n            addToCoordinate(\"\" + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIds(long[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIds(long[] ids) {\n        addIds(ids, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see 
com.etsy.conjecture.data.InstanceInterface#addIds(int[], double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIds(int[] ids, double featureWeight) {\n        for (long id : ids) {\n            addToCoordinate(\"\" + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#addIds(int[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIds(int[] ids) {\n        addIds(ids, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIds(java.util.Collection,\n     * double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIds(Collection<Integer> ids, double featureWeight) {\n        for (long id : ids) {\n            addToCoordinate(\"\" + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIds(java.util.Collection)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIds(Collection<Integer> ids) {\n        addIds(ids, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdsWithNamespace(long[],\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdsWithNamespace(long[] ids, double featureWeight,\n            String namespace) {\n        for (long id : ids) {\n            addToCoordinate(namespace + SEP + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdsWithNamespace(long[],\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdsWithNamespace(long[] ids, String namespace) {\n        
addIdsWithNamespace(ids, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdsWithNamespace(int[],\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdsWithNamespace(int[] ids, double featureWeight,\n            String namespace) {\n        for (int id : ids) {\n            addToCoordinate(namespace + SEP + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdsWithNamespace(int[],\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdsWithNamespace(int[] ids, String namespace) {\n        addIdsWithNamespace(ids, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdsWithNamespace(java.util\n     * .Collection, double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdsWithNamespace(Collection<Long> ids, double featureWeight,\n            String namespace) {\n        for (Long id : ids) {\n            addToCoordinate(namespace + SEP + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#addIdsWithNamespace(java.util\n     * .Collection, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T addIdsWithNamespace(Collection<Long> ids, String namespace) {\n        addIdsWithNamespace(ids, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIdField(long, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdField(long id, double featureWeight) {\n        
addToCoordinate(\"\" + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIdField(long)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdField(long id) {\n        setIdField(id, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdFieldWithNamespace(long,\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdFieldWithNamespace(long id, double featureWeight,\n            String namespace) {\n        addToCoordinate(namespace + SEP + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdFieldWithNamespace(long,\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdFieldWithNamespace(long id, String namespace) {\n        setIdFieldWithNamespace(id, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIdField(int, double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdField(int id, double featureWeight) {\n        addToCoordinate(\"\" + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIdField(int)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdField(int id) {\n        setIdField(id, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdFieldWithNamespace(int,\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdFieldWithNamespace(int id, double featureWeight,\n            String namespace) 
{\n        addToCoordinate(namespace + SEP + id, featureWeight);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdFieldWithNamespace(int,\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdFieldWithNamespace(int id, String namespace) {\n        setIdFieldWithNamespace(id, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIds(long[], double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIds(long[] ids, double featureWeight) {\n        for (long id : ids) {\n            addToCoordinate(\"\" + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIds(long[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIds(long[] ids) {\n        setIds(ids, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIds(int[], double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIds(int[] ids, double featureWeight) {\n        for (long id : ids) {\n            addToCoordinate(\"\" + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see com.etsy.conjecture.data.InstanceInterface#setIds(int[])\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIds(int[] ids) {\n        setIds(ids, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIds(java.util.Collection,\n     * double)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIds(Collection<Integer> ids, double featureWeight) {\n        for (long id : ids) {\n            
addToCoordinate(\"\" + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIds(java.util.Collection)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIds(Collection<Integer> ids) {\n        setIds(ids, 1.);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdsWithNamespace(long[],\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdsWithNamespace(long[] ids, double featureWeight,\n            String namespace) {\n        for (long id : ids) {\n            addToCoordinate(namespace + SEP + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdsWithNamespace(long[],\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdsWithNamespace(long[] ids, String namespace) {\n        setIdsWithNamespace(ids, 1., namespace);\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdsWithNamespace(int[],\n     * double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdsWithNamespace(int[] ids, double featureWeight,\n            String namespace) {\n        for (int id : ids) {\n            addToCoordinate(namespace + SEP + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdsWithNamespace(int[],\n     * java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdsWithNamespace(int[] ids, String namespace) {\n        setIdsWithNamespace(ids, 1., namespace);\n        return (T)this;\n    
}\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdsWithNamespace(java.util\n     * .Collection, double, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdsWithNamespace(Collection<Long> ids, double featureWeight,\n            String namespace) {\n        for (Long id : ids) {\n            addToCoordinate(namespace + SEP + id, featureWeight);\n        }\n        return (T)this;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see\n     * com.etsy.conjecture.data.InstanceInterface#setIdsWithNamespace(java.util\n     * .Collection, java.lang.String)\n     */\n\n    @SuppressWarnings(\"unchecked\")\n    public T setIdsWithNamespace(Collection<Long> ids, String namespace) {\n        setIdsWithNamespace(ids, 1., namespace);\n        return (T)this;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/BinaryLabel.java",
    "content": "package com.etsy.conjecture.data;\n\nimport static com.google.common.base.Preconditions.checkArgument;\n\npublic class BinaryLabel extends RealValuedLabel {\n\n    private static final long serialVersionUID = 1L;\n\n    public BinaryLabel() {\n        super(0.0);\n    }\n\n    public BinaryLabel(double value) {\n        super(checkBinaryValue(value));\n\n    }\n\n    private static double checkBinaryValue(double value) {\n        checkArgument(value >= 0 && value <= 1,\n                \"value must be in [0, 1], given: %s\", value);\n        return value;\n    }\n\n    // {0,+1} -> {-1,+1}\n    public double getAsPlusMinus() {\n        return 2.0 * (getValue() - 0.5);\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/BinaryLabeledInstance.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Map;\n\n/**\n * TODO: when using method string all methods return a RealValueLabeledInstance\n * think about how to avoid this while not using generic types\n */\npublic class BinaryLabeledInstance extends\n        AbstractInstance<BinaryLabeledInstance> implements\n        LabeledInstance<BinaryLabel> {\n\n    protected BinaryLabel label;\n\n    public BinaryLabel getLabel() {\n        return label;\n    }\n\n    public BinaryLabeledInstance() {\n        this(new BinaryLabel(0.0), 1.0);\n    }\n\n    public BinaryLabeledInstance(double label, Map<String, Double> instance) {\n        this(new BinaryLabel(label), instance, 1.0);\n    }\n\n    public BinaryLabeledInstance(double label, Map<String, Double> instance,\n            double weight) {\n        this(new BinaryLabel(label), instance, weight);\n    }\n\n    public BinaryLabeledInstance(double label, StringKeyedVector vec) {\n        this(new BinaryLabel(label), vec.getMap(), 1.0);\n    }\n\n    public BinaryLabeledInstance(double label, StringKeyedVector vec,\n            double weight) {\n        this(new BinaryLabel(label), vec.getMap(), weight);\n    }\n\n    public BinaryLabeledInstance(BinaryLabel label, Map<String, Double> instance) {\n        this(label, instance, 1.0);\n    }\n\n    public BinaryLabeledInstance(BinaryLabel label,\n            Map<String, Double> instance, double weight) {\n        super(instance, weight);\n        this.label = label;\n    }\n\n    public BinaryLabeledInstance(BinaryLabel label, StringKeyedVector vec) {\n        this(label, vec.getMap(), 1.0);\n    }\n\n    public BinaryLabeledInstance(BinaryLabel label, StringKeyedVector vec,\n            double weight) {\n        this(label, vec.getMap(), weight);\n    }\n\n    public BinaryLabeledInstance(double label) {\n        this(new BinaryLabel(label), 1.0);\n    }\n\n    public BinaryLabeledInstance(double label, double weight) {\n        this(new 
BinaryLabel(label), weight);\n    }\n\n    public BinaryLabeledInstance(BinaryLabel label) {\n        this(label, 1.0);\n    }\n\n    public BinaryLabeledInstance(BinaryLabel label, double weight) {\n        super(weight);\n        this.label = label;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/ByteArrayDoubleHashMap.java",
    "content": "package com.etsy.conjecture.data;\n\nimport gnu.trove.function.TDoubleFunction;\nimport gnu.trove.iterator.TObjectDoubleIterator;\nimport gnu.trove.map.hash.TObjectDoubleHashMap;\n\nimport java.io.IOException;\nimport java.io.ObjectInputStream;\nimport java.io.ObjectOutputStream;\nimport java.io.Serializable;\nimport java.io.UnsupportedEncodingException;\nimport java.util.AbstractMap;\nimport java.util.Arrays;\nimport java.util.HashSet;\nimport java.util.Iterator;\nimport java.util.Map;\nimport java.util.Set;\n\nimport com.esotericsoftware.kryo.Kryo;\nimport com.esotericsoftware.kryo.KryoSerializable;\nimport com.esotericsoftware.kryo.io.Input;\nimport com.esotericsoftware.kryo.io.Output;\n\npublic class ByteArrayDoubleHashMap implements Serializable, KryoSerializable,\n        Iterable<Map.Entry<String, Double>>, Map<String, Double> {\n\n    private static final long serialVersionUID = -7070522686694887436L;\n\n    // - represent the sparse map by a mapping of coordinate name strings\n    // (feature names)\n    // to doubles.\n    protected TObjectDoubleHashMap<byte[]> map;\n\n    protected String keyEncoding;\n    protected float loadFactor;\n    protected double defaultValue;\n\n    public ByteArrayDoubleHashMap() {\n        this(10, 0.8f, 0.0);\n    }\n\n    public ByteArrayDoubleHashMap(int initialCapacity, float loadFactor,\n            double defaultValue) {\n        this(initialCapacity, loadFactor, \"ASCII\", defaultValue);\n    }\n\n    public ByteArrayDoubleHashMap(int initialCapacity, float loadFactor,\n            String keyEncoding, double defaultValue) {\n        this.map = new TByteArrayDoubleHashMap(initialCapacity, loadFactor,\n                defaultValue);\n        this.keyEncoding = keyEncoding;\n        this.loadFactor = loadFactor;\n        this.defaultValue = defaultValue;\n    }\n\n    public String byteArrayToString(byte[] b) {\n        try {\n            return new String(b, keyEncoding);\n        } catch 
(UnsupportedEncodingException e) {\n            e.printStackTrace();\n            return null;\n        }\n    }\n\n    public byte[] stringToByteArray(String s) {\n        try {\n            return s.getBytes(keyEncoding);\n        } catch (UnsupportedEncodingException e) {\n            e.printStackTrace();\n            return null;\n        }\n    }\n\n    /**\n     * Customized trove hashmap which does both: customized hash/equality\n     * functions, and also storing the values as a primitive array.\n     */\n    static class TByteArrayDoubleHashMap extends TObjectDoubleHashMap<byte[]> {\n        public TByteArrayDoubleHashMap(int initialSize, float loadFactor,\n                double defaultValue) {\n            super(initialSize, loadFactor, defaultValue);\n        }\n\n        protected int hash(Object obj) {\n            return Arrays.hashCode((byte[])obj);\n        }\n\n        protected boolean equals(Object a, Object b) {\n            return b != null && b != REMOVED\n                    && Arrays.equals((byte[])a, (byte[])b);\n        }\n\n        // - override this to prevent doubling on resize.\n        public double put(byte[] key, double value) {\n            int index = insertKey(key);\n            double previous = 0.0;\n            boolean isNewMapping = true;\n            if (index < 0) {\n                index = -index - 1;\n                previous = _values[index];\n                isNewMapping = false;\n            }\n            _values[index] = value;\n            if (isNewMapping) {\n                postInsertHook2(consumeFreeSlot);\n            }\n\n            return previous;\n        }\n\n        protected final void postInsertHook2(boolean usedFreeSlot) {\n            if (usedFreeSlot) {\n                _free--;\n            }\n\n            if (++_size > _maxSize || _free == 0) {\n                int newCapacity = _size > _maxSize ? 
gnu.trove.impl.PrimeFinder\n                        .nextPrime((int)(capacity() * 1.2) + 10) : capacity();\n                if (newCapacity > 1000000) {\n                    System.out.println(\"rehashing to size: \" + newCapacity\n                            + \" from \" + capacity());\n                }\n                rehash(newCapacity);\n                computeMaxSize(capacity());\n            }\n        }\n    }\n\n    public int size() {\n        return map.size();\n    }\n\n    public boolean containsKey(Object key) {\n        if (key instanceof byte[]) {\n            return map.containsKey(key);\n        } else if (key instanceof String) {\n            return map.containsKey(stringToByteArray((String)key));\n        } else {\n            throw new IllegalArgumentException(\"class \"\n                    + key.getClass().toString()\n                    + \" is not valid for ByteArrayDoubleHashMap.containsKey\");\n        }\n    }\n\n    public Set<String> keySet() {\n        Set<String> res = new HashSet<String>();\n        for (byte[] b : map.keySet()) {\n            res.add(byteArrayToString(b));\n        }\n        return res;\n    }\n\n    public Set<Double> values() {\n        Set<Double> values = new HashSet<Double>();\n        for (Map.Entry<String, Double> e : this) {\n            values.add(e.getValue());\n        }\n        return values;\n    }\n\n    public boolean containsValue(Object d) {\n        return values().contains((Double)d);\n    }\n\n    public Set<Map.Entry<String, Double>> entrySet() {\n        Set<Map.Entry<String, Double>> entries = new HashSet<Map.Entry<String, Double>>();\n        for (Map.Entry<String, Double> e : this) {\n            entries.add(e);\n        }\n        return entries;\n    }\n\n    public boolean isEmpty() {\n        return size() == 0;\n    }\n\n    public void clear() {\n        map.clear();\n    }\n\n    public Double remove(Object k) {\n        return removePrimitive((String)k);\n    }\n\n    public 
Double get(Object k) {\n        return getPrimitive((String)k);\n    }\n\n    public Double put(String key, Double value) {\n        return putPrimitive(key, value);\n    }\n\n    public void putAll(Map<? extends String, ? extends Double> m) {\n        for (Map.Entry<? extends String, ? extends Double> e : m.entrySet()) {\n            put((String)e.getKey(), (Double)e.getValue());\n        }\n    }\n\n    public double getPrimitive(byte[] key) {\n        return map.get(key);\n    }\n\n    public double getPrimitive(String key) {\n        return map.get(stringToByteArray(key));\n    }\n\n    public double putPrimitive(byte[] key, double value) {\n        return map.put(key, value);\n    }\n\n    public double putPrimitive(String key, double value) {\n        return map.put(stringToByteArray(key), value);\n    }\n\n    public double removePrimitive(byte[] key) {\n        return map.remove(key);\n    }\n\n    public double removePrimitive(String key) {\n        return map.remove(stringToByteArray(key));\n    }\n\n    public void transformValues(TDoubleFunction func) {\n        map.transformValues(func);\n    }\n\n    public TObjectDoubleIterator<byte[]> troveIterator() {\n        return map.iterator();\n    }\n\n    public Iterator<Map.Entry<String, Double>> iterator() {\n        return new Iterator<Map.Entry<String, Double>>() {\n            private TObjectDoubleIterator<byte[]> iter = troveIterator();\n\n            public boolean hasNext() {\n                return iter.hasNext();\n            }\n\n            public void remove() {\n                iter.remove();\n            }\n\n            public Map.Entry<String, Double> next() {\n                iter.advance();\n                return new AbstractMap.SimpleImmutableEntry<String, Double>(\n                        byteArrayToString(iter.key()), iter.value());\n            }\n        };\n    }\n\n    // - java serialization\n    private void writeObject(ObjectOutputStream output) throws IOException {\n        
output.writeObject(keyEncoding);\n        output.writeFloat(loadFactor);\n        output.writeDouble(defaultValue);\n        output.writeInt(map.size());\n        for (TObjectDoubleIterator<byte[]> it = map.iterator(); it.hasNext();) {\n            it.advance();\n            byte[] key = it.key();\n            output.writeInt(key.length);\n            for (int i = 0; i < key.length; i++) {\n                output.writeByte(key[i]);\n            }\n            output.writeDouble(it.value());\n        }\n    }\n\n    private void readObject(ObjectInputStream input) throws IOException,\n            ClassNotFoundException {\n        keyEncoding = (String)input.readObject();\n        loadFactor = input.readFloat();\n        defaultValue = input.readDouble();\n        int size = input.readInt();\n        map = new TByteArrayDoubleHashMap(size, loadFactor, defaultValue);\n        for (int i = 0; i < size; i++) {\n            int length = input.readInt();\n            byte[] key = new byte[length];\n            for (int j = 0; j < length; j++) {\n                key[j] = input.readByte();\n            }\n            double value = input.readDouble();\n            map.put(key, value);\n        }\n    }\n\n    // - kryo serialization for use in scalding.\n    public void write(Kryo kryo, Output output) {\n        output.writeString(keyEncoding);\n        output.writeFloat(loadFactor);\n        output.writeDouble(defaultValue);\n        output.writeInt(map.size());\n        for (TObjectDoubleIterator<byte[]> it = map.iterator(); it.hasNext();) {\n            it.advance();\n            byte[] key = it.key();\n            output.writeInt(key.length);\n            for (int i = 0; i < key.length; i++) {\n                output.writeByte(key[i]);\n            }\n            output.writeDouble(it.value());\n        }\n    }\n\n    public void read(Kryo kryo, Input input) {\n        keyEncoding = input.readString();\n        loadFactor = input.readFloat();\n        defaultValue = 
input.readDouble();\n        int size = input.readInt();\n        map = new TByteArrayDoubleHashMap(size, loadFactor, defaultValue);\n        for (int i = 0; i < size; i++) {\n            int length = input.readInt();\n            byte[] key = new byte[length];\n            for (int j = 0; j < length; j++) {\n                key[j] = input.readByte();\n            }\n            double value = input.readDouble();\n            map.put(key, value);\n        }\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/ClusterLabel.java",
    "content": "package com.etsy.conjecture.data;\n\npublic class ClusterLabel extends Label{\n\n    private static final long serialVersionUID = 1L;\n\n    protected String label;\n\n    public ClusterLabel() {\n        this(null);\n    }\n\n    public ClusterLabel(String label) {\n        this.label = label;\n    }\n\n    public String getLabel() {\n        return this.label;\n    }\n\n    public void setLabel(String label) {\n        this.label = label;\n    }\n\n    public String toString() {\n        return label;\n    }\n\n    @Override\n    public int hashCode() {\n        final int prime = 31;\n        int result = 1;\n        result = prime * result + ((label == null) ? 0 : label.hashCode());\n        return result;\n    }\n\n    @Override\n    public boolean equals(Object obj) {\n        if (this == obj)\n            return true;\n        if (obj == null)\n            return false;\n        if (getClass() != obj.getClass())\n            return false;\n        ClusterLabel other = (ClusterLabel) obj;\n        if (label == null) {\n            if (other.label != null)\n                return false;\n        } else if (!label.equals(other.label))\n            return false;\n        return true;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/ClusterPrediction.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Map;\nimport com.google.common.collect.Maps;\n\n/**\n * Representing a probability of membership in each cluster\n */\npublic class ClusterPrediction extends ClusterLabel{\n\n    private static final long serialVersionUID = -1L;\n\n    /**\n     * Cluster membership probabilities\n     */\n    private Map<String, Double> clusterProbs;\n\n    public ClusterPrediction(Map<String, Double> clusterProbs) {\n        this.clusterProbs = Maps.newHashMap(clusterProbs);\n        boolean first = true;\n        double maxProb = 0;\n        String maxCategory = null;\n        for (String key : clusterProbs.keySet()) {\n            if(first || clusterProbs.get(key) > maxProb) {\n              maxProb = clusterProbs.get(key);\n              maxCategory = key;\n              first = false;\n            }\n        }\n        setLabel(maxCategory);\n    }\n\n    public Map<String,Double> getMap() {\n        return clusterProbs;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/Instance.java",
    "content": "package com.etsy.conjecture.data;\n\n\n//TODO: reset methods for string adders\n//TODO: for instance, vector subtraction?\npublic class Instance extends AbstractInstance<Instance> {\n\n    public Instance() {\n        super();\n    }\n\n    public Instance(StringKeyedVector vec) {\n        super(vec);\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/InstanceFactory.java",
    "content": "package com.etsy.conjecture.data;\n\npublic class InstanceFactory {\n\n    private InstanceFactory() {\n    };\n\n    public static Instance buildInstance() {\n        return new Instance();\n    }\n\n    public static Instance copyInstance(Instance inst) {\n        return new Instance(inst.getVector());\n    }\n\n    public static BinaryLabeledInstance toBinaryLabeledInstance(double label,\n            Instance instance) {\n        return new BinaryLabeledInstance(label, instance.getVector());\n    }\n\n    public static BinaryLabeledInstance toBinaryLabeledInstance(\n            BinaryLabel label, Instance instance) {\n        return new BinaryLabeledInstance(label, instance.getVector());\n    }\n\n    public static RealValueLabeledInstance toRealValueLabeledInstance(\n            double label, Instance instance) {\n        return new RealValueLabeledInstance(label, instance.getVector());\n    }\n\n    public static RealValueLabeledInstance toRealValueLabeledInstance(\n            RealValuedLabel label, Instance instance) {\n        return new RealValueLabeledInstance(label, instance.getVector());\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/InstanceInterface.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Collection;\nimport java.util.List;\nimport java.util.Map;\n\npublic interface InstanceInterface<T extends InstanceInterface<T>> {\n\n    public abstract String getId();\n\n    public abstract T setId(String id);\n\n    public abstract T addTerm(String term);\n\n    public abstract T addTerm(String term, double featureWeight);\n\n    public abstract T addTermWithNamespace(String term, String namespace);\n\n    public abstract T addTermWithNamespace(String term, String namespace,\n            double featureWeight);\n\n    public abstract T addTerms(Collection<String> terms, double featureWeight);\n\n    public abstract T addTerms(Collection<String> terms);\n\n    public abstract T addTermsWithNamespace(Collection<String> terms,\n            String namespace, double featureWeight);\n\n    public abstract T addTermsWithNamespace(Collection<String> terms,\n            String namespace);\n\n    public abstract T addTerms(String[] terms, double featureWeight);\n\n    public abstract T addTerms(String[] terms);\n\n    public abstract T addTermsWithNamespace(String[] terms, String namespace,\n            double featureWeight);\n\n    public abstract T addTermsWithNamespace(String[] terms, String namespace);\n\n    public abstract T addTermsWithWeights(Map<String, Double> termsWithWeights);\n\n    public abstract T addTermsWithWeightsWithNamespace(\n            Map<String, Double> termsWithWeights, String namespace);\n\n    public abstract T addNumericArrayWithNamespace(double[] array,\n            String namespace);\n\n    public abstract T addNumericArray(double[] array);\n\n    public abstract T addNumericArrayWithNamespace(Double[] array,\n            String namespace);\n\n    public abstract T addNumericArray(Double[] array);\n\n    public abstract T addNumericArrayWithNamespace(List<Double> values,\n            String namespace);\n\n    public abstract T addNumericArray(List<Double> values);\n\n    
public abstract T setNumericArrayWithNamespace(double[] array,\n            String namespace);\n\n    public abstract T setNumericArray(double[] array);\n\n    public abstract T setNumericArrayWithNamespace(Double[] array,\n            String namespace);\n\n    public abstract T setNumericArray(Double[] array);\n\n    public abstract T setNumericArrayWithNamespace(List<Double> values,\n            String namespace);\n\n    public abstract T setNumericArray(List<Double> values);\n\n    public abstract T addIdField(long id, double featureWeight);\n\n    public abstract T addIdField(long id);\n\n    public abstract T addIdFieldWithNamespace(long id, double featureWeight,\n            String namespace);\n\n    public abstract T addIdFieldWithNamespace(long id, String namespace);\n\n    public abstract T addIdField(int id, double featureWeight);\n\n    public abstract T addIdField(int id);\n\n    public abstract T addIdFieldWithNamespace(int id, double featureWeight,\n            String namespace);\n\n    public abstract T addIdFieldWithNamespace(int id, String namespace);\n\n    public abstract T addIds(long[] ids, double featureWeight);\n\n    public abstract T addIds(long[] ids);\n\n    public abstract T addIds(int[] ids, double featureWeight);\n\n    public abstract T addIds(int[] ids);\n\n    public abstract T addIds(Collection<Integer> ids, double featureWeight);\n\n    public abstract T addIds(Collection<Integer> ids);\n\n    public abstract T addIdsWithNamespace(long[] ids, double featureWeight,\n            String namespace);\n\n    public abstract T addIdsWithNamespace(long[] ids, String namespace);\n\n    public abstract T addIdsWithNamespace(int[] ids, double featureWeight,\n            String namespace);\n\n    public abstract T addIdsWithNamespace(int[] ids, String namespace);\n\n    public abstract T addIdsWithNamespace(Collection<Long> ids,\n            double featureWeight, String namespace);\n\n    public abstract T addIdsWithNamespace(Collection<Long> 
ids, String namespace);\n\n    public abstract T setIdField(long id, double featureWeight);\n\n    public abstract T setIdField(long id);\n\n    public abstract T setIdFieldWithNamespace(long id, double featureWeight,\n            String namespace);\n\n    public abstract T setIdFieldWithNamespace(long id, String namespace);\n\n    public abstract T setIdField(int id, double featureWeight);\n\n    public abstract T setIdField(int id);\n\n    public abstract T setIdFieldWithNamespace(int id, double featureWeight,\n            String namespace);\n\n    public abstract T setIdFieldWithNamespace(int id, String namespace);\n\n    public abstract T setIds(long[] ids, double featureWeight);\n\n    public abstract T setIds(long[] ids);\n\n    public abstract T setIds(int[] ids, double featureWeight);\n\n    public abstract T setIds(int[] ids);\n\n    public abstract T setIds(Collection<Integer> ids, double featureWeight);\n\n    public abstract T setIds(Collection<Integer> ids);\n\n    public abstract T setIdsWithNamespace(long[] ids, double featureWeight,\n            String namespace);\n\n    public abstract T setIdsWithNamespace(long[] ids, String namespace);\n\n    public abstract T setIdsWithNamespace(int[] ids, double featureWeight,\n            String namespace);\n\n    public abstract T setIdsWithNamespace(int[] ids, String namespace);\n\n    public abstract T setIdsWithNamespace(Collection<Long> ids,\n            double featureWeight, String namespace);\n\n    public abstract T setIdsWithNamespace(Collection<Long> ids, String namespace);\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/Label.java",
    "content": "package com.etsy.conjecture.data;\n\npublic class Label implements java.io.Serializable {\n\n    private static final long serialVersionUID = 1L;\n\n    public Label() {\n\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/LabeledInstance.java",
    "content": "package com.etsy.conjecture.data;\n\npublic interface LabeledInstance<L extends Label> {\n    public L getLabel();\n\n    public StringKeyedVector getVector();\n\n    public double getWeight();\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/LazyVector.java",
    "content": "package com.etsy.conjecture.data;\n\nimport gnu.trove.function.TDoubleFunction;\nimport gnu.trove.iterator.TObjectDoubleIterator;\n\nimport java.io.IOException;\nimport java.io.ObjectInputStream;\nimport java.io.ObjectOutputStream;\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.Iterator;\nimport java.util.Map;\nimport java.util.Set;\n\nimport com.esotericsoftware.kryo.Kryo;\nimport com.esotericsoftware.kryo.KryoSerializable;\nimport com.esotericsoftware.kryo.io.Input;\nimport com.esotericsoftware.kryo.io.Output;\nimport com.etsy.conjecture.Utilities;\n\npublic class LazyVector extends StringKeyedVector implements Serializable,\n        KryoSerializable {\n\n    private static final long serialVersionUID = -7070522686694887436L;\n\n    protected transient ByteArrayDoubleHashMap iterations;\n\n    protected long iteration = 0;\n\n    protected UpdateFunction updater;\n\n    /**\n     * The function used to update the parameters during the lazy update\n     */\n    public static interface UpdateFunction extends Serializable {\n        public double lazyUpdate(String key, double param, long startIteration,\n                long endIteration);\n    }\n\n    public LazyVector() {\n        this(new UpdateFunction() {\n            private static final long serialVersionUID = 1740773207106961880L;\n\n            public double lazyUpdate(String key, double p, long a, long b) {\n                return p;\n            }\n        });\n    }\n\n    public LazyVector(UpdateFunction uf) {\n        this(10, uf);\n    }\n\n    public LazyVector(int initialCapacity, UpdateFunction uf) {\n        super(initialCapacity);\n        iterations = new ByteArrayDoubleHashMap(initialCapacity, LOAD_FACTOR,\n                FEATURE_ENCODING, 0.0);\n        updater = uf;\n    }\n\n    public LazyVector(StringKeyedVector skv, UpdateFunction uf) {\n        if (skv instanceof LazyVector) {\n            ((LazyVector)skv).delazify();\n        }\n        
this.vector = skv.vector;\n        iterations = new ByteArrayDoubleHashMap(skv.size(), LOAD_FACTOR,\n                FEATURE_ENCODING, 0.0);\n        updater = uf;\n    }\n\n    public LazyVector(ByteArrayDoubleHashMap map, UpdateFunction uf) {\n        super(map);\n        iterations = new ByteArrayDoubleHashMap(10, LOAD_FACTOR,\n                FEATURE_ENCODING, 0.0);\n        updater = uf;\n    }\n\n    public LazyVector(Map<String, Double> jmap, UpdateFunction uf) {\n        super(jmap);\n        iterations = new ByteArrayDoubleHashMap(10, LOAD_FACTOR,\n                FEATURE_ENCODING, 0.0);\n        updater = uf;\n    }\n\n    public void incrementIteration() {\n        iteration++;\n    }\n\n    public void delazify() {\n        for (TObjectDoubleIterator<byte[]> it = vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            long startIter = (long)iterations.getPrimitive(it.key()); // defaults\n                                                                      // to 0.0\n            if (startIter < iteration) {\n                // decode the key: byte[].toString() is not the feature name\n                it.setValue(updater.lazyUpdate(vector.byteArrayToString(it.key()), it.value(), startIter, iteration));\n                iterations.putPrimitive(it.key(), (double)iteration);\n            }\n        }\n        removeZeroCoordinates();\n    }\n\n    public double delazifyCoordinate(String key) {\n        return delazifyCoordinate(vector.stringToByteArray(key));\n    }\n\n    public double delazifyCoordinate(byte[] key) {\n        if (vector.containsKey(key)) {\n            long oldIteration = (long)iterations.getPrimitive(key);\n            double initial = vector.getPrimitive(key);\n            if (oldIteration < iteration) {\n                // decode the key: byte[].toString() is not the feature name\n                double updated = updater.lazyUpdate(vector.byteArrayToString(key), initial, oldIteration,\n                        iteration);\n                if (Utilities.floatingPointEquals(updated, 0.0d)) {\n                    vector.removePrimitive(key);\n                    
iterations.removePrimitive(key);\n                } else {\n                    iterations.putPrimitive(key, (double)iteration);\n                    vector.putPrimitive(key, updated);\n                }\n                return updated;\n            } else {\n                return initial;\n            }\n        }\n        return 0.0;\n    }\n\n    public void skipToIteration(long iter) {\n        delazify();\n        iteration = iter;\n        for (TObjectDoubleIterator<byte[]> it = iterations.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            it.setValue((double)iter);\n        }\n    }\n\n    /**\n     * disregards prior value at a particular key, replacing with the specified\n     * value.\n     */\n    public double setCoordinate(String key, double value) {\n        if (Utilities.floatingPointEquals(value, 0d)) {\n            return deleteCoordinate(key);\n        } else if (!freezeKeySet) {\n            vector.putPrimitive(key, value);\n            iterations.putPrimitive(key, (double)iteration);\n        }\n        return 0d;\n    }\n\n    /**\n     * remove a coordinate from the vector (same as setting it to 0).\n     */\n    public double deleteCoordinate(String key) {\n        if (vector.containsKey(key) && !freezeKeySet) {\n            iterations.removePrimitive(key);\n            return vector.removePrimitive(key);\n        } else {\n            return 0d;\n        }\n    }\n\n    public Map<String, Double> getMap() {\n        return vector;\n    }\n\n    protected double addToCoordinateInternal(byte[] bkey, double value) {\n        delazifyCoordinate(bkey);\n        if (vector.containsKey(bkey)) {\n            double updated = vector.getPrimitive(bkey) + value;\n            if (Utilities.floatingPointEquals(updated, 0.0d)) {\n                iterations.removePrimitive(bkey);\n                return vector.removePrimitive(bkey);\n            } else {\n                iterations.putPrimitive(bkey, 
(double)iteration);\n                return vector.putPrimitive(bkey, updated);\n            }\n        } else if (!freezeKeySet && !Utilities.floatingPointEquals(value, 0.0d)) {\n            vector.putPrimitive(bkey, value);\n            iterations.putPrimitive(bkey, (double)iteration);\n        }\n        return 0d;\n    }\n\n    /**\n     * return the value of a coordinate.\n     */\n    public double getCoordinate(String key) {\n        delazifyCoordinate(key);\n        return vector.getPrimitive(key);\n    }\n\n    /**\n     * the dimension of the vector.\n     */\n    public int size() {\n        delazify();\n        return vector.size();\n    }\n\n    /**\n     * whether this vector has a non-zero value for a coordinate.\n     */\n    public boolean containsKey(String key) {\n        delazify();\n        return vector.containsKey(key);\n    }\n\n    /**\n     * whether this vector has a non-zero value for a coordinate.\n     */\n    public boolean contains(String key) {\n        return containsKey(key);\n    }\n\n    /**\n     * the set of non-zero coordinate names.\n     */\n    public Set<String> keySet() {\n        delazify();\n        return vector.keySet();\n    }\n\n    /**\n     * the set of values in the map.\n     */\n    public Set<Double> values() {\n        delazify();\n        return vector.values();\n    }\n\n    /**\n     * Apply an arbitrary scalar function to the values.\n     */\n    public void transformValues(TDoubleFunction func) {\n        delazify();\n        vector.transformValues(func);\n    }\n\n    /**\n     * Remove zeros that may have appeared as a result of a transform\n     */\n    public void removeZeroCoordinates() {\n        for (TObjectDoubleIterator<byte[]> it = vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            if (Utilities.floatingPointEquals(it.value(), 0d)) {\n                iterations.removePrimitive(it.key());\n                it.remove();\n            }\n        }\n  
  }\n\n    /**\n     * compute the inner product between this and vec.\n     */\n    public double dot(StringKeyedVector skv) {\n        if (skv instanceof LazyVector) {\n            return dotWithLazy((LazyVector)skv);\n        } else {\n            return dotWithSKV(skv);\n        }\n    }\n\n    protected double dotWithSKV(StringKeyedVector vec) {\n        // dont figure out which ones bigger etc, since delazifying this to get\n        // the size is too slow.\n        double res = 0.0;\n        for (TObjectDoubleIterator<byte[]> it = vec.vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            res += it.value() * delazifyCoordinate(it.key());\n        }\n        return res;\n    }\n\n    protected double dotWithLazy(LazyVector vec) {\n        ByteArrayDoubleHashMap vec_small = this.size() > vec.size() ? vec.vector\n                : this.vector;\n        ByteArrayDoubleHashMap vec_big = this.size() > vec.size() ? this.vector\n                : vec.vector;\n        ArrayList<byte[]> commonCoordinates = new ArrayList<byte[]>(); // prevent\n                                                                       // modification\n                                                                       // during\n                                                                       // iteration.\n        double res = 0.0;\n        for (TObjectDoubleIterator<byte[]> it = vec_small.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            if (vec_big.containsKey(it.key())) {\n                commonCoordinates.add(it.key());\n            }\n        }\n        for (byte[] key : commonCoordinates) {\n            delazifyCoordinate(key);\n            vec.delazifyCoordinate(key);\n            res += vec_small.getPrimitive(key) * vec_big.getPrimitive(key);\n        }\n        return res;\n    }\n\n    /**\n     * compute the LP norm for given p < infinity.\n     */\n    public double LPNorm(double p) 
{\n        delazify();\n        return super.LPNorm(p);\n    }\n\n    /**\n     * immutable access the underlying hash map.\n     */\n    public Iterator<Map.Entry<String, Double>> iterator() {\n        delazify();\n        return vector.iterator();\n    }\n\n    public String toString() {\n        delazify();\n        return super.toString();\n    }\n\n    private Object writeReplace() throws java.io.ObjectStreamException {\n        delazify();\n        return this;\n    }\n\n    // - java serialization\n    private void writeObject(ObjectOutputStream output) throws IOException {\n        output.writeLong(iteration);\n        output.writeObject(vector);\n        output.writeObject(updater);\n        output.writeBoolean(freezeKeySet);\n    }\n\n    private void readObject(ObjectInputStream input) throws IOException,\n            ClassNotFoundException {\n        iteration = input.readLong();\n        vector = (ByteArrayDoubleHashMap)input.readObject();\n        updater = (UpdateFunction)input.readObject();\n        freezeKeySet = input.readBoolean();\n        // set up iteration info,\n        iterations = new ByteArrayDoubleHashMap(10, LOAD_FACTOR,\n                (double)iteration);\n    }\n\n    // - kryo serialization for use in scalding.\n    public void write(Kryo kryo, Output output) {\n        delazify();\n        output.writeLong(iteration);\n        kryo.writeObject(output, vector);\n        kryo.writeClassAndObject(output, updater);\n        output.writeBoolean(freezeKeySet);\n    }\n\n    public void read(Kryo kryo, Input input) {\n        iteration = input.readLong();\n        vector = kryo.readObject(input, ByteArrayDoubleHashMap.class);\n        updater = (UpdateFunction)kryo.readClassAndObject(input);\n        freezeKeySet = input.readBoolean();\n        // set up iteration info,\n        iterations = new ByteArrayDoubleHashMap(10, LOAD_FACTOR,\n                (double)iteration);\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/MulticlassLabel.java",
    "content": "package com.etsy.conjecture.data;\n\n/**\n * representing a 100% probability of membership in a particular class\n */\npublic class MulticlassLabel extends Label {\n\n    private static final long serialVersionUID = 1L;\n\n    protected String label;\n\n    public MulticlassLabel() {\n        this(null);\n    }\n\n    public MulticlassLabel(String label) {\n        this.label = label;\n    }\n\n    public String getLabel() {\n        return this.label;\n    }\n\n    public void setLabel(String label) {\n        this.label = label;\n    }\n\n    public String toString() {\n        return label;\n    }\n\n    public BinaryLabel toBinaryLabel(String className) {\n        return new BinaryLabel(className.equals(label) ? 1.0 : 0.0);\n    }\n\n    @Override\n    public int hashCode() {\n        final int prime = 31;\n        int result = 1;\n        result = prime * result + ((label == null) ? 0 : label.hashCode());\n        return result;\n    }\n\n    @Override\n    public boolean equals(Object obj) {\n        if (this == obj)\n            return true;\n        if (obj == null)\n            return false;\n        if (getClass() != obj.getClass())\n            return false;\n        MulticlassLabel other = (MulticlassLabel)obj;\n        if (label == null) {\n            if (other.label != null)\n                return false;\n        } else if (!label.equals(other.label))\n            return false;\n        return true;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/MulticlassLabeledInstance.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Map;\n\npublic class MulticlassLabeledInstance extends\n        AbstractInstance<MulticlassLabeledInstance> implements\n        LabeledInstance<MulticlassLabel> {\n\n    protected MulticlassLabel label;\n\n    public MulticlassLabel getLabel() {\n        return label;\n    }\n\n    public MulticlassLabeledInstance(String label) {\n        this(new MulticlassLabel(label), 1.0);\n    }\n\n    public MulticlassLabeledInstance(String label, double weight) {\n        this(new MulticlassLabel(label), weight);\n    }\n\n    public MulticlassLabeledInstance(String label, Map<String, Double> instance) {\n        this(new MulticlassLabel(label), instance, 1.0);\n    }\n\n    public MulticlassLabeledInstance(String label,\n            Map<String, Double> instance, double weight) {\n        this(new MulticlassLabel(label), instance, weight);\n    }\n\n    public MulticlassLabeledInstance(String label, StringKeyedVector vec) {\n        this(new MulticlassLabel(label), vec.getMap(), 1.0);\n    }\n\n    public MulticlassLabeledInstance(String label, StringKeyedVector vec,\n            double weight) {\n        this(new MulticlassLabel(label), vec.getMap(), weight);\n    }\n\n    public MulticlassLabeledInstance(MulticlassLabel label) {\n        this(label, 1.0);\n    }\n\n    public MulticlassLabeledInstance(MulticlassLabel label, double weight) {\n        super(weight);\n        this.label = label;\n    }\n\n    public MulticlassLabeledInstance(MulticlassLabel label,\n            Map<String, Double> instance) {\n        this(label, instance, 1.0);\n    }\n\n    public MulticlassLabeledInstance(MulticlassLabel label,\n            Map<String, Double> instance, double weight) {\n        super(instance, weight);\n        this.label = label;\n    }\n\n    public MulticlassLabeledInstance(MulticlassLabel label,\n            StringKeyedVector vec) {\n        this(label, vec.getMap(), 1.0);\n    }\n\n    public 
MulticlassLabeledInstance(MulticlassLabel label,\n            StringKeyedVector vec, double weight) {\n        this(label, vec.getMap(), weight);\n    }\n\n    public BinaryLabeledInstance toBinaryInstance(String category) {\n        double tmpLabel = 0d;\n        if (category.equals(this.label.getLabel())) {\n            tmpLabel = 1d;\n        }\n        return new BinaryLabeledInstance(tmpLabel, getVector());\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/MulticlassPrediction.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Map;\nimport com.google.common.collect.Maps;\n\n/**\n * representing a probability of membership in each class\n */\npublic class MulticlassPrediction extends MulticlassLabel {\n\n    private static final long serialVersionUID = -1L;\n\n    /**\n     * class membership probabilities\n     */\n    private Map<String, Double> classProbs;\n\n    public MulticlassPrediction(Map<String, Double> classProbs) {\n        this.classProbs = Maps.newHashMap(classProbs);\n        boolean first = true;\n        double maxProb = 0;\n        String maxCategory = null;\n        for (String key : classProbs.keySet()) {\n            if (first || classProbs.get(key) > maxProb) {\n                maxProb = classProbs.get(key);\n                maxCategory = key;\n                first = false;\n            }\n        }\n        setLabel(maxCategory);\n    }\n\n    public Double getProb(String category) {\n        return classProbs.get(category);\n    }\n\n    public Double getProbOrElse(String category, Double def) {\n        if (classProbs.containsKey(category)) {\n            return classProbs.get(category);\n        } else {\n            return def;\n        }\n    }\n\n    public Map<String, Double> getMap() {\n        return classProbs;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/RealValueLabeledInstance.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.util.Map;\n\npublic class RealValueLabeledInstance extends\n        AbstractInstance<RealValueLabeledInstance> implements\n        LabeledInstance<RealValuedLabel> {\n\n    private final RealValuedLabel label;\n\n    public RealValuedLabel getLabel() {\n        return label;\n    }\n\n    public RealValueLabeledInstance() {\n        this(0.0);\n    }\n\n    public RealValueLabeledInstance(RealValuedLabel label) {\n        this(label, 1.0);\n    }\n\n    public RealValueLabeledInstance(RealValuedLabel label, double weight) {\n        super(weight);\n        this.label = label;\n    }\n\n    public RealValueLabeledInstance(double label) {\n        this(new RealValuedLabel(label), 1.0);\n    }\n\n    public RealValueLabeledInstance(double label, double weight) {\n        this(new RealValuedLabel(label), weight);\n    }\n\n    public RealValueLabeledInstance(double label, Map<String, Double> instance) {\n        this(new RealValuedLabel(label), instance, 1.0);\n    }\n\n    public RealValueLabeledInstance(double label, Map<String, Double> instance,\n            double weight) {\n        this(new RealValuedLabel(label), instance, weight);\n    }\n\n    public RealValueLabeledInstance(double label, StringKeyedVector vec) {\n        this(new RealValuedLabel(label), vec.getMap(), 1.0);\n    }\n\n    public RealValueLabeledInstance(double label, StringKeyedVector vec,\n            double weight) {\n        this(new RealValuedLabel(label), vec.getMap(), weight);\n    }\n\n    public RealValueLabeledInstance(RealValuedLabel label,\n            Map<String, Double> instance) {\n        this(label, instance, 1.0);\n    }\n\n    public RealValueLabeledInstance(RealValuedLabel label,\n            Map<String, Double> instance, double weight) {\n        super(instance, weight);\n        this.label = label;\n    }\n\n    public RealValueLabeledInstance(RealValuedLabel label, StringKeyedVector vec) {\n        this(label, 
vec, 1.0);\n    }\n\n    public RealValueLabeledInstance(RealValuedLabel label,\n            StringKeyedVector vec, double weight) {\n        super(vec.getMap(), weight);\n        this.label = label;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/RealValuedLabel.java",
    "content": "package com.etsy.conjecture.data;\n\npublic class RealValuedLabel extends Label {\n\n    protected final Double value;\n    private static final long serialVersionUID = -1L;\n\n    public RealValuedLabel(double value) {\n        this.value = new Double(value);\n    }\n\n    public Double getValue() {\n        return this.value;\n    }\n\n    @Override\n    public String toString() {\n        return value + \"\";\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/Recommendation.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.io.Serializable;\n\npublic class Recommendation implements Serializable {\n\n    private static final long serialVersionUID = 1L;\n\n    public final double score;\n    public final String id;\n\n    public Recommendation(String id, double score) {\n        this.id = id;\n        this.score = score;\n    }\n\n}"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/data/StringKeyedVector.java",
    "content": "package com.etsy.conjecture.data;\n\nimport gnu.trove.function.TDoubleFunction;\nimport gnu.trove.iterator.TObjectDoubleIterator;\n\nimport java.io.Serializable;\nimport java.util.Iterator;\nimport java.util.Map;\nimport java.util.Set;\n\nimport com.etsy.conjecture.Utilities;\nimport com.google.gson.Gson;\n\npublic class StringKeyedVector implements Serializable,\n        Iterable<Map.Entry<String, Double>> {\n\n    private static final long serialVersionUID = -7070522686694887436L;\n\n    // - represent the sparse vector by a mapping of coordinate name strings\n    // (feature names)\n    // to doubles.\n    protected ByteArrayDoubleHashMap vector;\n\n    // - whether to permit the addition of more features to this vector.\n    protected boolean freezeKeySet = false;\n\n    // - the load factor for the underlying hashmap.\n    public static final float LOAD_FACTOR = 0.9f;\n\n    public static final String FEATURE_ENCODING = \"ASCII\";\n\n    public StringKeyedVector() {\n        this(10);\n    }\n\n    public StringKeyedVector(int initialCapacity) {\n        vector = new ByteArrayDoubleHashMap(initialCapacity, LOAD_FACTOR,\n                FEATURE_ENCODING, 0.0);\n    }\n\n    public StringKeyedVector(StringKeyedVector skv) {\n        this(skv.size());\n        add(skv);\n    }\n\n    public StringKeyedVector(Map<String, Double> jmap) {\n        vector = new ByteArrayDoubleHashMap(jmap.size(), LOAD_FACTOR,\n                FEATURE_ENCODING, 0.0);\n        vector.putAll(jmap);\n    }\n\n    /**\n     * returns whether the key set is frozen (true means that further dimensions\n     * cannot be added to this vector).\n     */\n    public boolean getFreezeKeySet() {\n        return freezeKeySet;\n    }\n\n    /**\n     * sets whether the key set is frozen (true means that further dimensions\n     * cannot be added to this vector).\n     */\n    public void setFreezeKeySet(boolean freeze) {\n        freezeKeySet = freeze;\n    }\n\n    /**\n     * 
disregards prior value at a particular key, replacing with the specified\n     * value.\n     */\n    public double setCoordinate(String key, double value) {\n        if (Utilities.floatingPointEquals(value, 0d)) {\n            return deleteCoordinate(key);\n        } else if (!freezeKeySet) {\n            vector.putPrimitive(key, value);\n        }\n        return 0d;\n    }\n\n    /**\n     * remove a coordinate from the vector (same as setting it to 0).\n     */\n    public double deleteCoordinate(String key) {\n        if (vector.containsKey(key) && !freezeKeySet) {\n            return vector.removePrimitive(key);\n        } else {\n            return 0d;\n        }\n    }\n\n    public Map<String, Double> getMap() {\n        return vector;\n    }\n\n    /**\n     * add to a specified coordinate (treating it as 0 if it was not present).\n     */\n    public double addToCoordinate(String key, double value) {\n        byte[] bkey = vector.stringToByteArray(key);\n        return addToCoordinateInternal(bkey, value);\n    }\n\n    protected double addToCoordinateInternal(byte[] bkey, double value) {\n        if (vector.containsKey(bkey)) {\n            double updated = vector.getPrimitive(bkey) + value;\n            if (Utilities.floatingPointEquals(updated, 0.0d)) {\n                return vector.removePrimitive(bkey);\n            } else {\n                return vector.putPrimitive(bkey, updated);\n            }\n        } else if (!freezeKeySet && !Utilities.floatingPointEquals(value, 0.0d)) {\n            vector.putPrimitive(bkey, value);\n        }\n        return 0d;\n    }\n\n    /**\n     * return the value of a coordinate.\n     */\n    public double getCoordinate(String key) {\n        return vector.getPrimitive(key);\n    }\n\n    /**\n     * add a multiple of vec to this.\n     */\n    public void addScaled(StringKeyedVector vec, double scale) {\n        if (vec instanceof LazyVector) {\n            ((LazyVector)vec).delazify();\n        }\n        for 
(TObjectDoubleIterator<byte[]> it = vec.vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            addToCoordinateInternal(it.key(), scale * it.value());\n        }\n    }\n\n    public StringKeyedVector multiplyPointwise(StringKeyedVector vec) {\n        StringKeyedVector res = new StringKeyedVector();\n        if (vec instanceof LazyVector) {\n            ((LazyVector)vec).delazify();\n        }\n        for (TObjectDoubleIterator<byte[]> it = vec.vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            res.vector.putPrimitive(it.key(), vector.getPrimitive(it.key())\n                    * it.value());\n        }\n        return res;\n    }\n\n    public StringKeyedVector projectOntoNonZeroCoordinates(StringKeyedVector vec) {\n        StringKeyedVector res = new StringKeyedVector();\n        if (vec instanceof LazyVector) {\n            ((LazyVector)vec).delazify();\n        }\n        for (TObjectDoubleIterator<byte[]> it = vec.vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            res.addToCoordinateInternal(it.key(), vector.getPrimitive(it.key()));\n        }\n        return res;\n    }\n\n    /**\n     * the dimension of the vector.\n     */\n    public int size() {\n        return vector.size();\n    }\n\n    /**\n     * whether this vector has a non-zero value for a coordinate.\n     */\n    public boolean containsKey(String key) {\n        return vector.containsKey(key);\n    }\n\n    /**\n     * whether this vector has a non-zero value for a coordinate.\n     */\n    public boolean contains(String key) {\n        return containsKey(key);\n    }\n\n    /**\n     * the set of non-zero coordinate names.\n     */\n    public Set<String> keySet() {\n        return vector.keySet();\n    }\n\n    /**\n     * the set of values in the map.\n     */\n    public Set<Double> values() {\n        return vector.values();\n    }\n\n    /**\n     * 
add vec to this\n     */\n    public void add(StringKeyedVector vec) {\n        addScaled(vec, 1.0);\n    }\n\n    /**\n     * subtract vec from this.\n     */\n    public void sub(StringKeyedVector vec) {\n        addScaled(vec, -1.0);\n    }\n\n    /**\n     * multiply this vector by a scalar.\n     */\n    public void mul(final double a) {\n        transformValues(new TDoubleFunction() {\n            public double execute(double b) {\n                return a * b;\n            }\n        });\n    }\n\n    /**\n     * Apply an arbitrary scalar function to the values.\n     */\n    public void transformValues(TDoubleFunction func) {\n        vector.transformValues(func);\n    }\n\n    /**\n     * Remove zeros that may have appeared as a result of a transform\n     */\n    public void removeZeroCoordinates() {\n        @SuppressWarnings(\"unused\")\n        int i = 0;\n        for (TObjectDoubleIterator<byte[]> it = vector.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            if (Utilities.floatingPointEquals(it.value(), 0d)) {\n                i++;\n                it.remove();\n            }\n        }\n    }\n\n    /**\n     * compute the inner product between this and vec.\n     */\n    public double dot(StringKeyedVector vec) {\n        if (vec instanceof LazyVector) {\n            return vec.dot(this);\n        }\n        ByteArrayDoubleHashMap vec_small = this.size() > vec.size() ? vec.vector\n                : this.vector;\n        ByteArrayDoubleHashMap vec_big = this.size() > vec.size() ? 
this.vector\n                : vec.vector;\n        double res = 0.0;\n        for (TObjectDoubleIterator<byte[]> it = vec_small.troveIterator(); it\n                .hasNext();) {\n            it.advance();\n            if (vec_big.containsKey(it.key())) {\n                res += it.value() * vec_big.getPrimitive(it.key());\n            }\n        }\n        return res;\n    }\n\n    /**\n     * compute the LP norm for given p < infinity.\n     */\n    public double LPNorm(double p) {\n        double tot = 0d;\n        for (double v : vector.values()) {\n            tot += Math.pow(Math.abs(v), p);\n        }\n        return Math.pow(tot, 1d / p);\n    }\n\n    /**\n     * Find the max value.\n     */\n    public double max() {\n        double max = 0.0;\n        for (double v : vector.values()) {\n            if (v > max) {\n                max = v;\n            }\n        }\n        return max;\n    }\n\n    /**\n     * immutable access the underlying hash map.\n     */\n    public Iterator<Map.Entry<String, Double>> iterator() {\n        return vector.iterator();\n    }\n\n    public String toString() {\n        Gson gson = new Gson();\n        return gson.toJson(vector);\n    }\n\n    /**\n     * performs a deep copy of a stringkeyedvector\n     *\n     */\n    public StringKeyedVector copy() {\n        StringKeyedVector out = new StringKeyedVector(this.size());\n        Iterator<Map.Entry<String, Double>> it = this.iterator();\n\n        while (it.hasNext()) {\n            Map.Entry<String, Double> entry = it.next();\n            String key = entry.getKey();\n            Double value = entry.getValue();\n\n            out.setCoordinate(key, value);\n        }\n\n        return out;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/BinaryModelEvaluation.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.HashMap;\nimport java.util.Map;\nimport java.util.SortedMap;\nimport java.util.TreeMap;\n\nimport com.etsy.conjecture.data.BinaryLabel;\nimport com.etsy.conjecture.PrimitivePair;\n\n/**\n * a basic container for evaluations TODO: add getters for individual metrics\n */\npublic class BinaryModelEvaluation implements ModelEvaluation<BinaryLabel>,\n        Serializable {\n\n    private static final long serialVersionUID = 1L;\n    private final ReceiverOperatingCharacteristic ROC;\n    private final ConfusionMatrix conf;\n\n    public BinaryModelEvaluation() {\n        ROC = new ReceiverOperatingCharacteristic();\n        conf = new ConfusionMatrix(2);\n    }\n\n    public void merge(ModelEvaluation<BinaryLabel> other) {\n      BinaryModelEvaluation tempOther = (BinaryModelEvaluation) other;\n      ROC.add(tempOther.ROC);\n      conf.add(tempOther.conf);\n    }\n\n    public void add(BinaryLabel real, BinaryLabel pred) {\n        add(real.getValue(), pred.getValue());\n    }\n\n    public void add(double label, double prediction) {\n        ROC.add(label, prediction);\n        conf.addHard((int)label, prediction);\n    }\n\n    public void add(PrimitivePair labelPrediction) {\n        ROC.add(labelPrediction);\n        conf.addHard((int)labelPrediction.first, labelPrediction.second);\n    }\n\n    public double computeAUC() {\n        return ROC.binaryAUC();\n    }\n\n    public double computeBrier() {\n        return ROC.brierScore();\n    }\n\n    public double computeAccy() {\n        return conf.computeAccuracy();\n    }\n\n    public double computeAccy(int dim) {\n        return conf.computeAccuracy(dim);\n    }\n\n    public double computeFmeasure() {\n        return conf.computeAverageFmeasure();\n    }\n\n    public double computeFmeasure(int dim) {\n        return conf.computeFmeasure(dim);\n    }\n\n    public double computePrecision() {\n        return 
conf.computeAveragePrecision();\n    }\n\n    public double computePrecision(int dim) {\n        return conf.computePrecision(dim);\n    }\n\n    public double computeRecall() {\n        return conf.computeAverageRecall();\n    }\n\n    public double computeRecall(int dim) {\n        return conf.computeRecall(dim);\n    }\n\n    public Map<String, Double> getStatistics() {\n        SortedMap<String, Double> m = new TreeMap<String, Double>();\n\n        m.put(\"Brier\", computeBrier());\n        m.put(\"Acc (avg)\", computeAccy());\n        m.put(\"F1 (avg)\", computeFmeasure());\n        m.put(\"Prc (avg)\", computePrecision());\n        m.put(\"Rec (avg)\", computeRecall());\n\n        m.put(\"0-class Acc\", computeAccy(0));\n        m.put(\"0-class F1\", computeFmeasure(0));\n        m.put(\"0-class Prc\", computePrecision(0));\n        m.put(\"0-class Rec\", computeRecall(0));\n\n        m.put(\"1-class Acc\", computeAccy(1));\n        m.put(\"1-class F1\", computeFmeasure(1));\n        m.put(\"1-class Prc\", computePrecision(1));\n        m.put(\"1-class Rec\", computeRecall(1));\n        m.put(\"1-class AUC\", computeAUC());\n        return m;\n    }\n\n    public Map<String, Object> getObjects() {\n        Map<String, Object> m = new HashMap<String, Object>();\n        m.put(\"conf\", conf.toString());\n        return m;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/ConfusionMatrix.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.Collection;\n\nimport com.etsy.conjecture.PrimitivePair;\nimport static com.google.common.base.Preconditions.checkArgument;\n\n/**\n * class representing a confusion matrix for representing misclassification\n * errors.\n * {@link <a href=\"http://en.wikipedia.org/wiki/Confusion_matrix\">Confusion Matrix</a>}\n * \n * @author jattenberg\n */\npublic class ConfusionMatrix implements Serializable {\n\n    private static final long serialVersionUID = 1L;\n\n    /**\n     * The data structure representing the confusion matrix. rows correspond to\n     * labels, columns to predictions\n     */\n    private double[][] confMatrix;\n\n    /** The num_classes represented in the confusion matrix */\n    private final int numClasses;\n\n    /** The number of label / prediction pairs observed */\n    double obs;\n\n    /**\n     * Instantiates a new confusion matrix.\n     * \n     * @param classes\n     *            the number of target classes in the problem being considered\n     */\n    public ConfusionMatrix(int classes) {\n        obs = 0;\n        this.numClasses = classes;\n        this.confMatrix = new double[numClasses][numClasses];\n    }\n\n    public void add(ConfusionMatrix m) {\n        obs += m.obs;\n        for (int i = 0; i < numClasses; i++) {\n            for (int j = 0; j < numClasses; j++) {\n                confMatrix[i][j] += m.confMatrix[i][j];\n            }\n        }\n    }\n\n    /**\n     * Instantiates a new confusion matrix and adds some initial data\n     * \n     * @param classes\n     *            - the number of target classes in the problem being considered\n     * @param labelsAndPredictions\n     *            the labels and predictions\n     */\n    public ConfusionMatrix(int classes,\n            Collection<PrimitivePair> labelsAndPredictions) {\n        this(classes);\n        for (PrimitivePair p : labelsAndPredictions)\n           
 addInfo(p.first, p.second);\n    }\n\n    /**\n     * Instantiates a new confusion matrix and adds some initial data\n     * \n     * @param classes\n     *            - the number of target classes in the problem being considered\n     * @param labelsAndPredictions\n     *            the labels and predictions\n     */\n    public ConfusionMatrix(int classes, PrimitivePair[] labelsAndPredictions) {\n        this(classes);\n        for (PrimitivePair p : labelsAndPredictions)\n            addInfo(p.first, p.second);\n    }\n\n    /**\n     * Instantiates a new confusion matrix and adds some initial data\n     * \n     * @param classes\n     *            - the number of target classes in the problem being considered\n     * @param labelsAndPredictions\n     *            the labels and predictions\n     */\n    public ConfusionMatrix(int classes, double[] labels, double[] predictions) {\n        this(classes);\n        checkArgument(\n                labels.length == predictions.length,\n                \"labels and predictions must be of the same length! 
(%s vs %s)\",\n                labels.length, predictions.length);\n        for (int i = 0; i < labels.length; i++) {\n            addInfo(labels[i], predictions[i]);\n        }\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix\n     * \n     * @param label\n     *            the index of the actual class\n     * @param guess\n     *            the index of the predicted class\n     */\n    public void addInfo(int label, int guess) {\n        obs++;\n        this.confMatrix[label][guess]++;\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param label\n     *            the index of the actual class\n     * @param guess\n     *            the predicted distribution over classes.\n     */\n    public void addInfo(int label, double[] guess) {\n        addInfo(label, guess, 1);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param label\n     *            the index of the actual class\n     * @param guess\n     *            the predicted distribution over classes.\n     * @param freq\n     *            the number of times to consider the input label / prediction\n     *            pair\n     */\n    public void addInfo(int label, double[] guess, double freq) {\n        checkArgument(\n                guess.length == numClasses,\n                \"input length (%s) must match num classes in confusion matrix (%s)\",\n                guess.length, numClasses);\n        obs += freq;\n        for (int i = 0; i < numClasses; i++) {\n            confMatrix[label][i] += freq * guess[i];\n        }\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * note, only applicable for binary classification (2 class) problems\n     * \n     * @param label\n     *            the actual probability of membership in the positive class\n     * @param prediction\n     *        
    the predicted probability of membership in the positive class\n     */\n    public void addInfo(double label, double prediction) {\n        checkArgument(\n                2 == numClasses,\n                \"num classes in confusion matrix (%s) must be 2 for this method\",\n                numClasses);\n        addInfo(new double[] { 1. - label, label }, new double[] {\n                1. - prediction, prediction });\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param softlabels\n     *            actual distribution of target class memberships\n     * @param guess\n     *            the predicted distribution of class memberships\n     */\n    public void addInfo(double[] softlabels, double[] guess) {\n        obs++;\n        for (int i = 0; i < numClasses; i++) {\n            for (int j = 0; j < numClasses; j++) {\n                confMatrix[i][j] += softlabels[i] * guess[j];\n            }\n        }\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param softlabels\n     *            actual distribution of target class memberships\n     * @param guess\n     *            the predicted distribution of class memberships\n     * @param freq\n     *            the number of times to consider this label / prediction pair\n     */\n    public void addInfo(double[] softlabels, double[] guess, double freq) {\n        obs += freq;\n        for (int i = 0; i < numClasses; i++) {\n            for (int j = 0; j < numClasses; j++) {\n                confMatrix[i][j] += softlabels[i] * guess[j] * freq;\n            }\n        }\n    }\n\n    /**\n     * Computes the actual distribution over labels\n     * \n     * @return the double[] encoding probabilities in each class.\n     */\n    public double[] classDistribution() {\n        double[] dists = new double[this.numClasses];\n        for (int i = 0; i < numClasses; i++) {\n            
dists[i] = classDistribution(i);\n        }\n        return dists;\n    }\n\n    /**\n     * Computes the actual probability of membership in a particular class\n     * denoted by the input index\n     * \n     * @param num\n     *            index of the class of interest\n     * @return the probability of membership in the requested class\n     */\n    public double classDistribution(int num) {\n        double classSum = 0;\n        double totSum = 0;\n        for (int i = 0; i < numClasses; i++) {\n            for (int j = 0; j < numClasses; j++) {\n                if (i == num)\n                    classSum += confMatrix[i][j];\n                totSum += confMatrix[i][j];\n            }\n        }\n        return classSum / totSum;\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels\n     * \n     * @param softlabels\n     *            actual distribution of target class memberships\n     * @param guess\n     *            the predicted distribution of class memberships\n     * @param weight\n     *            the number of times to consider this label / prediction pair\n     */\n    public void addHard(double[] softlabels, double[] guess, double weight) {\n        addInfo(softToHard(softlabels), softToHard(guess), weight);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels\n     * \n     * @param softlabels\n     *            actual distribution of target class memberships\n     * @param guess\n     *            the predicted distribution of class memberships\n     */\n    public void addHard(double[] softlabels, double[] guess) {\n        addInfo(softToHard(softlabels), softToHard(guess));\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels note, only applicable for binary classification (2\n     * class) problems\n     * \n     * @param 
label\n     *            the index of the actual class of membership\n     * @param guess\n     *            the predicted distribution over classes\n     */\n    public void addHard(int label, double[] guess) {\n        addInfo(label, softToHard(guess));\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels note, only applicable for binary classification (2\n     * class) problems\n     * \n     * @param label\n     *            the index of the actual class of membership\n     * @param prediction\n     *            the predicted probability of membership in the positive class\n     */\n    public void addHard(int label, double prediction) {\n        addInfo(label,\n                softToHard(new double[] { 1.0 - prediction, prediction }));\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels\n     * \n     * @param label\n     *            the index of the actual class of membership\n     * @param guess\n     *            the predicted distribution over classes\n     * @param freq\n     *            the number of times this label / prediction pair should be\n     *            considered.\n     */\n    public void addHard(int label, double[] guess, double freq) {\n        addInfo(label, softToHard(guess), freq);\n    }\n\n    /**\n     * converts a soft prediction of probability estimates into a categorical\n     * indicator for the most likely class\n     * \n     * @param scores\n     *            probabilities of label class membership\n     * @return the categorical values, 0's for all target classes with a 1 for\n     *         the most likely class\n     */\n    private static double[] softToHard(double[] scores) {\n        int maxindex = 0;\n        double max = 0;\n        double[] out = new 
double[scores.length];\n        for (int i = 0; i < scores.length; i++) {\n            if (scores[i] > max) {\n                maxindex = i;\n                max = scores[i];\n            }\n        }\n        out[maxindex] = 1;\n        return out;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see java.lang.Object#toString()\n     */\n    @Override\n    public String toString() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"predicted:\\t\");\n        for (int i = 0; i < numClasses - 1; i++) {\n            buff.append(i + \"\\t\");\n        }\n        buff.append((numClasses - 1) + \"\\n\");\n        for (int i = 0; i < numClasses; i++) {\n            buff.append(\"actually \" + i + \":\\t\");\n            for (int j = 0; j < numClasses; j++) {\n                buff.append(String.format(\"%.4f\\t\", confMatrix[i][j]));\n            }\n            buff.append(\"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * To string row normalized (divided by the sum of each row)\n     * \n     * @return the string representation of the confusion matrix that has been\n     *         row normalized\n     */\n    public String toStringRowNormalized() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"predicted:\\t\");\n        for (int i = 0; i < numClasses - 1; i++) {\n            buff.append(i + \"\\t\");\n        }\n        double[] rowSums = this.rowSums();\n        buff.append((numClasses - 1) + \"\\n\");\n        for (int i = 0; i < numClasses; i++) {\n            buff.append(\"actually \" + i + \":\\t\");\n            for (int j = 0; j < numClasses; j++) {\n                String s = String.format(\"%.4f\\t\", confMatrix[i][j]\n                        / rowSums[i]);\n                buff.append(s);\n            }\n            buff.append(\"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * To string column normalized (divided by the sum of each column)\n   
  * \n     * @return the string representation of the confusion matrix that has been\n     *         column normalized\n     */\n    public String toStringColNormalized() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"predicted:\\t\");\n        for (int i = 0; i < numClasses - 1; i++) {\n            buff.append(i + \"\\t\");\n        }\n        double[] colSums = this.colSums();\n        buff.append((numClasses - 1) + \"\\n\");\n        for (int i = 0; i < numClasses; i++) {\n            buff.append(\"actually \" + i + \":\\t\");\n            for (int j = 0; j < numClasses; j++) {\n                String s = String.format(\"%.4f\\t\", confMatrix[i][j]\n                        / colSums[j]);\n                buff.append(s);\n            }\n            buff.append(\"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * Compute the sum of each row\n     * \n     * @return an array containing the sum of each row.\n     */\n    public double[] rowSums() {\n        double[] sums = new double[numClasses];\n        for (int i = 0; i < numClasses; i++) {\n            for (int j = 0; j < numClasses; j++) {\n                sums[i] += confMatrix[i][j];\n            }\n        }\n        return sums;\n    }\n\n    /**\n     * Compute the accuracy for a given class; the % of examples that have been\n     * correctly classified.\n     * \n     * @param classid\n     *            the index of the class where accuracy has been requested\n     * @return the % of correctly classified examples for the requested class\n     */\n    public double computeAccuracy(int classid) {\n        double tn = 0.;\n        for (int i = 0; i < numClasses; i++)\n            for (int j = 0; j < numClasses; j++)\n                if (j != classid && i != classid)\n                    tn += confMatrix[i][j];\n        double tp = confMatrix[classid][classid];\n        return (tn + tp) / obs;\n    }\n\n    public double computeAverageFmeasure() {\n      
  double[] rowSums = rowSums();\n        double total = total(rowSums);\n        double fmeasure = 0.;\n\n        for (int i = 0; i < numClasses; i++) {\n            fmeasure += rowSums[i] * computeFmeasure(i);\n        }\n        return fmeasure / total;\n    }\n\n    public double computeAveragePrecision() {\n        double[] rowSums = rowSums();\n        double total = total(rowSums);\n        double precision = 0.;\n\n        for (int i = 0; i < numClasses; i++) {\n            precision += rowSums[i] * computePrecision(i);\n        }\n        return precision / total;\n    }\n\n    public double computeAverageRecall() {\n        double[] rowSums = rowSums();\n        double total = total(rowSums);\n        double recall = 0.;\n\n        for (int i = 0; i < numClasses; i++) {\n            recall += rowSums[i] * computeRecall(i);\n        }\n        return recall / total;\n    }\n\n    /**\n     * Compute the sums of each column\n     * \n     * @return an array containing the sum of each column.\n     */\n    public double[] colSums() {\n        double[] sums = new double[numClasses];\n        for (int i = 0; i < numClasses; i++) {\n            for (int j = 0; j < numClasses; j++) {\n                sums[j] += confMatrix[i][j];\n            }\n        }\n        return sums;\n    }\n\n    /**\n     * Return the confusion matrix as a 2d array\n     * \n     * @return the confusion matrix data structure\n     */\n    public double[][] getMatrix() {\n        double[][] out = new double[numClasses][numClasses];\n        for (int i = 0; i < numClasses; i++)\n            for (int j = 0; j < numClasses; j++)\n                out[i][j] = confMatrix[i][j];\n        return out;\n    }\n\n    /**\n     * Gets the number of classes in the confusion matrix\n     * \n     * @return the number of classes considered\n     */\n    public int getDim() {\n        return this.numClasses;\n    }\n\n    /**\n     * Computes the accuracy over all observations for all classes (% of\n   
  * correctly labeled examples).\n     * \n     * @return accuracy over all classes.\n     */\n    public double computeAccuracy() {\n        double accy = 0;\n        double tot = 0;\n        double right = 0;\n        for (int i = 0; i < this.numClasses; i++) {\n            tot += total(this.confMatrix[i]);\n            right += this.confMatrix[i][i];\n        }\n        if (tot > 0) {\n            accy = right / tot;\n        }\n\n        return accy;\n    }\n\n    /**\n     * Compute the precision for each class; the % of examples labeled as\n     * belonging to each class that were actually members of that class\n     * \n     * @return an array containing the precision values for each class.\n     */\n    public double[] computePrecision() {\n        double[] precision = new double[this.numClasses];\n        for (int i = 0; i < this.numClasses; i++) {\n            double yes = 0;\n            double no = 0;\n            // column i holds every example predicted as class i\n            for (int j = 0; j < this.numClasses; j++) {\n                if (i == j)\n                    yes += confMatrix[j][i];\n                else\n                    no += confMatrix[j][i];\n            }\n            if (yes + no != 0)\n                precision[i] = yes / (yes + no);\n        }\n        return precision;\n    }\n\n    /**\n     * Compute the recall for each class; the % of members of each class that\n     * were labeled as class members\n     * \n     * @return an array containing the recall values for each class.\n     */\n    public double[] computeRecall() {\n        double[] recall = new double[this.numClasses];\n        double yes[] = new double[this.numClasses];\n        double no[] = new double[this.numClasses];\n        // row i holds every example whose actual class is i\n        for (int i = 0; i < this.numClasses; i++) {\n            for (int j = 0; j < this.numClasses; j++) {\n                if (i == j)\n                    yes[i] += confMatrix[i][j];\n                else\n                    no[i] += confMatrix[i][j];\n            }\n        }\n        for (int 
i = 0; i < numClasses; i++) {\n            if (yes[i] + no[i] != 0)\n                recall[i] = yes[i] / (yes[i] + no[i]);\n        }\n        return recall;\n    }\n\n    /**\n     * Computes the F-measure for each class; the harmonic mean of precision and\n     * recall\n     * {@link <a href=\"http://en.wikipedia.org/wiki/F_measure\">F-Measure</a>}\n     * for more info\n     * \n     * @return the array containing the F-measure for each class\n     */\n    public double[] computeFmeasure() {\n        double[] fmeasure = new double[numClasses];\n        double[] precision = this.computePrecision();\n        double[] recall = this.computeRecall();\n\n        for (int i = 0; i < this.numClasses; i++) {\n            if (recall[i] + precision[i] != 0)\n                fmeasure[i] = 2.0 * (precision[i] * recall[i])\n                        / (precision[i] + recall[i]);\n        }\n        return fmeasure;\n    }\n\n    /**\n     * Builds a string table containing the common IR measures, precision,\n     * recall, and F measure for each class\n     * \n     * @return the string with performance stats\n     */\n    public String getIR() {\n        StringBuffer buff = new StringBuffer();\n        buff.append(\"class\\t\" + \"precision\\t\" + \"recall\\t\" + \"F measure\\n\");\n        double[] precision = this.computePrecision();\n        double[] recall = this.computeRecall();\n        double[] fmeasure = this.computeFmeasure();\n\n        for (int i = 0; i < numClasses; i++) {\n            buff.append(i + \"\\t\" + precision[i] + \"\\t\" + recall[i] + \"\\t\"\n                    + fmeasure[i] + \"\\n\");\n        }\n        return buff.toString();\n\n    }\n\n    /**\n     * Computes precision for a given class; the % of members of belonging to\n     * each class that were labeled as class members\n     * \n     * @param dim\n     *            class of interest\n     * @return the precision for the requested class\n     */\n    public double computePrecision(int 
dim) {\n        double tot = 0;\n        for (int i = 0; i < numClasses; i++)\n            tot += confMatrix[i][dim];\n        return confMatrix[dim][dim] / tot;\n    }\n\n    /**\n     * Compute the recall for a given class; the % of members of belonging to\n     * each class that were labeled as class members\n     * \n     * @param dim\n     *            the class of interest\n     * @return the recall for the requested class\n     */\n    public double computeRecall(int dim) {\n        double tot = 0;\n        for (int i = 0; i < numClasses; i++)\n            tot += confMatrix[dim][i];\n        return confMatrix[dim][dim] / tot;\n    }\n\n    /**\n     * Computes the F-measure for a given class; the harmonic mean of precision\n     * and recall\n     * {@link <a href=\"http://en.wikipedia.org/wiki/F_measure\">F-Measure</a>}\n     * for more info\n     * \n     * \n     * @param dim\n     *            the class of interest\n     * @return the F-Measure of the requested class\n     */\n    public double computeFmeasure(int dim) {\n        double pre = computePrecision(dim);\n        double rec = computeRecall(dim);\n        return 2 * (pre * rec) / (pre + rec);\n    }\n\n    /**\n     * Total.\n     * \n     * @param arr\n     *            the arr\n     * @return the double\n     */\n    private double total(double[] arr) {\n        double total = 0;\n        for (int i = 0; i < arr.length; i++)\n            total += arr[i];\n        return total;\n    }\n\n    /**\n     * Builds a confusion matrix with the input observations and computes the\n     * accuracy over all observations for all classes (% of correctly labeled\n     * examples).\n     * \n     * \n     * @param input\n     *            the input label / prediction pairs\n     * @return the accuracy of the input values\n     */\n    public static double computeAccuracy(Collection<PrimitivePair> input) {\n        ConfusionMatrix conf = new ConfusionMatrix(2);\n        for (PrimitivePair p : input)\n       
     conf.addInfo(new double[] { 1. - p.first, p.first }, new double[] {\n                    1. - p.second, p.second });\n        return conf.computeAccuracy();\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/EvaluationAggregator.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.TreeMap;\n\nimport org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;\n\nimport com.etsy.conjecture.data.Label;\n\npublic class EvaluationAggregator<L extends Label> implements Serializable {\n\n    private static final long serialVersionUID = 5825037849957449364L;\n    protected Map<String, DescriptiveStatistics> stats = new TreeMap<String, DescriptiveStatistics>();\n    protected Map<String, List<Object>> obj = new HashMap<String, List<Object>>();\n\n    public void add(ModelEvaluation<L> eval) {\n        Map<String, Double> fold = eval.getStatistics();\n        if (!stats.isEmpty()) {\n            if (!fold.keySet().equals(stats.keySet())) {\n                throw new java.lang.RuntimeException(\n                        \"Tried to add incompatible folds, with fields:\"\n                                + fold.keySet().toString() + \" and \"\n                                + stats.keySet().toString());\n            }\n            for (Map.Entry<String, Double> e : fold.entrySet()) {\n                stats.get(e.getKey()).addValue(e.getValue());\n            }\n            for (Map.Entry<String, Object> e : eval.getObjects().entrySet()) {\n                obj.get(e.getKey()).add(e.getValue());\n            }\n        } else {\n            for (Map.Entry<String, Double> e : fold.entrySet()) {\n                DescriptiveStatistics ds = new DescriptiveStatistics();\n                ds.addValue(e.getValue());\n                stats.put(e.getKey(), ds);\n            }\n            for (Map.Entry<String, Object> e : eval.getObjects().entrySet()) {\n                obj.put(e.getKey(), new ArrayList<Object>(5));\n                obj.get(e.getKey()).add(e.getValue());\n            }\n        }\n    }\n\n    public double getValue(String key) {\n       
return stats.get(key).getMean();\n    }\n\n    @Override\n    public String toString() {\n        StringBuilder buff = new StringBuilder(\"Stat:\\tMean\\tStdDev\\tMedian\\n\");\n        for (Map.Entry<String, DescriptiveStatistics> e : stats.entrySet()) {\n            buff.append(e.getKey() + \":\\t\" + format(e.getValue()) + \"\\n\");\n        }\n        for (Map.Entry<String, List<Object>> e : obj.entrySet()) {\n            buff.append(e.getKey()).append(\":\\n\");\n            for (Object o : e.getValue()) {\n                buff.append(\"----\\n\").append(o.toString()).append(\"\\n\");\n            }\n        }\n        return buff.toString();\n    }\n\n    private String format(DescriptiveStatistics stats) {\n        return String.format(\"%.4f\\t%.4f\\t%.4f\", stats.getMean(),\n                stats.getStandardDeviation(), stats.getPercentile(50));\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/ModelEvaluation.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport com.etsy.conjecture.data.Label;\n\nimport java.util.Map;\n\npublic interface ModelEvaluation<L extends Label> {\n    public void add(L real, L predicted);\n\n    public Map<String, Double> getStatistics();\n\n    public Map<String, Object> getObjects();\n\n    public void merge(ModelEvaluation<L> other);\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/MulticlassConfusionMatrix.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.Map;\nimport java.util.Set;\nimport java.util.SortedMap;\nimport java.util.TreeMap;\n\n/**\n * class representing a confusion matrix for representing misclassification\n * errors.\n * {@link <a href=\"http://en.wikipedia.org/wiki/Confusion_matrix\">Confusion Matrix</a>}\n * \n * @author jattenberg\n */\npublic class MulticlassConfusionMatrix implements Serializable {\n\n    private static final long serialVersionUID = 1L;\n\n    /**\n     * The data structure representing the confusion matrix. rows correspond to\n     * labels, columns to predictions\n     */\n    private final SortedMap<String, SortedMap<String, Double>> confusionMatrix;\n\n    /** The num_classes represented in the confusion matrix */\n    private final int numClasses;\n\n    /** The number of label / prediction pairs observed */\n    double obs;\n\n    /**\n     * Instantiates a new confusion matrix.\n     * \n     * @param classes\n     *            the number of target classes in the problem being considered\n     */\n    public MulticlassConfusionMatrix(String[] categories) {\n        obs = 0;\n        this.numClasses = categories.length;\n        confusionMatrix = initializeMatrix(categories);\n    }\n\n    public void add(MulticlassConfusionMatrix m) {\n        obs += m.obs;\n        for(Map.Entry<String,SortedMap<String, Double>> entry : m.confusionMatrix.entrySet()) {\n            String label = entry.getKey();\n            SortedMap<String, Double> value = entry.getValue();\n            for(Map.Entry<String,Double> inner_entry : value.entrySet()) {\n                String inner_label = inner_entry.getKey();\n                Double update = inner_entry.getValue();\n                confusionMatrix.get(label).put(inner_label, update + getValue(label, inner_label));\n            }\n        }\n    }\n\n    private static SortedMap<String, SortedMap<String, Double>> initializeMatrix(\n     
       Set<String> categories) {\n        String[] catArray = new String[categories.size()];\n        int ct = 0;\n        for (String category : categories) {\n            catArray[ct++] = category;\n        }\n        return initializeMatrix(catArray);\n    }\n\n    private static SortedMap<String, SortedMap<String, Double>> initializeMatrix(\n            String[] categories) {\n        SortedMap<String, SortedMap<String, Double>> conf = new TreeMap<String, SortedMap<String, Double>>();\n\n        for (String categoryOuter : categories) {\n            conf.put(categoryOuter, new TreeMap<String, Double>());\n            for (String categoryInner : categories) {\n                conf.get(categoryOuter).put(categoryInner, 0d);\n            }\n        }\n        return conf;\n    }\n\n    private Double getValue(String label, String guess) {\n        return confusionMatrix.get(label).get(guess);\n    }\n\n    private void updateConfusionMatrix(String label, String guess, double value) {\n        confusionMatrix.get(label).put(guess, value + getValue(label, guess));\n    }\n\n    private Map<String, Double> initializeProbabilityMatrix() {\n        Map<String, Double> probs = new TreeMap<String, Double>();\n        for (String category : confusionMatrix.keySet()) {\n            probs.put(category, 0d);\n        }\n        return probs;\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix\n     * \n     * @param label\n     *            the index of the actual class\n     * @param guess\n     *            the index of the predicted class\n     */\n    public void addInfo(String label, String guess) {\n        obs++;\n        updateConfusionMatrix(label, guess, 1d);\n    }\n\n    public void addInfo(String label, String guess, double freq) {\n        obs += freq;\n        updateConfusionMatrix(label, guess, freq);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param label\n   
  *            the name of the actual class\n     * @param guesses\n     *            the predicted distribution over classes.\n     */\n    public void addInfo(String label, Map<String, Double> guesses) {\n        addInfo(label, guesses, 1d);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param label\n     *            the name of the actual class\n     * @param predictions\n     *            the predicted distribution over classes.\n     * @param freq\n     *            the number of times to consider the input label / prediction\n     *            pair\n     */\n    public void addInfo(String label, Map<String, Double> predictions,\n            double freq) {\n        // TODO: ensure that sets match\n        obs += freq;\n        for (String category : predictions.keySet()) {\n            // weight each predicted probability by freq so that the cell\n            // totals stay consistent with obs\n            updateConfusionMatrix(label, category,\n                    freq * predictions.get(category));\n        }\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param labels\n     *            actual distribution of target class memberships\n     * @param predictions\n     *            the predicted distribution of class memberships\n     */\n    public void addInfo(Map<String, Double> labels,\n            Map<String, Double> predictions) {\n        addInfo(labels, predictions, 1d);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with soft labels\n     * \n     * @param labels\n     *            actual distribution of target class memberships\n     * @param predictions\n     *            the predicted distribution of class memberships\n     * @param freq\n     *            the number of times to consider this label / prediction pair\n     */\n    public void addInfo(Map<String, Double> labels,\n            Map<String, Double> predictions, double freq) {\n        obs += freq;\n        for (String categoryLabel : labels.keySet()) {\n            for (String 
categoryGuess : predictions.keySet()) {\n                // spread freq over the cells in proportion to both distributions\n                updateConfusionMatrix(categoryLabel, categoryGuess, freq\n                        * labels.get(categoryLabel)\n                        * predictions.get(categoryGuess));\n            }\n        }\n    }\n\n    /**\n     * Computes the actual distribution over labels\n     * \n     * @return a {@code Map<String, Double>} encoding the probability of each class.\n     */\n    public Map<String, Double> classDistribution() {\n        Map<String, Double> dists = initializeProbabilityMatrix();\n        for (String category : dists.keySet()) {\n            dists.put(category, classDistribution(category));\n        }\n        return dists;\n    }\n\n    /**\n     * Computes the actual probability of membership in a particular class:\n     * the share of all observations whose true label is the given category.\n     * TODO: implement this more efficiently\n     * \n     * @param category\n     *            name of the class of interest\n     * @return the probability of membership in the requested class\n     */\n    public double classDistribution(String category) {\n        double classSum = 0;\n        double totSum = 0;\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                // rows are true labels, so sum the row of the requested class\n                if (categoryLabel.equals(category)) {\n                    classSum += getValue(categoryLabel, categoryPrediction);\n                }\n                totSum += getValue(categoryLabel, categoryPrediction);\n            }\n        }\n        return totSum > 0d ? 
classSum / totSum : 0d;\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels\n     * \n     * @param softlabels\n     *            actual distribution of target class memberships\n     * @param predictions\n     *            the predicted distribution of class memberships\n     * @param weight\n     *            the number of times to consider this label / prediction pair\n     */\n    public void addHard(Map<String, Double> softlabels,\n            Map<String, Double> predictions, double weight) {\n        addInfo(softToHard(softlabels), softToHard(predictions), weight);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix with hard (most\n     * likely class) labels\n     * \n     * @param softlabels\n     *            actual distribution of target class memberships\n     * @param predictions\n     *            the predicted distribution of class memberships\n     */\n    public void addHard(Map<String, Double> softlabels,\n            Map<String, Double> predictions) {\n        addInfo(softToHard(softlabels), softToHard(predictions), 1d);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix, reducing the\n     * predicted distribution to its hard (most likely class) label\n     * \n     * @param label\n     *            the name of the actual class of membership\n     * @param guess\n     *            the predicted distribution over classes\n     */\n    public void addHard(String label, Map<String, Double> guess) {\n        addInfo(label, softToHard(guess), 1d);\n    }\n\n    /**\n     * Adds a label / prediction pair to the confusion matrix, reducing the\n     * predicted distribution to its hard (most likely class) label\n     * \n     * @param label\n     *            the name of the actual class of membership\n     * @param guess\n     *           
 the predicted distribution over classes\n     * @param freq\n     *            the number of times this label / prediction pair should be\n     *            considered.\n     */\n    public void addHard(String label, Map<String, Double> guess, double freq) {\n        addInfo(label, softToHard(guess), freq);\n    }\n\n    /**\n     * converts a soft prediction of probability estimates into the name of\n     * the most likely class\n     * \n     * @param scores\n     *            probabilities of label class membership\n     * @return the name of the class with the highest score, or null if\n     *         scores is empty\n     */\n    private static String softToHard(Map<String, Double> scores) {\n        String maxindex = null;\n        double max = Double.NEGATIVE_INFINITY;\n        for (String category : scores.keySet()) {\n            if (scores.get(category) > max) {\n                max = scores.get(category);\n                maxindex = category;\n            }\n        }\n        return maxindex;\n    }\n\n    public String printDebug() {\n        return \"\";\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see java.lang.Object#toString()\n     */\n    public String toString() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"predicted:\\t\");\n        for (String category : confusionMatrix.keySet()) {\n            buff.append(category + \"\\t\");\n        }\n        buff.append(\"\\n\");\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            buff.append(\"actually \" + categoryLabel + \":\\t\");\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                buff.append(String.format(\"%.4f\\t\",\n                        getValue(categoryLabel, categoryPrediction)));\n            }\n            buff.append(\"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * To string row normalized (divided by 
the sum of each row)\n     * \n     * @return the string representation of the confusion matrix that has been\n     *         row normalized\n     */\n    public String toStringRowNormalized() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"predicted:\\t\");\n        for (String category : confusionMatrix.keySet()) {\n            buff.append(category + \"\\t\");\n        }\n        Map<String, Double> rowSums = rowSums();\n        buff.append(\"\\n\");\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            buff.append(\"actually \" + categoryLabel + \":\\t\");\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                String s = String.format(\n                        \"%.4f\\t\",\n                        getValue(categoryLabel, categoryPrediction)\n                                / rowSums.get(categoryLabel));\n                buff.append(s);\n            }\n            buff.append(\"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * To string column normalized (divided by the sum of each column)\n     * \n     * @return the string representation of the confusion matrix that has been\n     *         column normalized\n     */\n    public String toStringColNormalized() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"predicted:\\t\");\n        for (String category : confusionMatrix.keySet()) {\n            buff.append(category + \"\\t\");\n        }\n        Map<String, Double> colSums = colSums();\n        buff.append(\"\\n\");\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            buff.append(\"actually \" + categoryLabel + \":\\t\");\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                String s = String.format(\n                        \"%.4f\\t\",\n                        getValue(categoryLabel, categoryPrediction)\n                                / 
colSums.get(categoryPrediction));\n                buff.append(s);\n            }\n            buff.append(\"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * Compute the sum of each row\n     * \n     * @return a map containing the sum of each row, keyed by label.\n     */\n    public Map<String, Double> rowSums() {\n        Map<String, Double> sums = initializeProbabilityMatrix();\n\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                sums.put(\n                        categoryLabel,\n                        sums.get(categoryLabel)\n                                + getValue(categoryLabel, categoryPrediction));\n            }\n        }\n        return sums;\n    }\n\n    /**\n     * Compute the accuracy for a given class; the % of examples that have been\n     * correctly classified (true positives plus true negatives).\n     * \n     * @param classId\n     *            the name of the class where accuracy has been requested\n     * @return the % of correctly classified examples for the requested class\n     */\n    public double computeAccuracy(String classId) {\n        double tn = 0.;\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                if (!categoryLabel.equals(classId)\n                        && !categoryPrediction.equals(classId))\n                    tn += getValue(categoryLabel, categoryPrediction);\n            }\n        }\n\n        double tp = getValue(classId, classId);\n        return (tn + tp) / obs;\n    }\n\n    public double computeAverageFmeasure() {\n        Map<String, Double> rowSums = rowSums();\n        double total = total(rowSums);\n        double fmeasure = 0.;\n\n        for (String category : confusionMatrix.keySet()) {\n            double f = computeFmeasure(category);\n            if (!Double.isNaN(f)) { // f is NaN when precision or recall is\n                                    // undefined or both are 0\n                fmeasure += rowSums.get(category) * f;\n            }\n        }\n        return fmeasure / total;\n    }\n\n    
public double computeAveragePrecision() {\n        Map<String, Double> rowSums = rowSums();\n        double total = total(rowSums);\n        double precision = 0.;\n\n        for (String category : confusionMatrix.keySet()) {\n            double pre = computePrecision(category);\n            if (!Double.isNaN(pre)) { // when nothing is predicted as category,\n                                      // pre is NaN\n                precision += rowSums.get(category) * pre;\n            }\n        }\n        return precision / total;\n    }\n\n    public double computeAverageRecall() {\n        Map<String, Double> rowSums = rowSums();\n        double total = total(rowSums);\n        double recall = 0.;\n\n        for (String category : confusionMatrix.keySet()) {\n            double re = computeRecall(category);\n            if (!Double.isNaN(re)) { // re is NaN when there are 0 examples with\n                                     // label of category\n                recall += rowSums.get(category) * re;\n            }\n        }\n        return recall / total;\n    }\n\n    /**\n     * Compute the sums of each column\n     * \n     * @return an array containing the sum of each column.\n     */\n    public Map<String, Double> colSums() {\n        Map<String, Double> sums = initializeProbabilityMatrix();\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                sums.put(categoryPrediction, sums.get(categoryPrediction)\n                        + getValue(categoryLabel, categoryPrediction));\n            }\n        }\n        return sums;\n    }\n\n    /**\n     * Return the confusion matrix as a 2d array\n     * \n     * @return the confusion matrix data structure\n     */\n    public SortedMap<String, SortedMap<String, Double>> getMatrix() {\n        SortedMap<String, SortedMap<String, Double>> out = initializeMatrix(confusionMatrix\n                .keySet());\n        for 
(String categoryLabel : confusionMatrix.keySet()) {\n            for (String categoryPrediction : confusionMatrix.keySet()) {\n                out.get(categoryLabel).put(categoryPrediction,\n                        getValue(categoryLabel, categoryPrediction));\n            }\n        }\n        return out;\n    }\n\n    /**\n     * Gets the number of classes in the confusion matrix\n     * \n     * @return the number of classes considered\n     */\n    public int getDim() {\n        return this.numClasses;\n    }\n\n    /**\n     * Computes the accuracy over all observations for all classes (% of\n     * correctly labeled examples).\n     * \n     * @return accuracy over all classes.\n     */\n    public double computeAccuracy() {\n        double accy = 0d;\n        double tot = 0d;\n        double right = 0d;\n        Map<String, Double> rowSums = rowSums();\n        for (String category : confusionMatrix.keySet()) {\n            tot += rowSums.get(category);\n            right += getValue(category, category);\n        }\n        if (tot > 0) {\n            accy = right / tot;\n        }\n\n        return accy;\n    }\n\n    /**\n     * Compute the precision for each class; the % of members of labeled as\n     * belonging to each class who were actually members of that class\n     * \n     * @return an array containing the precision values for each class.\n     */\n    public Map<String, Double> computePrecision() {\n        Map<String, Double> precision = initializeProbabilityMatrix();\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            precision.put(categoryLabel, computePrecision(categoryLabel));\n        }\n        return precision;\n    }\n\n    /**\n     * Compute the recall for each class; the % of members of belonging to each\n     * class that were labeled as class members\n     * \n     * @return an array containing the recall values for each class.\n     */\n    public Map<String, Double> computeRecall() {\n        Map<String, 
Double> recall = initializeProbabilityMatrix();\n        for (String categoryLabel : confusionMatrix.keySet()) {\n            recall.put(categoryLabel, computeRecall(categoryLabel));\n        }\n        return recall;\n    }\n\n    /**\n     * Computes the F-measure for each class; the harmonic mean of precision and\n     * recall\n     * {@link <a href=\"http://en.wikipedia.org/wiki/F_measure\">F-Measure</a>}\n     * for more info\n     * \n     * @return the array containing the F-measure for each class\n     */\n    public Map<String, Double> computeFmeasure() {\n        Map<String, Double> fmeasure = initializeProbabilityMatrix();\n        Map<String, Double> precision = this.computePrecision();\n        Map<String, Double> recall = this.computeRecall();\n\n        for (String category : confusionMatrix.keySet()) {\n            if (recall.get(category) + precision.get(category) != 0)\n                fmeasure.put(\n                        category,\n                        2.0\n                                * (precision.get(category) * recall\n                                        .get(category))\n                                / (precision.get(category) + recall\n                                        .get(category)));\n        }\n        return fmeasure;\n    }\n\n    /**\n     * Builds A String Table Containing The common IR measures, precision,\n     * recall, and F measure for each class\n     * \n     * @return the string with performance stats\n     */\n    public String getIR() {\n        StringBuffer buff = new StringBuffer();\n        buff.append(\"class\\t\" + \"precision\\t\" + \"recall\\t\" + \"F measure\\n\");\n        Map<String, Double> precision = this.computePrecision();\n        Map<String, Double> recall = this.computeRecall();\n        Map<String, Double> fmeasure = this.computeFmeasure();\n\n        for (String category : confusionMatrix.keySet()) {\n            buff.append(category + \"\\t\" + precision.get(category) + \"\\t\"\n     
               + recall.get(category) + \"\\t\" + fmeasure.get(category)\n                    + \"\\n\");\n        }\n        return buff.toString();\n\n    }\n\n    /**\n     * Computes precision for a given class; the % of examples labeled as\n     * belonging to the class that were actually members of it. The result is\n     * NaN when nothing was predicted as the class.\n     * \n     * @param category\n     *            class of interest\n     * @return the precision for the requested class\n     */\n    public double computePrecision(String category) {\n        double tot = 0;\n        for (String label : confusionMatrix.keySet()) {\n            tot += getValue(label, category);\n        }\n        return getValue(category, category) / tot;\n    }\n\n    /**\n     * Compute the recall for a given class; the % of actual members of the\n     * class that were labeled as class members. The result is NaN when no\n     * example has the class as its label.\n     * \n     * @param category\n     *            the class of interest\n     * @return the recall for the requested class\n     */\n    public double computeRecall(String category) {\n        double tot = 0;\n        for (String prediction : confusionMatrix.keySet())\n            tot += getValue(category, prediction);\n        return getValue(category, category) / tot;\n    }\n\n    /**\n     * Computes the F-measure for a given class; the harmonic mean of precision\n     * and recall\n     * {@link <a href=\"http://en.wikipedia.org/wiki/F_measure\">F-Measure</a>}\n     * for more info\n     * \n     * \n     * @param category\n     *            the class of interest\n     * @return the F-Measure of the requested class\n     */\n    public double computeFmeasure(String category) {\n        double pre = computePrecision(category);\n        double rec = computeRecall(category);\n        return 2 * (pre * rec) / (pre + rec);\n    }\n\n    /**\n     * Sums the values of a map.\n     * \n     * @param probs\n     *            the map whose values are summed\n     * @return the sum of all values\n     */\n    private double total(Map<String, Double> probs) {\n        double total = 0;\n        for (String category : 
probs.keySet())\n            total += probs.get(category);\n        return total;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/MulticlassModelEvaluation.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.HashMap;\nimport java.util.Map;\nimport java.util.SortedMap;\nimport java.util.TreeMap;\n\nimport com.etsy.conjecture.GenericPair;\nimport com.etsy.conjecture.data.MulticlassLabel;\nimport com.etsy.conjecture.data.MulticlassPrediction;\n\n/**\n * a basic container for evaluations TODO: add getters for individual metrics\n */\n\npublic class MulticlassModelEvaluation implements Serializable,\n        ModelEvaluation<MulticlassLabel> {\n\n\n    /**\n     * \n     */\n    private static final long serialVersionUID = 4916724871985109129L;\n    private final MulticlassReceiverOperatingCharacteristic ROC;\n    private final MulticlassConfusionMatrix conf;\n    private final String[] categories;\n\n    public MulticlassModelEvaluation(String[] categories) {\n        this.categories = categories;\n        ROC = new MulticlassReceiverOperatingCharacteristic(categories);\n        conf = new MulticlassConfusionMatrix(categories);\n    }\n\n    public void add(String label, MulticlassPrediction prediction) {\n        ROC.add(label, prediction);\n        conf.addInfo(label, prediction.getLabel());\n    }\n\n    public void merge(ModelEvaluation<MulticlassLabel> other) {\n        MulticlassModelEvaluation tempOther = (MulticlassModelEvaluation) other;\n        ROC.add(tempOther.ROC);\n        conf.add(tempOther.conf);\n    }\n    \n    public void add(GenericPair<String, MulticlassPrediction> labelPrediction) {\n        add(labelPrediction.first, labelPrediction.second);\n    }\n\n    public void add(MulticlassLabel real, MulticlassLabel pred) {\n        if (!(pred instanceof MulticlassPrediction)) {\n            throw new java.lang.RuntimeException(\n                    \"MulticlassModelEvaluation needs a MulticlassPrediction\");\n        }\n        add(real.getLabel(), (MulticlassPrediction)pred);\n    }\n\n    public double computeAUC() {\n        return 
ROC.multiclassAUC();\n    }\n\n    public double computeAUC(String dim) {\n        return ROC.singleClassAUC(dim);\n    }\n\n    public double computeBrier() {\n        return ROC.multiclassBrierScore();\n    }\n\n    public double computeAccy() {\n        return conf.computeAccuracy();\n    }\n\n    public double computeAccy(String dim) {\n        return conf.computeAccuracy(dim);\n    }\n\n    public double computeFmeasure() {\n        return conf.computeAverageFmeasure();\n    }\n\n    public double computeFmeasure(String dim) {\n        return conf.computeFmeasure(dim);\n    }\n\n    public double computePrecision() {\n        return conf.computeAveragePrecision();\n    }\n\n    public double computePrecision(String dim) {\n        return conf.computePrecision(dim);\n    }\n\n    public double computeRecall() {\n        return conf.computeAverageRecall();\n    }\n\n    public double computeRecall(String dim) {\n        return conf.computeRecall(dim);\n    }\n\n    public double computePercent(String dim) {\n        return ROC.computePercent(dim);\n    }\n\n    public String printDebug() {\n        return conf.printDebug();\n    }\n\n    public Map<String, Double> getStatistics() {\n        SortedMap<String, Double> m = new TreeMap<String, Double>();\n\n        m.put(\"AUC (avg)\", computeAUC());\n        m.put(\"Brier (avg)\", computeBrier());\n        m.put(\"Acc (avg)\", computeAccy());\n        m.put(\"F1 (avg)\", computeFmeasure());\n        m.put(\"Prc (avg)\", computePrecision());\n        m.put(\"Rec (avg)\", computeRecall());\n\n        for (String category : categories) {\n            m.put(category + \": Pct\", computePercent(category));\n            m.put(category + \": AUC\", computeAUC(category));\n            m.put(category + \": Acc\", computeAccy(category));\n            m.put(category + \": F1\", computeFmeasure(category));\n            m.put(category + \": Prc\", computePrecision(category));\n            m.put(category + \": Rec\", 
computeRecall(category));\n        }\n\n        return m;\n    }\n\n    public String toString() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"AUC: \" + ROC.multiclassAUC() + \"\\n\");\n        buff.append(\"Brier: \" + ROC.multiclassBrierScore() + \"\\n\");\n        buff.append(\"IR metrics:\\n\" + conf.getIR() + \"\\n\");\n        buff.append(\"Confusion Matrix:\\n\" + conf.toString() + \"\\n\");\n        return buff.toString();\n    }\n\n    public HashMap<String, Object> getObjects() {\n        HashMap<String, Object> m = new HashMap<String, Object>();\n        m.put(\"conf\", conf.toString());\n        return m;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/MulticlassReceiverOperatingCharacteristic.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.Collection;\nimport java.util.Map;\nimport java.util.HashMap;\n\nimport com.etsy.conjecture.GenericPair;\nimport com.etsy.conjecture.data.MulticlassPrediction;\nimport static com.google.common.base.Preconditions.checkArgument;\n\npublic class MulticlassReceiverOperatingCharacteristic implements Serializable {\n\n    private static final long serialVersionUID = 1L;\n\n    /** Num examples in each class. */\n    private Map<String, Integer> classCounts;\n\n    /** Num total examples */\n    private int numExamples;\n\n    /** Binary ROCs for each class */\n    private Map<String, ReceiverOperatingCharacteristic> classROC;\n\n    /**\n     * Instantiates a new receiver operating characteristic.\n     */\n    public MulticlassReceiverOperatingCharacteristic(String[] categories) {\n        classROC = new HashMap<String, ReceiverOperatingCharacteristic>();\n        classCounts = new HashMap<String, Integer>();\n        for (String category : categories) {\n            classROC.put(category, new ReceiverOperatingCharacteristic());\n            classCounts.put(category, 0);\n        }\n    }\n\n    public void add(MulticlassReceiverOperatingCharacteristic other) {\n        numExamples += other.numExamples;\n        for(Map.Entry<String,Integer> entry : other.classCounts.entrySet()) {\n            String category = entry.getKey();\n            Integer count = entry.getValue();\n            classCounts.put(category, classCounts.get(category)+count);\n        }\n\n        for(Map.Entry<String,ReceiverOperatingCharacteristic> entry : other.classROC.entrySet()) {\n            String category = entry.getKey();\n            ReceiverOperatingCharacteristic update = entry.getValue();\n            ReceiverOperatingCharacteristic roc = classROC.get(category);\n            roc.add(update);\n            classROC.put(category, roc);\n        }\n    }\n\n    public void 
add(GenericPair<String, MulticlassPrediction> labelPrediction) {\n        add(labelPrediction.first, labelPrediction.second);\n    }\n\n    public void add(String label, MulticlassPrediction prediction) {\n        checkArgument(classCounts.containsKey(label),\n                \"label is of unknown category: %s\", label);\n        checkArgument(classROC.containsKey(label),\n                \"label is of unknown category: %s\", label);\n\n        // accum class counts\n        int count = classCounts.get(label);\n        classCounts.put(label, count + 1);\n\n        // accum total counts;\n        numExamples++;\n\n        // add to individual binary ROC classes\n        for (String category : classCounts.keySet()) {\n            double binaryPrediction = prediction.getMap().get(category);\n            double classLabel = category.equals(label) ? 1d : 0d;\n            classROC.get(category).add(classLabel, binaryPrediction);\n        }\n    }\n\n    public double multiclassAUC() {\n        double weightedAverageAUC = 0d;\n        for (String label : classCounts.keySet()) {\n            double classInfluence = (double)classCounts.get(label)\n                    / numExamples;\n            ReceiverOperatingCharacteristic roc = classROC.get(label);\n            double classAUC = roc.binaryAUC();\n            weightedAverageAUC += classInfluence * classAUC;\n        }\n        return weightedAverageAUC;\n    }\n\n    public double singleClassAUC(String category) {\n        return classROC.get(category).binaryAUC();\n    }\n\n    public double multiclassBrierScore() {\n        double brierScore = 0d;\n        int numClasses = classCounts.keySet().size();\n        for (String label : classCounts.keySet()) {\n            brierScore += (classROC.get(label)).brierScore();\n        }\n        return brierScore / numClasses;\n    }\n\n    public double computePercent(String category) {\n        return classCounts.get(category) / (double) numExamples;\n    }\n\n    public static 
double computeAUC(\n            Collection<GenericPair<String, MulticlassPrediction>> labelsAndPredictions,\n            String[] categories) {\n        MulticlassReceiverOperatingCharacteristic roc = new MulticlassReceiverOperatingCharacteristic(\n                categories);\n        for (GenericPair<String, MulticlassPrediction> p : labelsAndPredictions) {\n            roc.add(p.first, p.second);\n        }\n        return roc.multiclassAUC();\n    }\n\n    public static double computeBrierScore(\n            Collection<GenericPair<String, MulticlassPrediction>> labelsAndPredictions,\n            String[] categories) {\n        MulticlassReceiverOperatingCharacteristic roc = new MulticlassReceiverOperatingCharacteristic(\n                categories);\n        for (GenericPair<String, MulticlassPrediction> p : labelsAndPredictions) {\n            roc.add(p.first, p.second);\n        }\n        return roc.multiclassBrierScore();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/ReceiverOperatingCharacteristic.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.Collection;\nimport java.util.Comparator;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.NavigableMap;\nimport java.util.TreeMap;\n\nimport com.etsy.conjecture.PrimitivePair;\n\npublic class ReceiverOperatingCharacteristic implements Serializable {\n\n    private static class NumComparator implements Comparator<Double>,\n            Serializable {\n        private static final long serialVersionUID = 6569477679353298040L;\n\n        @Override\n        public int compare(Double o1, Double o2) {\n            return -o1.compareTo(o2);\n        }\n    }\n\n    private static final long serialVersionUID = 1L;\n    private static NumComparator numComparator = new NumComparator();\n\n    /**\n     * map for storing the number of positive and negative labeled examples for\n     * each prediction\n     */\n    private NavigableMap<Double, int[]> examples;\n\n    /** The pos. */\n    private double pos = 0;\n\n    /** The neg. 
*/\n    private double neg = 0;\n\n    /**\n     * Instantiates a new receiver operating characteristic.\n     */\n    public ReceiverOperatingCharacteristic() {\n        examples = new TreeMap<Double, int[]>(numComparator);\n    }\n\n    /**\n     * Merge two ROCs together.\n     */\n    public void add(ReceiverOperatingCharacteristic r) {\n        for (Map.Entry<Double, int[]> entry : r.examples.entrySet()) {\n            increment(entry.getKey(), entry.getValue());\n        }\n        pos += r.pos;\n        neg += r.neg;\n    }\n\n    /**\n     * increments count values\n     */\n    private void increment(Double key, int[] value) {\n        if (!examples.containsKey(key)) {\n            examples.put(key, value);\n        } else {\n            int[] oldVals = examples.get(key);\n            oldVals[0] += value[0];\n            oldVals[1] += value[1];\n            examples.put(key, oldVals);\n        }\n    }\n\n    private void increment(Double prediction, double label) {\n\n        if (label > 0.5) {\n            pos++;\n            increment(prediction, new int[] { 1, 0 });\n        } else {\n            neg++;\n            increment(prediction, new int[] { 0, 1 });\n        }\n    }\n\n    /**\n     * Adds the.\n     * \n     * @param label\n     *            the label\n     * @param prediction\n     *            the prediction\n     */\n    public void add(double label, double prediction) {\n        increment(prediction, label);\n    }\n\n    /* pair should be in label, prediction order */\n    public void add(PrimitivePair pair) {\n        add(pair.first, pair.second);\n    }\n\n    /**\n     * Roc.\n     * \n     * @return the double[][]\n     */\n    public double[][] ROC() {\n\n        // checked: examples are in sorted order.\n        // pos and neg are correct\n\n        List<PrimitivePair> curve = new ArrayList<PrimitivePair>();\n        double tp = 0;\n        double fp = 0;\n\n        for (int[] counts : examples.values()) {\n            
curve.add(new PrimitivePair(fp / neg, tp / pos));\n\n            tp += counts[0];\n            fp += counts[1];\n        }\n        curve.add(new PrimitivePair(fp / neg, tp / pos));\n\n        double[][] out = new double[curve.size()][2];\n\n        for (int i = 0; i < curve.size(); i++) {\n            out[i][0] = curve.get(i).second; // tpr\n            out[i][1] = curve.get(i).first; // fpr\n        }\n        return out;\n\n    }\n\n    /**\n     * Brier score.\n     * \n     * @return the double\n     */\n    public double brierScore() {\n        double score = 0;\n        double total = 0;\n\n        for (Map.Entry<Double, int[]> entry : examples.entrySet()) {\n\n            Double pred = entry.getKey();\n            int[] counts = entry.getValue();\n\n            score += counts[0] * (1 - pred) * (1 - pred);\n            score += counts[1] * (0 - pred) * (0 - pred);\n            total += counts[0] + counts[1];\n        }\n\n        return score / total;\n    }\n\n    /**\n     * bins the predictions. looks at the average label compared to the median\n     * prediction for each bin. computes the brier score based on this\n     * \n     * @param bins\n     *            the bins\n     * @return the double\n     */\n    public double averagedBrierScore(int bins) {\n        double score = 0;\n        double predBins = Math.min(bins, pos + neg);\n\n        double binWidth = 1. 
/ predBins;\n        double bottom = 0.;\n        double top = bottom + binWidth;\n\n        for (int i = 0; i < predBins; i++) {\n            double num = 0;\n            double avgLabel = 0;\n\n            NavigableMap<Double, int[]> subMap = examples.subMap(bottom, true,\n                    top, true);\n            for (int[] labels : subMap.values()) {\n                avgLabel += labels[0];\n                num += labels[0] + labels[1];\n            }\n\n            double medianscore = (bottom + top) / 2.;\n\n            if (num > 0) {\n                avgLabel /= num;\n                score += (medianscore - avgLabel) * (medianscore - avgLabel);\n            }\n\n            top += binWidth;\n            bottom += binWidth;\n        }\n        return score / predBins;\n    }\n\n    /**\n     * Binary auc.\n     * \n     * @return the double\n     */\n    public double binaryAUC() {\n        double[][] ROC = ROC();\n        double area = 0.0;\n        for (int i = 1; i < ROC.length; i++) {\n            area += trapezoidArea(ROC[i - 1][1], ROC[i][1], ROC[i - 1][0],\n                    ROC[i][0]);\n        }\n        area += trapezoidArea(1, ROC[ROC.length - 1][1], 1,\n                ROC[ROC.length - 1][0]);\n        return area;\n    }\n\n    /**\n     * Trapezoid area.\n     * \n     * @param x1\n     *            the x1\n     * @param x2\n     *            the x2\n     * @param y1\n     *            the y1\n     * @param y2\n     *            the y2\n     * @return the double\n     */\n    private double trapezoidArea(double x1, double x2, double y1, double y2) {\n        double base = Math.abs(x1 - x2);\n        double avgHeight = (y1 + y2) / 2.0;\n        return base * avgHeight;\n    }\n\n    /*\n     * (non-Javadoc)\n     * \n     * @see java.lang.Object#toString()\n     */\n    @Override\n    public String toString() {\n        StringBuffer buff = new StringBuffer();\n        double[][] out = this.ROC();\n        for (int i = 0; i < out.length; i++) 
{\n            buff.append(out[i][0] + \"\\t\" + out[i][1] + \"\\n\");\n        }\n        return buff.toString();\n    }\n\n    /**\n     * Computes the area under the ROC curve for a collection of examples.\n     * \n     * @param labelsAndPredictions\n     *            pairs in (label, prediction) order\n     * @return the AUC, as computed by {@link #binaryAUC()}\n     */\n    public static double computeAUC(\n            Collection<PrimitivePair> labelsAndPredictions) {\n        ReceiverOperatingCharacteristic roc = new ReceiverOperatingCharacteristic();\n        for (PrimitivePair p : labelsAndPredictions)\n            roc.add(p.first, p.second);\n        return roc.binaryAUC();\n    }\n\n    /**\n     * Computes the binned Brier score for a collection of examples, using\n     * {@link #averagedBrierScore(int)} with 25 bins.\n     * \n     * @param labelsAndPredictions\n     *            pairs in (label, prediction) order\n     * @return the averaged Brier score\n     */\n    public static double computeBrierScore(\n            Collection<PrimitivePair> labelsAndPredictions) {\n        ReceiverOperatingCharacteristic roc = new ReceiverOperatingCharacteristic();\n        for (PrimitivePair p : labelsAndPredictions)\n            roc.add(p.first, p.second);\n        return roc.averagedBrierScore(25);\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/evaluation/RegressionModelEvaluation.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport java.io.Serializable;\nimport java.util.HashMap;\n\nimport com.etsy.conjecture.PrimitivePair;\nimport com.etsy.conjecture.data.RealValuedLabel;\n\n/**\n * a basic container for evaluations TODO: add getters for individual metrics\n */\npublic class RegressionModelEvaluation implements\n        ModelEvaluation<RealValuedLabel>, Serializable {\n\n    private static final long serialVersionUID = 1L;\n    private double MSE = 0, MAE = 0, examples = 0;\n\n    public void add(RealValuedLabel real, RealValuedLabel pred) {\n        add(real.getValue(), pred.getValue());\n    }\n\n    public void merge(ModelEvaluation<RealValuedLabel> other) {\n        RegressionModelEvaluation tempOther = (RegressionModelEvaluation) other;\n        MSE += tempOther.MSE;\n        MAE += tempOther.MAE;\n        examples += tempOther.examples;\n    }\n\n    public void add(double label, double prediction) {\n        double difference = Math.abs(label - prediction);\n        MSE += difference * difference;\n        MAE += difference;\n        examples++;\n    }\n\n    public void add(PrimitivePair labelPrediction) {\n        add(labelPrediction.getFirst(), labelPrediction.getSecond());\n    }\n\n    public double computeMeanSquaredError() {\n        return examples > 0 ? MSE / examples : 0;\n    }\n\n    public double computeMeanAbsoluteError() {\n        return examples > 0 ? 
MAE / examples : 0;\n    }\n\n    public HashMap<String, Double> getStatistics() {\n        HashMap<String, Double> m = new HashMap<String, Double>();\n        m.put(\"MSE\", computeMeanSquaredError());\n        m.put(\"MAE\", computeMeanAbsoluteError());\n        return m;\n    }\n\n    @Override\n    public String toString() {\n        StringBuilder buff = new StringBuilder();\n        buff.append(\"MSE: \" + computeMeanSquaredError() + \"\\n\");\n        buff.append(\"MAE: \" + computeMeanAbsoluteError() + \"\\n\");\n        return buff.toString();\n    }\n\n    public HashMap<String, Object> getObjects() {\n        HashMap<String, Object> m = new HashMap<String, Object>();\n        return m;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/AdagradOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.*;\nimport com.etsy.conjecture.data.*;\n\nimport java.util.*;\n\n/**\n *  AdaGrad provides adaptive per-feature learning rates at each time step t.\n *  Described here: http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf\n */\npublic class AdagradOptimizer<L extends Label> extends SGDOptimizer<L> {\n\n    private StringKeyedVector unnormalizedGradients = new StringKeyedVector();\n    private StringKeyedVector summedGradients = new StringKeyedVector();\n\n    @Override\n    public StringKeyedVector getUpdate(LabeledInstance instance) {\n        StringKeyedVector gradients = model.getGradients(instance);\n        StringKeyedVector updateVec = new StringKeyedVector();\n        Iterator<Map.Entry<String, Double>> it = gradients.iterator();\n        while (it.hasNext()) {\n            Map.Entry<String,Double> pairs = (Map.Entry)it.next();\n            String feature = pairs.getKey();\n            double gradient = pairs.getValue();\n            double featureLearningRate = updateAndGetFeatureLearningRate(feature, gradient);\n            updateVec.setCoordinate(feature, gradient * -featureLearningRate);\n       }\n       return updateVec;\n    }\n\n    /**\n     *  Update adaptive feature specific learning rates\n     */\n    public double updateAndGetFeatureLearningRate(String feature, double gradient) {\n        double gradUpdate = 0.0;\n        if (summedGradients.containsKey(feature)) {\n            gradUpdate = gradient * gradient;\n        } else {\n            /**\n             *  Unmentioned in the literature, but initializing\n             *  the squared gradient at 1.0 rather than 0.0\n             *  helps avoid oscillation.\n             */\n            gradUpdate = 1d+(gradient * gradient);\n        }\n        summedGradients.addToCoordinate(feature, gradUpdate);\n        unnormalizedGradients.addToCoordinate(feature, gradient);\n        return getFeatureLearningRate(feature);\n    
}\n\n    public double getFeatureLearningRate(String feature) {\n        return initialLearningRate/Math.sqrt(summedGradients.getCoordinate(feature));\n    }\n\n    /**\n     *  Overrides the lazy l1 and l2 regularization in the base class\n     *  to do adagrad with l1 regularization.\n     *\n     *  Lazily calculates and applies the update that minimizes the l1\n     *  regularized objective. See \"Adding l1 regularization\" in\n     *  http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf\n     */\n    @Override\n    public double lazyUpdate(String feature, double param, long start, long end) {\n        if (Utilities.floatingPointEquals(laplace, 0.0d)) {\n            return param;\n        }\n        for (long iter = start + 1; iter <= end; iter++) {\n            if (Utilities.floatingPointEquals(param, 0.0d)) {\n                return 0.0d;\n            }\n            if (laplace > 0.0) {\n                return adagradL1(feature, param, iter);\n            }\n        }\n        return param;\n    }\n\n    public double adagradL1(String feature, double param, long iter) {\n        double eta = (initialLearningRate*iter)/Math.sqrt(summedGradients.getCoordinate(feature));\n        double u = unnormalizedGradients.getCoordinate(feature);\n        double normalizedGradient = u/iter;\n        if (Math.abs(normalizedGradient) <= laplace) {\n            param = 0.0;\n        } else {\n            param = -(Math.signum(u) * eta * (normalizedGradient - laplace));\n        }\n        return param;\n    }\n\n    @Override\n    public void teardown() {\n        summedGradients = new StringKeyedVector();\n        unnormalizedGradients = new StringKeyedVector();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/ClusteringModel.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.ClusterLabel;\nimport com.etsy.conjecture.data.MulticlassPrediction;\nimport com.etsy.conjecture.data.StringKeyedVector;\nimport com.etsy.conjecture.data.LazyVector;\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.Label;\nimport com.etsy.conjecture.data.LabeledInstance;\n\n\nimport java.io.Serializable;\nimport java.util.Collection;\nimport java.util.Iterator;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.HashMap;\n\nimport static com.google.common.base.Preconditions.checkArgument;\n\npublic abstract class ClusteringModel<ClusterLabel extends Label> implements UpdateableModel<ClusterLabel, ClusteringModel<ClusterLabel>>, Serializable {\n\n  static final long serialVersionUID = 666L;\n  protected double projectionErrorTolerance = 0.01;\n  protected double projectionBallRadius = 1.0;\n  protected int numClusters = 100;\n\n  protected Map<String, StringKeyedVector> param = new HashMap<String, StringKeyedVector>();\n\n  public void update(LabeledInstance<ClusterLabel> instance) {\n    update(instance.getVector());\n  }\n\n  public void update(Collection<LabeledInstance<ClusterLabel>> instances) {\n    for(LabeledInstance<ClusterLabel> instance : instances) {\n      update(instance.getVector());\n    }\n  }\n\n  public abstract void update(StringKeyedVector instance);\n\n  public abstract ClusterLabel predict(StringKeyedVector instance);\n\n  protected ClusteringModel() {\n    Map<String, StringKeyedVector> init_param = new HashMap<String, StringKeyedVector>();\n    for (int i = 0; i < numClusters; i++) {\n      init_param.put(Integer.toString(i), new StringKeyedVector());\n    }\n    this.param = init_param;\n  }\n\n  protected ClusteringModel(HashMap<String, StringKeyedVector> param) {\n    Map<String, StringKeyedVector> init_param = new HashMap<String, StringKeyedVector>();\n    Iterator it 
= param.entrySet().iterator();\n    while (it.hasNext()) {\n      Map.Entry<String,StringKeyedVector> pairs = (Map.Entry)it.next();\n      init_param.put(pairs.getKey(), pairs.getValue());\n      it.remove();\n    }\n    this.param = init_param;\n  }\n\n\n  public void setFreezeFeatureSet(boolean freeze) {\n    for (Map.Entry<String, StringKeyedVector> e : param.entrySet()) {\n      e.getValue().setFreezeKeySet(freeze);\n    }\n  }\n\n  public void reScale(double scale) {\n    for(String cat : param.keySet()) {\n      param.get(cat).mul(scale);\n    }\n  }\n\n  public void merge(ClusteringModel<ClusterLabel> model, double scale) {\n    for(String cat : param.keySet()) {\n      param.get(cat).addScaled(model.param.get(cat), scale);\n    }\n  }\n\n  public ClusteringModel<ClusterLabel> setNumClusters(int k) {\n    checkArgument(k >= 0, \"number of clusters must be non-negative, given: %s\", k);\n    this.numClusters = k;\n    return this;\n  }\n\n  public ClusteringModel<ClusterLabel> setL1ProjectionErrorTolerance(double e) {\n    checkArgument(e >= 0, \"error tolerance must be non-negative, given: %s\", e);\n    this.projectionErrorTolerance = e;\n    return this;\n  }\n\n  public ClusteringModel<ClusterLabel> setL1ProjectionBallRadius(double r) {\n    checkArgument(r >= 0, \"radius must be non-negative, given: %s\", r);\n    this.projectionBallRadius = r;\n    return this;\n  }\n\n  public Iterator<Map.Entry<String, Double>> decompose() {\n    throw new UnsupportedOperationException(\"not done yet\");\n  }\n\n  public void setParameter(String name, double value){\n    throw new UnsupportedOperationException(\"not done yet\");\n  }\n\n  public long getEpoch() {\n    return 0;\n  }\n\n  public void setEpoch(long epoch) {\n    // this class doesn't care about epoch.\n  }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/ControlOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.*;\n\nimport java.util.*;\n\n/**\n *  Current search ads control. Remove after current exp.\n */\npublic class ControlOptimizer<L extends Label> extends SGDOptimizer<L> {\n\n    private StringKeyedVector summedGradients = new StringKeyedVector();\n\n    @Override\n    public StringKeyedVector getUpdate(LabeledInstance instance) {\n        StringKeyedVector gradients = model.getGradients(instance);\n        StringKeyedVector updateVec = new StringKeyedVector();\n        Iterator<Map.Entry<String, Double>> it = gradients.iterator();\n        while (it.hasNext()) {\n            Map.Entry<String,Double> pairs = (Map.Entry)it.next();\n            String feature = pairs.getKey();\n            double gradient = pairs.getValue();\n            double featureLearningRate = updateAndGetFeatureLearningRate(feature, gradient);\n            updateVec.setCoordinate(feature, gradient * -featureLearningRate);\n       }\n       return updateVec;\n    }\n\n    /**\n     *  Update adaptive feature specific learning rates\n     */\n    public double updateAndGetFeatureLearningRate(String feature, double gradient) {\n        double gradUpdate = 0.0;\n        if (summedGradients.containsKey(feature)) {\n            gradUpdate = gradient * gradient;\n        } else {\n            /**\n             *  Unmentioned in the literature, but initializing\n             *  the squared gradient at 1.0 rather than 0.0\n             *  helps avoid oscillation.\n             */\n            gradUpdate = 1d+(gradient * gradient);\n        }\n        summedGradients.addToCoordinate(feature, gradUpdate);\n        return getFeatureLearningRate(feature);\n    }\n\n    public double getFeatureLearningRate(String feature) {\n        return initialLearningRate/Math.sqrt(summedGradients.getCoordinate(feature));\n    }\n\n    @Override\n    public void teardown() {\n        summedGradients = new StringKeyedVector();\n    
}\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/Decomposable.java",
    "content": "package com.etsy.conjecture.model;\n\nimport java.util.Iterator;\nimport java.util.Map;\n\n/**\n * Type of model to be used with the LargeModelTrainer.\n */\npublic interface Decomposable {\n\n    /**\n     * Present the model internals to be summed across submodels.\n     */\n    public Iterator<Map.Entry<String, Double>> decompose();\n\n    /**\n     * After rebuilding a blank model, fill in the parameters.\n     */\n    public void setParameter(String name, double value);\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/ElasticNetOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.*;\n\npublic class ElasticNetOptimizer<L extends Label> extends SGDOptimizer<L> implements LazyVector.UpdateFunction {\n\n    @Override\n    public StringKeyedVector getUpdate(LabeledInstance instance) {\n        StringKeyedVector gradients = model.getGradients(instance);\n        double learningRate = getDecreasingLearningRate(model.epoch);\n        gradients.mul(-learningRate);\n        return gradients;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/FTRLOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.LazyVector;\nimport com.etsy.conjecture.data.StringKeyedVector;\nimport static com.google.common.base.Preconditions.checkArgument;\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.Label;\nimport java.util.Map;\nimport java.util.Iterator;\n\n/**\n *  Implements  FTRL-Proximal online learning as described\n *  here: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41159.pdf\n */\npublic class FTRLOptimizer<L extends Label> extends SGDOptimizer<L> {\n\n    private double alpha;\n    private double beta;\n    private StringKeyedVector z = new StringKeyedVector();\n    private StringKeyedVector n = new StringKeyedVector();\n\n    @Override\n    public StringKeyedVector getUpdate(LabeledInstance<L> instance) {\n        FTRLRegularization(instance);\n        StringKeyedVector gradients = model.getGradients(instance);\n        Iterator<Map.Entry<String, Double>> it = gradients.iterator();\n        while (it.hasNext()) {\n            Map.Entry<String,Double> pairs = (Map.Entry)it.next();\n            String feature = pairs.getKey();\n            double gradient = pairs.getValue();\n            double eta = getFeatureLearningRate(feature, gradient);\n            double z_i = 0.0; // if first round, set z_i to 0.0\n            if (z.containsKey(feature)) {\n                z_i = z.getCoordinate(feature);\n            }\n            double update = (z_i + gradient) - eta * model.param.getCoordinate(feature);\n            z.setCoordinate(feature, update);\n            double n_i = 0.0; // if first round, set n_i to 0.0\n            if (n.containsKey(feature)) {\n                n_i = n.getCoordinate(feature);\n            }\n            n.setCoordinate(feature, n_i + gradient * gradient);\n       }\n       return new StringKeyedVector(); // Model updates happen in the FTRLRegularization 
step\n    }\n\n    public double getFeatureLearningRate(String feature, double gradient) {\n        double n_i = 0.0;\n        if (n.containsKey(feature)) {\n            n_i = n.getCoordinate(feature);\n        }\n        return 1d/alpha * (Math.sqrt(n_i + gradient * gradient) - Math.sqrt(n_i));\n    }\n\n\n    public void FTRLRegularization(LabeledInstance<L> instance) {\n        Iterator<Map.Entry<String,Double>> it = instance.getVector().iterator();\n        while (it.hasNext()) {\n            Map.Entry<String,Double> pairs = (Map.Entry)it.next();\n            String feature = pairs.getKey();\n            Double value = pairs.getValue();\n            double regularizedWeight = getRegularizedWeight(feature);\n            model.param.setCoordinate(feature, regularizedWeight);\n       }\n    }\n\n    /**\n     *  If z doesn't contain the key, it's initialized at 0.0\n     *  and therefore less than laplace which is always >= 0.0\n     */\n    public double getRegularizedWeight(String feature) {\n        if (z.containsKey(feature)){\n            double z_i = z.getCoordinate(feature);\n            if (Math.abs(z_i) <= laplace) {\n                return 0.0d;\n            } else {\n                double n_i = n.getCoordinate(feature);\n                double w_i = -1.0/(((beta + Math.sqrt(n_i))/alpha) + gaussian) * (z_i - Math.signum(z_i) * laplace);\n                return w_i;\n            }\n        } else {\n            return 0.0;\n        }\n    }\n\n    /**\n     *  Since we can do sparse regularization updates, lazyUpdate\n     *  does nothing and just returns the feature param.\n     */\n    @Override\n    public double lazyUpdate(String feature, double param, long start, long end) {\n        return param;\n    }\n\n    public FTRLOptimizer<L> setAlpha(double alpha) {\n        checkArgument(alpha > 0, \"alpha must be greater than 0. 
Given: %s\", alpha);\n        this.alpha = alpha;\n        return this;\n    }\n\n    public FTRLOptimizer<L> setBeta(double beta) {\n        checkArgument(beta > 0, \"beta must be greater than 0. Given: %s\", beta);\n        this.beta = beta;\n        return this;\n    }\n\n    @Override\n    public void teardown() {\n        z = new StringKeyedVector();\n        n = new StringKeyedVector();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/Hinge.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.BinaryLabel;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\n\n/**\n *  Hinge loss for binary classification tasks with y in {-1,1}.\n *  When threshold=1.0, one gets the loss used by SVM.\n *  When threshold=0.0, one gets the loss used by the Perceptron.\n */\npublic class Hinge extends UpdateableLinearModel<BinaryLabel> {\n\n    private static final long serialVersionUID = 1L;\n    private double threshold = 0.0;\n\n    public Hinge(SGDOptimizer optimizer) {\n        super(optimizer);\n    }\n\n    public Hinge(StringKeyedVector param, SGDOptimizer optimizer) {\n        super(param, optimizer);\n    }\n\n    @Override\n    public BinaryLabel predict(StringKeyedVector instance) {\n        double inner = param.dot(instance);\n        return new BinaryLabel(Utilities.logistic(inner));\n    }\n\n    @Override\n    public double loss(LabeledInstance<BinaryLabel> instance) {\n        double inner = param.dot(instance.getVector());\n        double label = instance.getLabel().getAsPlusMinus();\n        double z = inner * label;\n        if (z <= this.threshold) {\n            return this.threshold - z;\n        } else {\n            return 0.0;\n        }\n    }\n\n    @Override\n    public StringKeyedVector getGradients(LabeledInstance<BinaryLabel> instance) {\n        StringKeyedVector gradients = instance.getVector().copy();\n        double inner = param.dot(instance.getVector());\n        double label = instance.getLabel().getAsPlusMinus();\n        double z = inner * label;\n        if (z <= this.threshold) {\n            gradients.mul(-label);\n            return gradients;\n        } else {\n            return new StringKeyedVector();\n        }        \n    }\n\n    @Override\n    protected String getModelType() {\n        return \"hinge\";\n    }\n\n    public Hinge 
setThreshold(double threshold) {\n        this.threshold = threshold;\n        return this;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/KMeans.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.ClusterLabel;\nimport com.etsy.conjecture.data.StringKeyedVector;\nimport com.etsy.conjecture.data.ClusterPrediction;\nimport com.etsy.conjecture.Utilities;\n\nimport java.util.Collection;\nimport java.util.Iterator;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.HashMap;\nimport com.google.common.collect.Maps;\n\n/**\n *  Implements sparse, streaming kmeans as described here:\n *  http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf\n */ \npublic class KMeans extends ClusteringModel<ClusterLabel> {\n  \n  private static final long serialVersionUID = 1L;\n  private Map<String, Double> clusterCounts = new HashMap<String, Double>();\n\n  public KMeans(String[] categories) {\n    for(String s : categories) {\n        param.put(s, new StringKeyedVector());\n        clusterCounts.put(s, 0.0);\n    }\n  }\n\n  private Map<String, StringKeyedVector> predefinedCenters;\n\n  public KMeans(Map<String, StringKeyedVector> centers) {\n    this.predefinedCenters = Maps.newHashMap(centers);\n    for(String key : predefinedCenters.keySet()) {\n      param.put(key, predefinedCenters.get(key));\n      clusterCounts.put(key, 0.0);\n    }\n  }\n\n  public ClusterPrediction predict(StringKeyedVector instance) {\n    Map<String, Double> scores = new HashMap<String, Double>();\n    for(Map.Entry<String, StringKeyedVector> e : param.entrySet()) {\n      scores.put(e.getKey(), e.getValue().dot(instance));\n    }\n    return new ClusterPrediction(scores);\n  }\n\n  public void update(StringKeyedVector instance) {\n    // Get closest center to instance\n    String closest_center = predict(instance).getLabel();\n    // Update the per center count\n    Double current_count = clusterCounts.get(closest_center);\n    clusterCounts.put(closest_center, current_count+1.0);\n    // Get per center learning rate\n    Double learning_rate = 
1.0/clusterCounts.get(closest_center);\n    // take gradient step: center <- (1 - rate) * center + rate * instance\n    StringKeyedVector center = param.get(closest_center);\n    center.mul(1-learning_rate);\n    // scale a copy so the caller's instance is not modified\n    StringKeyedVector scaled = instance.copy();\n    scaled.mul(learning_rate);\n    center.add(scaled);\n    l1Projection(center);\n    param.put(closest_center, center);\n  }\n\n  /**\n   *  L1 norm the center would have after soft-thresholding every coordinate by theta.\n   */\n  public Double getCurrent(StringKeyedVector center, Double theta) {\n    Double current = 0.0;\n    for (double v : center.values()) {\n      current += Math.max(0, Math.abs(v)-theta);\n    }\n    return current;\n  }\n\n  /*\n   *  Use bisection to find an approximate value of theta\n   */\n  public Double findTheta(StringKeyedVector center, Double norm) {\n    Double upper = center.max();\n    Double lower = 0.0;\n    Double current = norm;\n    Double theta = 0.0;\n    while (current > projectionBallRadius * (1 + projectionErrorTolerance)) {\n      theta = (upper + lower)/2.0;\n      current = getCurrent(center, theta);\n      if (current <= projectionBallRadius) {\n        upper = theta;\n      } else {\n        lower = theta;\n      }\n    }\n    return theta;\n  }\n\n  /**\n   *  Soft-threshold every coordinate of the center by theta.\n   */\n  public void doProjection(StringKeyedVector center, Double theta) {\n    Iterator<Map.Entry<String, Double>> it = center.iterator();\n    while (it.hasNext()) {\n      Map.Entry<String, Double> pairs = it.next();\n      String key = pairs.getKey();\n      double value = pairs.getValue();\n      double projectedValue = Math.signum(value) * Math.max(0.0, Math.abs(value) - theta);\n      center.setCoordinate(key, projectedValue);\n    }\n  }\n\n  /**\n   * An epsilon-accurate projection to the L1 ball, described here:\n   * http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf\n   */\n  public void l1Projection(StringKeyedVector center) {\n    Double norm = center.LPNorm(1.0);\n    if (norm <= projectionBallRadius + projectionErrorTolerance) {\n      return;\n    } else {\n      Double theta = findTheta(center, norm);\n      doProjection(center, theta);\n    }\n  }\n}"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/LeastSquaresRegressionModel.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.RealValuedLabel;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\n/**\n *  Linear regression trained with the squared-error loss\n *  0.5 * (hypothesis - label)^2.\n */\npublic class LeastSquaresRegressionModel extends\n        UpdateableLinearModel<RealValuedLabel> {\n\n    private static final long serialVersionUID = 1L;\n\n    public LeastSquaresRegressionModel(SGDOptimizer optimizer) {\n        super(optimizer);\n    }\n\n    public LeastSquaresRegressionModel(StringKeyedVector param, SGDOptimizer optimizer) {\n        super(param, optimizer);\n    }\n\n    @Override\n    public RealValuedLabel predict(StringKeyedVector instance) {\n        return new RealValuedLabel(param.dot(instance));\n    }\n\n    @Override\n    public double loss(LabeledInstance<RealValuedLabel> instance) {\n        double label = instance.getLabel().getValue();\n        double hypothesis = param.dot(instance.getVector());\n        return 0.5 * (hypothesis - label) * (hypothesis - label);\n    }\n\n    @Override\n    public StringKeyedVector getGradients(LabeledInstance<RealValuedLabel> instance) {\n        StringKeyedVector gradients = instance.getVector().copy();\n        double hypothesis = param.dot(instance.getVector());\n        double label = instance.getLabel().getValue();\n        // gradient of 0.5 * (hypothesis - label)^2 w.r.t. the weights is (hypothesis - label) * x\n        gradients.mul(hypothesis - label);\n        return gradients;\n    }\n\n    @Override\n    protected String getModelType() {\n        return \"least_squares_regression\";\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/LogisticRegression.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.BinaryLabel;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\n/**\n *  Logistic regression loss for binary classification with y in {-1, 1}.\n */\npublic class LogisticRegression extends UpdateableLinearModel<BinaryLabel> {\n\n    private static final long serialVersionUID = 1L;\n\n    public LogisticRegression(SGDOptimizer optimizer) {\n        super(optimizer);\n    }\n\n    public LogisticRegression(StringKeyedVector param, SGDOptimizer optimizer) {\n        super(param, optimizer);\n    }\n\n    @Override\n    public BinaryLabel predict(StringKeyedVector instance) {\n        return new BinaryLabel(Utilities.logistic(instance.dot(param)));\n    }\n\n    @Override\n    public double loss(LabeledInstance<BinaryLabel> instance) {\n        double inner = instance.getVector().dot(param);\n        double label = instance.getLabel().getAsPlusMinus();\n        return Math.log(1.0 + Math.exp(-label * inner));\n    }\n\n    @Override\n    public StringKeyedVector getGradients(LabeledInstance<BinaryLabel> instance) {\n        StringKeyedVector gradients = instance.getVector().copy();\n        double label = instance.getLabel().getAsPlusMinus();\n        double inner = instance.getVector().dot(param);\n        double gradient = -label / (Math.exp(label * inner) + 1.0);\n        gradients.mul(gradient);\n        return gradients;\n    }\n\n    protected String getModelType() {\n        return \"logistic_regression\";\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/MIRA.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.BinaryLabel;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\n/**\n *  Margin Infused Relaxed Algorithm (MIRA) for binary classification\n *  with y in {-1, 1}.\n */\npublic class MIRA extends UpdateableLinearModel<BinaryLabel> {\n\n    private static final long serialVersionUID = 1L;\n\n    public MIRA() {\n        super(new MIRAOptimizer());\n    }\n\n    public MIRA(StringKeyedVector param, SGDOptimizer optimizer) {\n        super(param, optimizer);\n    }\n\n    @Override\n    public double loss(LabeledInstance<BinaryLabel> instance) {\n        double label = instance.getLabel().getAsPlusMinus();\n        double prediction = param.dot(instance.getVector());\n        double loss = Math.max(0, 1d - label * prediction);\n        return loss;\n    }\n\n    @Override\n    public BinaryLabel predict(StringKeyedVector instance) {\n        double inner = param.dot(instance);\n        return new BinaryLabel(Utilities.logistic(inner));\n    }\n\n    /**\n     *  Returns the complete MIRA step tau * label * x, which is added to the\n     *  parameters as-is, rather than a raw loss gradient.\n     */\n    @Override\n    public StringKeyedVector getGradients(LabeledInstance<BinaryLabel> instance) {\n        StringKeyedVector gradients = instance.getVector().copy();\n        double label = instance.getLabel().getAsPlusMinus();\n        double prediction = param.dot(instance.getVector());\n        double loss = Math.max(0, 1d - label * prediction);\n        if (loss > 0) {\n            double norm = instance.getVector().LPNorm(2d);\n            double tau = loss / (norm * norm);\n            gradients.mul(tau * label);\n            return gradients;\n        } else {\n            return new StringKeyedVector();\n        }\n    }\n\n    @Override\n    protected String getModelType() {\n        return \"MIRA\";\n    }\n\n}"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/MIRAOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.*;\n\n/**\n *  MIRA computes the full update, step size included, in its own\n *  getGradients, so this optimizer just passes that vector through.\n */\npublic class MIRAOptimizer<L extends Label> extends SGDOptimizer<L> {\n\n    @Override\n    public StringKeyedVector getUpdate(LabeledInstance<L> instance) {\n        return model.getGradients(instance);\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/Model.java",
    "content": "package com.etsy.conjecture.model;\n\nimport java.io.Serializable;\nimport com.etsy.conjecture.data.Label;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\npublic interface Model<L extends Label> extends Serializable {\n\n    public L predict(StringKeyedVector instance);\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/PassiveAggressiveOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.LazyVector;\nimport com.etsy.conjecture.data.StringKeyedVector;\nimport static com.google.common.base.Preconditions.checkArgument;\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.Label;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.RealValuedLabel;\n\n/**\n *  See http://eprints.pascal-network.org/archive/00002147/01/CrammerDeKeShSi06.pdf\n *  for a discussion of PA Regression.\n */\npublic class PassiveAggressiveOptimizer extends SGDOptimizer<RealValuedLabel> {\n\n    private double C;\n    private boolean isHinge;\n\n    @Override\n    public StringKeyedVector getUpdate(LabeledInstance<RealValuedLabel> instance) {\n        double norm = instance.getVector().LPNorm(2d);\n        double update = model.loss(instance) / (norm * norm + 0.5 / C);\n        if(isHinge) {\n            /**\n             *  Classification. Scale update by label in {-1, 1}.\n             */\n            update = update * (2.0 * (instance.getLabel().getValue() - 0.5));\n        } else if (instance.getLabel().getValue() - ((RealValuedLabel)model.predict(instance.getVector())).getValue() < 0.0) {\n            /** Regression **/\n            update = update * -1;\n        }\n        StringKeyedVector updateVec = instance.getVector().copy();\n        updateVec.mul(update);\n        return updateVec;\n    }\n\n    public PassiveAggressiveOptimizer setC(double C) {\n        checkArgument(C > 0, \"C must be greater than 0. Given: %s\", C);\n        this.C = C;\n        return this;\n    }\n\n    public PassiveAggressiveOptimizer isHinge(boolean isHinge) {\n        this.isHinge = isHinge;\n        return this;\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/SGDOptimizer.java",
    "content": "package com.etsy.conjecture.model;\n\nimport com.etsy.conjecture.data.LazyVector;\nimport com.etsy.conjecture.Utilities;\nimport static com.google.common.base.Preconditions.checkArgument;\nimport com.etsy.conjecture.data.Label;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.StringKeyedVector;\nimport java.util.Collection;\n\n/**\n *  Builds the weight updates as a function\n *  of learning rate and regularization schedule for SGD learning.\n *\n *  Default learning rate and regularization are:\n *  LR: initialLearningRate divided by the epoch number; exponential\n *      decay is used only when useExponentialLearningRate is set\n *  REG: Lazily applied L1 and L2 regularization\n *  Subclasses override the LR and REG functions as necessary\n */\npublic abstract class SGDOptimizer<L extends Label> implements LazyVector.UpdateFunction {\n\n    private static final long serialVersionUID = 9153480933266800474L;\n    double laplace = 0.0;\n    double gaussian = 0.0;\n    double initialLearningRate = 0.01;\n    transient UpdateableLinearModel model;\n\n    double examplesPerEpoch = 10000;\n    boolean useExponentialLearningRate = false;\n    double exponentialLearningRateBase = 0.99;\n\n    public SGDOptimizer() {}\n\n    public SGDOptimizer(double g, double l) {\n        gaussian = g;\n        laplace = l;\n    }\n\n    /**\n     *  Do minibatch gradient descent\n     */\n    public StringKeyedVector getUpdates(Collection<LabeledInstance<L>> minibatch) {\n        StringKeyedVector updateVec = new StringKeyedVector();\n        for (LabeledInstance<L> instance : minibatch) {\n            updateVec.add(getUpdate(instance)); // accumulate the per-instance updates\n            model.truncate(instance);\n            model.epoch++;\n        }\n        updateVec.mul(1.0/minibatch.size()); // average the accumulated updates so the\n                                           // caller applies a single step per minibatch\n        return updateVec;\n    }\n\n    /**\n     *  Get the update to the param vector using a chosen\n     *  learning rate / 
regularization schedule.\n     *  Returns a StringKeyedVector of updates for each \n     *  parameter.\n     */\n    public abstract StringKeyedVector getUpdate(LabeledInstance<L> instance);\n\n    public void teardown() {\n\n    }\n\n    /**\n     *  Implements lazy updating of regularization when the regularization\n     *  updates aren't sparse (e.g. elastic net l1 and l2, adagrad l1).\n     *  \n     *  When regularization can be done on just the non-zero elements of\n     *  the sample instance (e.g. FTRL proximal, HandsFree), the lazyUpdate\n     *  function does nothing (i.e. just returns the unscaled param).\n     */\n    public double lazyUpdate(String feature, double param, long start, long end) {\n        if (Utilities.floatingPointEquals(laplace, 0.0d)\n            && Utilities.floatingPointEquals(gaussian, 0.0d)) {\n            return param;\n        }\n        for (long iter = start + 1; iter <= end; iter++) {\n            if (Utilities.floatingPointEquals(param, 0.0d)) {\n                return 0.0d;\n            }\n            double eta = getDecreasingLearningRate(iter);\n            /**\n             * TODO: patch so that param cannot cross 0.0 during gaussian update\n             */\n            param -= eta * gaussian * param;\n            if (param > 0.0) {\n                param = Math.max(0.0, param - eta * laplace);\n            } else {\n                param = Math.min(0.0, param + eta * laplace);\n            }\n        }\n        return param;\n    }\n\n    /**\n     *  Computes a linearly or exponentially decreasing\n     *  learning rate as a function of the current epoch.\n     *  Even when we have per feature learning rates, it's \n     *  necessary to keep track of a decreasing learning rate \n     *  for things like truncation.\n     */\n    public double getDecreasingLearningRate(long t){\n        double epoch_fudged = Math.max(1.0, (t + 1) / examplesPerEpoch);\n        if (useExponentialLearningRate) {\n            return 
Math.max(\n                0d,\n                this.initialLearningRate\n                * Math.pow(this.exponentialLearningRateBase,\n                           epoch_fudged));\n        } else {\n            return Math.max(0d, this.initialLearningRate / epoch_fudged);\n        }\n    }\n\n    public SGDOptimizer<L> setInitialLearningRate(double rate) {\n        checkArgument(rate > 0, \"Initial learning rate must be greater than 0. Given: %s\", rate);\n        this.initialLearningRate = rate;\n        return this;\n    }\n\n    // Note: Guava's checkArgument only substitutes %s placeholders.\n    public SGDOptimizer<L> setExamplesPerEpoch(double examples) {\n        checkArgument(examples > 0,\n                \"examples per epoch must be positive, given: %s\", examples);\n        this.examplesPerEpoch = examples;\n        return this;\n    }\n\n    public SGDOptimizer<L> setUseExponentialLearningRate(boolean useExponentialLearningRate) {\n        this.useExponentialLearningRate = useExponentialLearningRate;\n        return this;\n    }\n\n    public SGDOptimizer<L> setExponentialLearningRateBase(double base) {\n        checkArgument(base > 0,\n                \"exponential learning rate base must be positive, given: %s\",\n                base);\n        checkArgument(\n                base <= 1.0,\n                \"exponential learning rate base must be at most 1.0, given: %s\",\n                base);\n        this.exponentialLearningRateBase = base;\n        return this;\n    }\n\n    public SGDOptimizer<L> setGaussianRegularizationWeight(double gaussian) {\n        checkArgument(gaussian >= 0.0,\n                \"gaussian regularization weight must be non-negative, given: %s\",\n                gaussian);\n        this.gaussian = gaussian;\n        return this;\n    }\n\n    public SGDOptimizer<L> setLaplaceRegularizationWeight(double laplace) {\n        checkArgument(laplace >= 0.0,\n                \"laplace regularization weight must be non-negative, given: %s\",\n                laplace);\n        this.laplace = 
laplace;\n        return this;\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/UpdateableLinearModel.java",
    "content": "package com.etsy.conjecture.model;\n\nimport static com.google.common.base.Preconditions.checkArgument;\nimport gnu.trove.function.TDoubleFunction;\n\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.Collection;\nimport java.util.HashMap;\nimport java.util.Iterator;\nimport java.util.Map;\n\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.Label;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.LazyVector;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\npublic abstract class UpdateableLinearModel<L extends Label> implements\n        UpdateableModel<L, UpdateableLinearModel<L>>,\n        Comparable<UpdateableLinearModel<L>>, Serializable {\n\n    private static final long serialVersionUID = 8549108867384062857L;\n    protected LazyVector param;\n    protected final String modelType;\n\n    protected long epoch;\n\n    protected SGDOptimizer optimizer;\n\n    // parameters for gradient truncation\n    // for more info, see:\n    // http://jmlr.csail.mit.edu/papers/volume10/langford09a/langford09a.pdf\n    protected int period = 0;\n    protected double truncationUpdate = 0.1;\n    protected double truncationThreshold = 0.0;\n\n    private String argString = \"NOT SET\";\n\n    public void setArgString(String s) {\n        argString = s;\n    }\n\n    public String getArgString() {\n        return argString;\n    }\n\n    public double dotWithParam(StringKeyedVector x) {\n        return param.dot(x);\n    }\n\n    protected UpdateableLinearModel(SGDOptimizer optimizer) {\n        this.optimizer = optimizer;\n        this.param = new LazyVector(100, optimizer);\n        epoch = 0;\n        modelType = getModelType();\n    }\n\n    protected UpdateableLinearModel(StringKeyedVector param, SGDOptimizer optimizer) {\n        this.optimizer = optimizer;\n        optimizer.model = this;\n        this.param = new LazyVector(param, optimizer);\n        epoch = 0;\n        
modelType = getModelType();\n    }\n\n    /**\n     *  Get a StringKeyedVector holding the gradient of the loss w.r.t. every model parameter.\n     */\n    public abstract StringKeyedVector getGradients(LabeledInstance<L> instance);\n\n    /**\n     *  Minibatch gradient update\n     */\n    public void update(Collection<LabeledInstance<L>> instances) {\n        optimizer.model = this; // avoid serialization stackoverflow\n        if (epoch > 0) {\n            param.incrementIteration();\n        }\n        StringKeyedVector updates = optimizer.getUpdates(instances);\n        param.add(updates);\n    }\n\n    /**\n     *  Single gradient update\n     */\n    public void update(LabeledInstance<L> instance) {\n        optimizer.model = this; // avoid serialization stackoverflow\n        if (epoch > 0) {\n            param.incrementIteration();\n        }\n        StringKeyedVector update = optimizer.getUpdate(instance);\n        param.add(update);\n        truncate(instance);\n        epoch++;\n    }\n\n    public abstract L predict(StringKeyedVector instance);\n\n    public abstract double loss(LabeledInstance<L> instance);\n\n    protected abstract String getModelType();\n\n    public Iterator<Map.Entry<String, Double>> decompose() {\n        return param.iterator();\n    }\n\n    public void setParameter(String name, double value) {\n        param.setCoordinate(name, value);\n    }\n\n    public StringKeyedVector getParam() {\n        return param;\n    }\n\n    public void reScale(double scale) {\n        param.mul(scale);\n    }\n\n    public void setFreezeFeatureSet(boolean freeze) {\n        param.setFreezeKeySet(freeze);\n    }\n\n    public void merge(UpdateableLinearModel<L> model, double scaling) {\n        param.addScaled(model.param, scaling);\n        epoch += model.epoch;\n    }\n\n    public void teardown() {\n        optimizer.teardown();\n    }\n\n    /**\n     *  Decide based on period and epoch whether to truncate\n     */\n    public void 
truncate(LabeledInstance<L> instance) {\n        if (period > 0 && epoch > 0 && epoch % period == 0) {\n            applyTruncation(instance.getVector());\n        }\n    }\n\n    public void applyTruncation(StringKeyedVector instance) {\n        final double update = this.optimizer.getDecreasingLearningRate(epoch) * truncationUpdate;\n        final double threshold = truncationThreshold;\n\n        // shrink parameters whose magnitude is below the threshold toward zero\n        TDoubleFunction truncFn = new TDoubleFunction() {\n            public double execute(double parameter) {\n                if (parameter > 0 && parameter < threshold) {\n                    return Math.max(0, parameter - update);\n                } else if (parameter < 0 && parameter > -threshold) {\n                    return Math.min(0, parameter + update);\n                } else {\n                    return parameter;\n                }\n            }\n        };\n\n        param.transformValues(truncFn);\n        param.removeZeroCoordinates();\n    }\n\n    public long getEpoch() {\n        return epoch;\n    }\n\n    public void setEpoch(long e) {\n        epoch = e;\n    }\n\n    public UpdateableLinearModel<L> setTruncationPeriod(int period) {\n        checkArgument(period >= 0, \"period must be non-negative, given: %s\",\n                period);\n        this.period = period;\n        return this;\n    }\n\n    public UpdateableLinearModel<L> setTruncationThreshold(double threshold) {\n        checkArgument(threshold >= 0, \"threshold must be non-negative, given: %s\",\n                threshold);\n        this.truncationThreshold = threshold;\n        return this;\n    }\n\n    public UpdateableLinearModel<L> setTruncationUpdate(double update) {\n        checkArgument(update >= 0, \"update must be non-negative, given: %s\",\n                update);\n        this.truncationUpdate = update;\n        return this;\n    }\n\n    @Override\n    public int compareTo(UpdateableLinearModel<L> inputModel) {\n        return (int)Math.signum(inputModel.param.LPNorm(2d) 
- param.LPNorm(2d));\n    }\n\n    public void thresholdParameters(double t) {\n        for (Iterator<Map.Entry<String, Double>> it = param.iterator(); it\n                .hasNext();) {\n            if (Math.abs(it.next().getValue()) < t) {\n                it.remove();\n            }\n        }\n    }\n\n    public String explainPrediction(StringKeyedVector x) {\n        return explainPrediction(x, -1);\n    }\n\n    public String explainPrediction(StringKeyedVector x, int n) {\n        StringBuilder out = new StringBuilder();\n        Map<String, Double> weights = new HashMap<String, Double>();\n        for (String dim : x.keySet()) {\n            if (param.getCoordinate(dim) != 0.0) {\n                weights.put(\n                        dim,\n                        Math.abs(x.getCoordinate(dim)\n                                * param.getCoordinate(dim)));\n            }\n        }\n        ArrayList<String> keys = com.etsy.conjecture.Utilities\n                .orderKeysByValue(weights, true);\n        for (int i = 0; (n == -1 || i < n) && i < keys.size(); i++) {\n            String k = keys.get(i);\n            out.append(k + \":\" + String.format(\"%.2f\", x.getCoordinate(k))\n                    + \"->\" + String.format(\"%.2f\", param.getCoordinate(k))\n                    + \" \");\n        }\n        return out.toString();\n    }\n}"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/UpdateableModel.java",
    "content": "package com.etsy.conjecture.model;\n\nimport java.util.Collection;\n\nimport com.etsy.conjecture.data.Label;\nimport com.etsy.conjecture.data.LabeledInstance;\n\npublic interface UpdateableModel<L extends Label, M extends UpdateableModel<L, M>>\n        extends Model<L>, Decomposable {\n    // - update the model with a single labeled instance.\n    public void update(LabeledInstance<L> instance);\n\n    // - update the model with many labeled instances.\n    public void update(Collection<LabeledInstance<L>> instances);\n\n    // - merge two models together.\n    public void merge(M model, double weight);\n\n    // - multiply the parameter vector by a constant.\n    public void reScale(double scale);\n\n    // - set whether to add unseen-features when updating.\n    public void setFreezeFeatureSet(boolean freeze);\n\n    // - reset the epoch number after model merging.\n    public void setEpoch(long epoch);\n\n    public long getEpoch();\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/model/UpdateableMulticlassLinearModel.java",
    "content": "package com.etsy.conjecture.model;\n\nimport static com.google.common.base.Preconditions.checkArgument;\nimport gnu.trove.function.TDoubleFunction;\n\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.Collection;\nimport java.util.HashMap;\nimport java.util.Iterator;\nimport java.util.Map;\n\nimport com.etsy.conjecture.Utilities;\nimport com.etsy.conjecture.data.MulticlassLabel;\nimport com.etsy.conjecture.data.LabeledInstance;\nimport com.etsy.conjecture.data.BinaryLabeledInstance;\nimport com.etsy.conjecture.data.MulticlassLabeledInstance;\nimport com.etsy.conjecture.data.MulticlassPrediction;\nimport com.etsy.conjecture.data.LazyVector;\nimport com.etsy.conjecture.data.StringKeyedVector;\nimport com.etsy.conjecture.data.RealValuedLabel;\nimport com.etsy.conjecture.data.BinaryLabel;\n\npublic class UpdateableMulticlassLinearModel implements\n    UpdateableModel<MulticlassLabel, UpdateableMulticlassLinearModel>,\n    Comparable<UpdateableMulticlassLinearModel>, Serializable {\n\n    private static final long serialVersionUID = 8549108867384062857L;\n    protected String modelType;\n\n    private String argString = \"NOT SET\";\n\n    protected long epoch;\n\n    protected Map<String, UpdateableLinearModel<BinaryLabel>> param = new HashMap<String, UpdateableLinearModel<BinaryLabel>>();\n\n    public UpdateableMulticlassLinearModel(Map<String, UpdateableLinearModel<BinaryLabel>> param) {\n        this.param = param;\n        this.epoch = 0;\n        this.modelType = this.getModelType();\n    }\n\n    public void setArgString(String s) {\n        argString = s;\n    }\n\n    public String getArgString() {\n        return argString;\n    }\n\n    public void setModelType(String modelType) {\n        this.modelType = modelType;\n    }\n\n    public String getModelType() {\n        return modelType;\n    }\n\n    public Iterator<Map.Entry<String, Double>> decompose() {\n        throw new UnsupportedOperationException(\"not done 
yet\");\n    }\n\n    public void setParameter(String name, double value) {\n        throw new UnsupportedOperationException(\"not done yet\");\n    }\n\n    public void reScale(double scale) {\n        for (String cat : param.keySet()) {\n            param.get(cat).param.mul(scale);\n        }\n    }\n\n    public void setFreezeFeatureSet(boolean freeze) {\n        for (Map.Entry<String, UpdateableLinearModel<BinaryLabel>> e : param.entrySet()) {\n            e.getValue().param.setFreezeKeySet(freeze);\n        }\n    }\n\n    /**\n     *  Minibatch gradient update\n     */\n    public void update(Collection<LabeledInstance<MulticlassLabel>> instances) {\n        for (LabeledInstance<MulticlassLabel> instance : instances) {\n            update(instance);\n        }\n    }\n\n    /**\n     *  Single gradient update.\n     */\n    public void update(LabeledInstance<MulticlassLabel> instance) {\n        for (Map.Entry<String, UpdateableLinearModel<BinaryLabel>> e : param.entrySet()) {\n            String category = e.getKey();\n            UpdateableLinearModel<BinaryLabel> model = e.getValue();\n            double label = e.getKey().equals(instance.getLabel().getLabel()) ? 
1.0 : 0.0;\n            BinaryLabeledInstance blInstance = new BinaryLabeledInstance(label, instance.getVector());\n            model.update(blInstance);\n        }\n        epoch++;\n    }\n\n    @Override\n    public MulticlassPrediction predict(StringKeyedVector instance) {\n        Map<String, Double> scores = new HashMap<String, Double>();\n        double normalization = 0;\n\n        for (Map.Entry<String, UpdateableLinearModel<BinaryLabel>> e : param.entrySet()) {\n            double prediction = ((RealValuedLabel)e.getValue().predict(instance)).getValue();\n            scores.put(e.getKey(), prediction);\n            normalization += prediction;\n        }\n\n        for (Map.Entry<String, Double> e : scores.entrySet()) {\n            scores.put(e.getKey(), e.getValue() / normalization);\n        }\n\n        return new MulticlassPrediction(scores);\n    }\n\n    public void merge(UpdateableMulticlassLinearModel model, double scale) {\n        for (String cat : param.keySet()) {\n            param.get(cat).param.addScaled(model.param.get(cat).param, scale);\n        }\n        epoch += model.epoch;\n    }\n\n    public void teardown() {\n        for (Map.Entry<String, UpdateableLinearModel<BinaryLabel>> e : param.entrySet()) {\n            e.getValue().teardown();\n        }\n    }\n\n    public long getEpoch() {\n        return epoch;\n    }\n\n    public void setEpoch(long e) {\n        epoch = e;\n    }\n\n    // what to do here?\n    @Override\n    public int compareTo(UpdateableMulticlassLinearModel inputModel) {\n        return (int)Math.signum(inputModel.getEpoch() - getEpoch());\n    }\n\n    public void thresholdParameters(double t) {\n        for (UpdateableLinearModel<BinaryLabel> m : param.values()) {\n            for (Iterator<Map.Entry<String, Double>> it = m.param.iterator(); it\n                     .hasNext();) {\n                if (Math.abs(it.next().getValue()) < t) {\n                    it.remove();\n                }\n            }\n  
      }\n    }\n\n    public String explainPrediction(StringKeyedVector x) {\n        return explainPrediction(x, -1);\n    }\n\n    public String explainPrediction(StringKeyedVector x, int n) {\n        throw new UnsupportedOperationException(\"not done yet\");\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDADenseTopics.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\nimport java.util.Random;\n\npublic class LDADenseTopics implements LDATopics, Serializable {\n\n    private static final long serialVersionUID = 8704084406257021101L;\n    int num_topics;\n    int dict_size;\n    double[][] topic_prob;\n    LDADict dict;\n    Random rnd = new Random();\n\n    public LDADenseTopics(double[][] topic_prob) {\n        this.num_topics = topic_prob.length;\n        this.dict_size = topic_prob[0].length;\n        this.topic_prob = topic_prob;\n    }\n\n    public void setTopicProb(int topic, double[] prob) {\n        topic_prob[topic] = prob;\n    }\n\n    public void setDict(LDADict dict_) throws Exception {\n        if (dict_.size() < dict_size)\n            throw new Exception(\"trying to set the dict with size \"\n                    + dict_.size() + \" on a topic model with dict size \"\n                    + dict_size);\n        dict = dict_;\n        dict_size = dict.size();\n    }\n\n    public LDADict getDict() {\n        return dict;\n    }\n\n    public double wordProb(int word, int topic) {\n        return topic_prob[topic][word];\n    }\n\n    public int numTopics() {\n        return num_topics;\n    }\n\n    public int dictSize() {\n        return dict_size;\n    }\n\n    public String toString() {\n        StringBuilder b = new StringBuilder();\n        for (int k = 0; k < num_topics; k++) {\n            b.append(k + \": \");\n            for (int w = 0; w < dict_size; w++) {\n                if (dict == null)\n                    b.append(w + \":\"\n                            + String.format(\"%.3f, \", topic_prob[k][w]));\n                else\n                    b.append(dict.word(w) + \":\"\n                            + String.format(\"%.3f, \", topic_prob[k][w]));\n            }\n            b.append(\"\\n\");\n        }\n        return b.toString();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDADict.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.HashMap;\nimport java.util.Set;\n\npublic class LDADict implements Serializable {\n\n    private static final long serialVersionUID = 2363682000942209420L;\n    private ArrayList<String> words;\n    private HashMap<String, Integer> dict;\n\n    public LDADict(Set<String> unique_words) {\n        words = new ArrayList<String>(unique_words.size());\n        dict = new HashMap<String, Integer>();\n        for (String s : unique_words) {\n            words.add(s);\n            dict.put(s, dict.size());\n        }\n    }\n\n    public String word(int index) {\n        return words.get(index);\n    }\n\n    public int index(String word) {\n        return dict.get(word);\n    }\n\n    public int size() {\n        return words.size();\n    }\n\n    public boolean contains(String word) {\n        return dict.containsKey(word);\n    }\n\n    public String toString() {\n        return \"LDADict(size: \" + size() + \")\";\n    }\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDADoc.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport com.etsy.conjecture.Utilities;\n\nimport java.io.Serializable;\nimport java.util.ArrayList;\nimport java.util.Arrays;\nimport java.util.HashMap;\nimport java.util.Map;\n\npublic class LDADoc implements Serializable {\n\n    private static final long serialVersionUID = 1536967875771864807L;\n    double[] topic_proportions;\n    double total_words;\n    int[] word_idx;\n    double[] word_count;\n    double[][] phi;\n    boolean phi_dirty;\n\n    public LDADoc(Map<String, Double> word_counts, LDADict dict) {\n        total_words = 0.0;\n        word_idx = new int[word_counts.size()];\n        word_count = new double[word_counts.size()];\n        phi_dirty = true;\n        int i = 0;\n        for (Map.Entry<String, Double> e : word_counts.entrySet()) {\n            word_idx[i++] = dict.index(e.getKey());\n            total_words += e.getValue();\n        }\n        // Keep parallel arrays in sorted order of word index, for easier\n        // aggregation\n        // of partial topic models.\n        Arrays.sort(word_idx);\n        for (int w = 0; w < word_idx.length; w++) {\n            word_count[w] = word_counts.get(dict.word(word_idx[w]));\n        }\n    }\n\n    public double[] topicProportions() {\n        return topic_proportions;\n    }\n\n    public double wordCount() {\n        return total_words;\n    }\n\n    public void updateTopicProportions(LDATopics topics, double alpha) {\n        int K = topics.numTopics();\n        // reuse old topic proportions unless the topic model has changed.\n        if (topic_proportions == null\n                || topic_proportions.length != topics.numTopics()) {\n            topic_proportions = new double[K];\n            for (int k = 0; k < K; k++) {\n                topic_proportions[k] = total_words / (double)K;\n            }\n        }\n        if (phi == null || phi[0].length != topics.numTopics()) {\n            phi = new double[word_idx.length][K];\n    
    }\n        // iterate the update procedure.\n        double[] topic_proportions_new = new double[K];\n        double[] phi_z = new double[word_idx.length];\n        while (true) {\n            // Compute phi.\n            for (int k = 0; k < K; k++) {\n                double digamma_k = LDAUtils.digamma(topic_proportions[k]);\n                for (int w = 0; w < word_idx.length; w++) {\n                    double wp = Math.log(topics.wordProb(word_idx[w], k));\n                    phi[w][k] = digamma_k + wp;\n                    if (k == 0) {\n                        phi_z[w] = phi[w][k];\n                    } else {\n                        phi_z[w] = LDAUtils.logSumExp(phi_z[w], phi[w][k]);\n                    }\n                }\n            }\n            // Compute updated gamma.\n            double conv = 0.0;\n            for (int k = 0; k < K; k++) {\n                topic_proportions_new[k] = alpha;\n                for (int w = 0; w < word_idx.length; w++) {\n                    phi[w][k] = Math.exp(phi[w][k] - phi_z[w]) * word_count[w];\n                    topic_proportions_new[k] += phi[w][k];\n                }\n                double diff = topic_proportions[k] - topic_proportions_new[k];\n                topic_proportions[k] = topic_proportions_new[k];\n                conv += diff * diff;\n            }\n            // Check convergence.\n            if (conv < 1000.0) {\n                break;\n            }\n        }\n        phi_dirty = false;\n    }\n\n    // You can only call this after calling updateTopicProportions..\n    public LDAPartialTopics toPartialTopics() throws Exception {\n        if (phi_dirty) {\n            throw new Exception(\n                    \"Called toPartialTopics() on a doc that hasnt been updated\");\n        }\n        return new LDAPartialTopics(word_idx, phi);\n    }\n\n    public LDAPartialTopics toPartialTopic(int topic) throws Exception {\n        if (phi_dirty) {\n            throw new Exception(\n      
              \"Called toPartialTopics() on a doc that hasnt been updated\");\n        }\n        double[][] phi_k = new double[word_idx.length][1]; // duh\n        for (int i = 0; i < word_idx.length; i++) {\n            phi_k[i][0] = phi[i][topic];\n        }\n        return new LDAPartialTopics(word_idx, phi_k);\n    }\n\n    public LDAPartialSparseTopics toPartialSparseTopics(int n) throws Exception {\n        if (phi_dirty) {\n            throw new Exception(\n                    \"Called toPartialTopics() on a doc that hasnt been updated\");\n        }\n        int K = topic_proportions.length;\n        Map<Integer, Double> partial_phi = new HashMap<Integer, Double>();\n        Map<Integer, Double> word_topic_prob = new HashMap<Integer, Double>();\n        for (int w = 0; w < word_idx.length; w++) {\n            word_topic_prob.clear();\n            for (int k = 0; k < K; k++) {\n                word_topic_prob.put(k, phi[w][k]);\n            }\n            ArrayList<Integer> sorted_topics = Utilities.orderKeysByValue(\n                    word_topic_prob, true);\n            double z = 0.0;\n            for (int i = 0; i < n; i++) {\n                z += phi[w][sorted_topics.get(i)];\n            }\n            word_topic_prob.clear();\n            for (int i = 0; i < n; i++) {\n                int k = sorted_topics.get(i);\n                int v = word_idx[w];\n                partial_phi.put(v * K + k, (phi[w][k] / z) * word_count[w]);\n            }\n        }\n        return new LDAPartialSparseTopics(K, partial_phi);\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDAPartialSparseTopics.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\nimport java.util.Map;\nimport java.util.Set;\n\npublic class LDAPartialSparseTopics implements Serializable {\n\n    private static final long serialVersionUID = -5073459183590344302L;\n    private int K;\n    private Map<Integer, Double> phi;\n\n    public LDAPartialSparseTopics(int K, Map<Integer, Double> phi) {\n        this.K = K;\n        this.phi = phi;\n    }\n\n    public LDAPartialSparseTopics merge(LDAPartialSparseTopics rhs)\n            throws Exception {\n        if (K != rhs.K) {\n            throw new Exception(\n                    \"Try to merge partials with different nubmer of topics: \"\n                            + K + \" and \" + rhs.K);\n        }\n        Map<Integer, Double> a = phi.size() < rhs.phi.size() ? phi : rhs.phi;\n        Map<Integer, Double> b = phi.size() < rhs.phi.size() ? rhs.phi : phi;\n        for (Map.Entry<Integer, Double> e : a.entrySet()) {\n            if (b.containsKey(e.getKey())) {\n                b.put(e.getKey(), e.getValue() + b.get(e.getKey()));\n            } else {\n                b.put(e.getKey(), e.getValue());\n            }\n        }\n        return new LDAPartialSparseTopics(K, b);\n    }\n\n    public LDASparseTopics toTopics() {\n        // renormalize.\n        double[] z = new double[K];\n        for (Map.Entry<Integer, Double> e : phi.entrySet()) {\n            z[e.getKey() % K] += e.getValue();\n        }\n        Set<Integer> keys = phi.keySet();\n        for (int i : keys) {\n            phi.put(i, phi.get(i) / z[i % K]);\n        }\n        return new LDASparseTopics(K, phi);\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDAPartialTopics.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\n\npublic class LDAPartialTopics implements Serializable {\n\n    private static final long serialVersionUID = 3590284302630767864L;\n    private int[] word_index;\n    private double[][] phi;\n\n    public LDAPartialTopics(int[] word_index, double[][] phi) {\n        this.word_index = word_index;\n        this.phi = phi;\n    }\n\n    private int countUniqueWords(LDAPartialTopics rhs) {\n        // First determine the number of unique words in both sides.\n        int num_words = 0;\n        int lhs_idx = 0;\n        int rhs_idx = 0;\n        while (lhs_idx < word_index.length && rhs_idx < rhs.word_index.length) {\n            int lhs_word = word_index[lhs_idx];\n            int rhs_word = rhs.word_index[rhs_idx];\n            if (lhs_word <= rhs_word) {\n                lhs_idx++;\n            }\n            if (rhs_word <= lhs_word) {\n                rhs_idx++;\n            }\n            num_words++;\n        }\n        // add word for whatever pointers not reached the end\n        if (lhs_idx != word_index.length) {\n            num_words += word_index.length - lhs_idx;\n        }\n        if (rhs_idx != rhs.word_index.length) {\n            num_words += rhs.word_index.length - rhs_idx;\n        }\n        return num_words;\n    }\n\n    public LDAPartialTopics merge(LDAPartialTopics rhs) throws Exception {\n        if (phi[0].length != rhs.phi[0].length) {\n            throw new Exception(\n                    \"Try to merge partials with different nubmer of topics: \"\n                            + phi.length + \" and \" + rhs.phi.length);\n        }\n        int K = phi[0].length;\n        int num_words = countUniqueWords(rhs);\n        int[] word_idx_new = new int[num_words];\n        double[][] phi_new = new double[num_words][K];\n        int new_idx = 0;\n        int lhs_idx = 0;\n        int rhs_idx = 0;\n        while (lhs_idx < word_index.length && rhs_idx < 
rhs.word_index.length) {\n            int lhs_word = word_index[lhs_idx];\n            int rhs_word = rhs.word_index[rhs_idx];\n            if (lhs_word < rhs_word) {\n                word_idx_new[new_idx] = lhs_word;\n                for (int k = 0; k < K; k++) {\n                    phi_new[new_idx][k] = phi[lhs_idx][k];\n                }\n                lhs_idx++;\n            } else if (rhs_word < lhs_word) {\n                word_idx_new[new_idx] = rhs_word;\n                for (int k = 0; k < K; k++) {\n                    phi_new[new_idx][k] = rhs.phi[rhs_idx][k];\n                }\n                rhs_idx++;\n            } else {\n                word_idx_new[new_idx] = rhs_word;\n                for (int k = 0; k < K; k++) {\n                    phi_new[new_idx][k] = rhs.phi[rhs_idx][k] + phi[lhs_idx][k];\n                }\n                rhs_idx++;\n                lhs_idx++;\n            }\n            new_idx++;\n        }\n        // add word for whatever pointers not reached the end\n        for (; lhs_idx < word_index.length; lhs_idx++) {\n            int lhs_word = word_index[lhs_idx];\n            word_idx_new[new_idx] = lhs_word;\n            for (int k = 0; k < K; k++) {\n                phi_new[new_idx][k] = phi[lhs_idx][k];\n            }\n            new_idx++;\n        }\n        for (; rhs_idx != rhs.word_index.length; rhs_idx++) {\n            int rhs_word = rhs.word_index[rhs_idx];\n            word_idx_new[new_idx] = rhs_word;\n            for (int k = 0; k < K; k++) {\n                phi_new[new_idx][k] = rhs.phi[rhs_idx][k];\n            }\n            new_idx++;\n        }\n        return new LDAPartialTopics(word_idx_new, phi_new);\n    }\n\n    public String toString() {\n        StringBuilder b = new StringBuilder();\n        for (int i = 0; i < word_index.length; i++) {\n            b.append(i + \" - \" + word_index[i] + \": \");\n            for (int k = 0; k < phi[i].length; k++) {\n                b.append(phi[i][k] + \", 
\");\n            }\n            b.append(\"\\n\");\n        }\n        return b.toString();\n    }\n\n    public double[][] toTopicVectors() {\n        // Ensure that words_index has no gaps.\n        int word_max = word_index[word_index.length - 1];\n        int K = phi[0].length;\n        double[][] phi_new = new double[K][word_max + 1];\n        for (int k = 0; k < K; k++) {\n            double z = 0.0;\n            for (int i = 0; i < word_index.length; i++) {\n                int w = word_index[i];\n                phi_new[k][w] = phi[i][k];\n                z += phi[i][k];\n            }\n            for (int i = 0; i < phi_new[k].length; i++) {\n                phi_new[k][i] /= z;\n            }\n        }\n        return phi_new;\n    }\n\n    public double[] toTopicVector() throws Exception {\n        double[][] phi_new = toTopicVectors();\n        if (phi_new.length > 1) {\n            throw new Exception(\n                    \"called toTopicVector() on a thing with multiple vectors\");\n        }\n        return phi_new[0];\n    }\n\n    public LDADenseTopics toTopics() {\n        return new LDADenseTopics(toTopicVectors());\n    }\n\n    public static void main(String[] argv) throws Exception {\n        int[] words_lhs = new int[] { 1, 2, 4, 7, 10 };\n        double[][] phi_lhs = new double[][] { { 0.3, 0.7 }, { 0.2, 0.8 },\n                { 0.1, 0.9 }, { 0.5, 0.5 }, { 0.9, 0.1 } };\n        int[] words_rhs = new int[] { 4, 10, 11, 12, 15 };\n        double[][] phi_rhs = new double[][] { { 0.4, 0.6 }, { 0.3, 0.7 },\n                { 0.3, 0.7 }, { 0.1, 0.9 }, { 0.4, 0.6 }, { 0.5, 0.5 } };\n        LDAPartialTopics lhs = new LDAPartialTopics(words_lhs, phi_lhs);\n        LDAPartialTopics rhs = new LDAPartialTopics(words_rhs, phi_rhs);\n        System.out.println(lhs);\n        System.out.println(rhs);\n        System.out.println(rhs.merge(lhs));\n        System.out.println(lhs.merge(rhs));\n        System.out.println(lhs.merge(rhs).toTopics());\n    
}\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDARandomTopics.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\nimport java.util.Random;\n\npublic class LDARandomTopics implements LDATopics, Serializable {\n\n    private static final long serialVersionUID = -3258304331549481829L;\n    int num_topics;\n    int dict_size;\n    LDADict dict;\n    Random rnd = new Random();\n\n    public LDARandomTopics(LDADict dict, int num_topics) {\n        this.num_topics = num_topics;\n        this.dict_size = dict.size();\n        this.dict = dict;\n    }\n\n    public double wordProb(int word, int topic) {\n        // not gonna normalize or anything, central limit theorem bro.\n        rnd.setSeed(topic * dict_size + word);\n        double mean = 1.0 / dict_size;\n        // So if theres 100 words, return something between 0.005 and 0.015\n        double rand = Math.max(0.0,\n                mean + (rnd.nextBoolean() ? 1 : -1) * rnd.nextDouble()\n                        * (mean / 2));\n        return rand;\n    }\n\n    public int numTopics() {\n        return num_topics;\n    }\n\n    public int dictSize() {\n        return dict_size;\n    }\n\n    public LDADict getDict() {\n        return dict;\n    }\n\n    public void setDict(LDADict d) {\n        dict = d;\n        dict_size = d.size();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDASparseTopics.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\nimport java.util.Map;\n\npublic class LDASparseTopics implements LDATopics, Serializable {\n\n    private static final long serialVersionUID = 4878060449289865652L;\n    int K;\n    Map<Integer, Double> prob;\n    LDADict dict;\n\n    public LDASparseTopics(int K, Map<Integer, Double> prob) {\n        this.prob = prob;\n        this.K = K;\n    }\n\n    public void setDict(LDADict dict_) {\n        dict = dict_;\n    }\n\n    public LDADict getDict() {\n        return dict;\n    }\n\n    public double wordProb(int word, int topic) {\n        int key = word * K + topic;\n        if (prob.containsKey(key)) {\n            return prob.get(key);\n        } else {\n            return 0.00000001;\n        }\n    }\n\n    public int numTopics() {\n        return K;\n    }\n\n    public int dictSize() {\n        return dict.size();\n    }\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDATopics.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\n\npublic interface LDATopics extends Serializable {\n\n    public void setDict(LDADict dict) throws Exception;\n\n    public LDADict getDict();\n\n    public double wordProb(int word, int topic);\n\n    public int numTopics();\n\n    public int dictSize();\n\n}\n"
  },
  {
    "path": "src/main/java/com/etsy/conjecture/topics/lda/LDAUtils.java",
    "content": "package com.etsy.conjecture.topics.lda;\n\nimport java.io.Serializable;\n\npublic class LDAUtils implements Serializable {\n\n    private static final long serialVersionUID = -1142647262716539345L;\n\n    public static double digamma(double x) {\n        if (x > 6.0) {\n            double x2 = x * x;\n            double x4 = x2 * x2;\n            double x6 = x2 * x4;\n            double x8 = x4 * x4;\n            double x10 = x6 * x4;\n            double x12 = x6 * x6;\n            double x14 = x10 * x4;\n            return Math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x2) - 1.0\n                    / (120 * x4) - 1.0 / (252 * x6) + 1.0 / (240 * x8) - 5.0\n                    / (660 * x10) + 691.0 / (32760 * x12) - 1.0 / (12 * x14);\n        } else {\n            return digamma(x + 1.0) - (1.0 / x);\n        }\n    }\n\n    public static double logSumExp(double a, double b) {\n        double x = (a < b) ? a : b;\n        double y = (a < b) ? b : a;\n        if (y - x > 50) {\n            return y;\n        } else {\n            return x + Math.log(1.0 + Math.exp(y - x));\n        }\n    }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/VWReader.scala",
    "content": "package com.etsy.conjecture\n\nimport cascading.pipe.Pipe\nimport cascading.flow.FlowDef\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\n\nimport scala.collection.generic\nimport scala.util.matching.Regex\n\n// Input: line file in VW format\n// Writes MulticlassLabeledInstances in JSON\n\ntrait VWReader {\n    import Dsl._\n\n    def parse(input: String): MulticlassLabeledInstance = {\n\n        // parse header\n        val a = input.split(\"\"\"\\s*\\|\"\"\").toList\n        var b = a(0).split(\"\"\"\\s+\"\"\").toList\n        var label = b(0)\n        var importance = 1.0\n        var tag = \"\"\n\n        try {\n            if (b.length > 1) importance = b(1).toDouble\n            if (b.length > 2) tag = b(2)\n        } catch {\n            case e: Exception => println(\"Ignoring header\")\n        }\n\n        // create inst with header info\n        val instObj = new MulticlassLabeledInstance(label)\n        instObj.setId(tag)\n\n        // parse remainder\n        val remainder = input.split(\"\\\\s+\").toList\n        val pipePattern = \"\"\"(.*)\\|(.*)\"\"\"\n        val pipeReg = (pipePattern).r\n\n        var pastHeader = false\n        var namespace = \"\"\n\n        remainder.map {\n            token: String =>\n                if (pipeReg.pattern.matcher(token).matches) {\n                    val pipeReg(before, after) = token\n                    if (pastHeader)\n                        addFeature(instObj, before, namespace)\n                    namespace = extractNamespace(after) // will be \"\" if no namespace \n                    pastHeader = true\n                } else {\n                    if (pastHeader)\n                        addFeature(instObj, token, namespace)\n                }\n        }\n        instObj\n\n    }\n\n    def extractNamespace(token: String): String = {\n        val pairReg = (\"\"\"(.+)\\:(.+)\"\"\").r\n        var namespace = token\n        if (pairReg.pattern.matcher(token).matches) 
{\n            val pairReg(term, value) = token\n            namespace = term\n            // TODO: return weight for namespace when models can handle that\n        }\n        namespace\n    }\n\n    def setId(instObj: MulticlassLabeledInstance, id: String): Boolean = {\n        instObj.setId(id)\n        true\n    }\n\n    def setImportance(instObj: MulticlassLabeledInstance, token: String): Boolean = {\n        // TODO: set importance weighting here once we support it\n        true\n    }\n\n    def addFeature(instObj: MulticlassLabeledInstance, token: String, namespace: String) {\n\n        val pairPattern = \"\"\"(.+)\\:(.+)\"\"\"\n        val pairReg = (pairPattern).r\n\n        if (token == \"\")\n            return\n\n        try {\n            if (pairReg.pattern.matcher(token).matches) {\n                val pairReg(term, value) = token\n                if (namespace == \"\") instObj.addTerm(term, value.toDouble) // catch numberFormatException\n                else instObj.addTermWithNamespace(term, namespace, value.toDouble)\n            } else {\n                if (namespace == \"\") instObj.addTerm(token)\n                else instObj.addTermWithNamespace(token, namespace)\n            }\n        } catch {\n            case e: Exception => println(\"Ignore line: \" + token)\n        }\n    }\n\n}\n\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/demo/DemoLinearHyperparameterSearch.scala",
    "content": "package com.etsy.scalding.jobs.conjecture\n\nimport scala.util.Random\nimport com.etsy.conjecture.scalding.util._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\nimport com.twitter.scalding._\nimport cascading.tuple.Fields\n\n/**\n * An example of a custom Job that would run the hyperparameter searcher for a Binary model\n * Takes command line arguments:\n * input: Path to training/testing data\n * out_dir: Path to output directory\n * num_trials: The number of models to train over random settings\n * model: Type of linear model to use for training\n */\nclass DemoLinearHyperparameterSearch(args : Args) extends BaseGridSearcher(args) {\n\n  /*\n   * Define the settings to be optimized as below.\n   * DynamicOptions are generic type containers for Args that can perform various metric caluations.\n   * All command line parameters given to this job are automatically added.\n   * DynamicOption takes output name of param and the default value of the param.\n   */\n  class DefaultClassifierOptions extends DynamicOptions(args) {\n    val laplace = new DynamicOption(\"laplace\", 0.0)\n    val gauss = new DynamicOption(\"gauss\", 0.0)\n    val rate = new DynamicOption(\"rate\", 0.1)\n    val numIters = new DynamicOption(\"iter\", 5)\n  }\n\n\n  //Define your parameters to optimize\n  val opts = new DefaultClassifierOptions\n\n  //For each parameter you wish to optimize and defined about, create a hyperparameter instance with the dynamic container and the sampler type\n  val parameters: Seq[HyperParameter[_]] = {\n    val laplace = new HyperParameter(opts.laplace, new LogUniformDoubleSampler(1e-8, 1e-1))\n    val gauss = new HyperParameter(opts.gauss, new LogUniformDoubleSampler(1e-8, 1e-1))\n    val rate = new HyperParameter(opts.rate, new SampleFromSeq(List(.01, .001, .0001, .00001, .000001)))\n    val iters = new HyperParameter(opts.numIters, new SampleFromSeq(List(3, 5)))\n    Seq(laplace, gauss, rate, iters)\n  }\n  //Define model 
type, Binary, Multiclass, or Regression\n  val searcher = new BinaryHyperparameterSearcher(opts, parameters, numTrials)\n          \n  // Call hyperparameter search to run and write to given file location\n  val (results, report) = searcher.search(instances, instance_field)\n  \n  //Write to the file of your choice. \n  results.write(SequenceFile(out_dir + \"/trialSummary\"))\n  report.write(SequenceFile(out_dir + \"/parameterReport\"))\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/demo/IrisDataToMulticlassLabeledInstances.scala",
    "content": "package com.etsy.conjecture.demo\n\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\n\nclass IrisDataToMulticlassLabeledInstances(args: Args) extends Job(args) {\n\n    // This class just converts the tsv of iris data to a sequence file of multiclass labeled instances\n    // which the AdHocClassifier can then use to train.\n    // Note that for a dataset of this size, the use of a hadoop job is overkill, this is for demonstration\n    // puroses.\n    TextLine(args.getOrElse(\"input_file\", \"iris.tsv\"))\n        .mapTo('instance) {\n            l: String =>\n                val names = Array(\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\")\n                val parts = l.split(\"\\t\")\n                val instance = new MulticlassLabeledInstance(parts(4))\n                (0 until 4).foreach { i => instance.setCoordinate(names(i), parts(i).toDouble) }\n                instance\n        }\n        .write(SequenceFile(args.getOrElse(\"output_file\", \"iris_instances\")))\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/demo/LearnMulticlassClassifier.scala",
    "content": "package com.etsy.conjecture.demo\n\nimport com.twitter.scalding._\nimport com.etsy.conjecture.scalding.evaluate.{ MulticlassCrossValidator, MulticlassEvaluator }\nimport com.etsy.conjecture.scalding.train.MulticlassModelTrainer\nimport com.etsy.conjecture.data.{ MulticlassLabel, MulticlassLabeledInstance }\nimport com.etsy.conjecture.model.UpdateableMulticlassLinearModel\n\nimport com.google.gson.Gson\n\nimport cascading.tuple.Fields\n\nclass LearnMulticlassClassifier(args: Args) extends Job(args) {\n\n    val input = args(\"input\")\n    val out_dir = args.getOrElse(\"output\", \"multiclass_classifier\")\n    val class_names = args(\"class_names\").split(\",\")\n    val folds = args.getOrElse(\"folds\", \"0\").toInt\n\n    // Let the user configure the field names on the command line.\n    val data_field_names = args.getOrElse(\"data_fields\", \"instance\").split(\",\")\n    val data_fields = data_field_names.tail.foldLeft(new Fields(data_field_names.head)) { (x, y) => x.append(new Fields(y)) }\n    val instance_field = Symbol(args.getOrElse(\"instance_field\", \"instance\"))\n\n    val instances = SequenceFile(input, data_fields).project(instance_field)\n\n    val model_pipe = new MulticlassModelTrainer(args, class_names)\n        .train(instances, instance_field, 'model)\n\n    model_pipe\n        .write(SequenceFile(out_dir + \"/model\"))\n        .mapTo('model -> 'json) { x: UpdateableMulticlassLinearModel => new Gson().toJson(x) }\n        .write(Tsv(out_dir + \"/model_json\"))\n\n    if (folds > 0) {\n        val eval_pred = new MulticlassCrossValidator(args, folds, class_names)\n            .crossValidateWithPredictions(instances, instance_field, 'pred)\n        eval_pred._1\n            .write(Tsv(out_dir + \"/xval\"))\n        eval_pred._2\n            .write(Tsv(out_dir + \"/pred\"))\n    }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/ALSJob.scala",
    "content": "package com.etsy.conjecture.scalding\n\nimport cascading.pipe.Pipe\nimport cascading.pipe.joiner.InnerJoin\nimport com.twitter.scalding.{Args, Job, Mode, SequenceFile}\nimport org.apache.commons.math3.linear._\n\n/**\n * An abstract job class to implement alternating least squares for matrix factorization.\n * Since the method is iterative, this job overrides job.next rather than trying to\n * build a single massive cascading flow.  This means that the job is more robust to failure, and \n * also doesn't crash the cascading planner with a giant graph.\n *\n * The concrete job class which extends this just has to override the function s() which returns a pipe\n * having fields ('row, 'col, 'value) representing the matrix to factorize.  This is only computed on the\n * first iteration, and then written to disk.  Therefore the function should be self contained, so that the \n * job doesnt try to do pointless work on every iteration.\n *\n * There are some other fields which the child class can override in order to get specific behavior from\n * the method:\n *\n * - zero_weight: the weight of zeros in the matrix, where nonzeros are given weight 1.\n * - norm_constraint: whether to force the norms of rows of the factors to 1 (useful for doing LSH for max-product search).\n * - lambda_row, lambda_col: L2 regularization parameters on the two factors.\n *\n */\n\nabstract class ALSJob[R, C](args : Args) extends Job(args) {\n\n  override def config: Map[AnyRef, AnyRef] =\n    super.config + (\"mapred.child.java.opts\" -> \"-Xmx3G\")\n\n  // Dimension of latent factors.\n  val n = args.getOrElse(\"dim\", \"200\").toInt\n\n  val iter = args.getOrElse(\"iter\", \"0\").toInt\n\n  val max_iter = args.getOrElse(\"max_iter\", \"15\").toInt\n\n  val parallelism = args.getOrElse(\"parallelism\", \"500\").toInt\n\n  val base_dir = args.getOrElse(\"base_dir\", \"als\")\n\n  // The weight of zero terms in the matrices.\n  val zero_weight = 0.001\n\n  // data for s 
matrix, must have fields ('row, 'col, 'value)\n  def s() : Pipe\n\n  def norm_constraint : Boolean = false\n\n  def lambda_row : Double = 0.0f\n  def lambda_col : Double = 0.0f\n\n  val incremental = args.boolean(\"incremental\")\n\n  // allow overriding input and output paths.\n  val input_u_path = args.getOrElse(\"input_u_path\", base_dir+\"/U/\"+(iter-1))\n  val output_u_path = args.getOrElse(\"output_u_path\", base_dir+\"/U/\"+iter)\n  val output_v_path = args.getOrElse(\"output_v_path\", base_dir+\"/V/\"+iter)\n\n  // technique to initialize the vector\n  def initial_vector(row : R) : RealVector = {\n    val rand = new scala.util.Random(row.hashCode)\n    val vec = MatrixUtils.createRealVector((0 until n).map{i => rand.nextGaussian}.toArray)\n    vec.mapDivide(vec.getNorm)\n  }\n\n  val S = if(iter == 0 || args.boolean(\"update_matrix\")) {\n    s().project('row, 'col, 'value).write(SequenceFile(base_dir + \"/S\"))\n  } else {\n    SequenceFile(base_dir+\"/S\", ('row, 'col, 'value)).read\n  }\n\n  if(iter == 0 && !incremental) {\n    // Initial item factors.\n    S\n      .groupBy('row){_.size('count)}\n      .map('row -> 'u_vec)(initial_vector)\n      .project('row, 'u_vec)\n      .write(SequenceFile(base_dir+\"/U/0\"))\n  } else {\n    // Perform iteration of dual alternating least squares.\n    val U = SequenceFile(input_u_path, ('row, 'u_vec)).read\n\n    // -- Update V first.\n    // Compute U'U\n    val UU = U.mapTo('u_vec -> 'UU){u : RealVector => u.outerProduct(u)}\n      .groupAll{_.reduce[RealMatrix]('UU){(a, b) => a.add(b)}}\n\n    val V = S.joinWithSmaller('row -> 'row, U, new InnerJoin(), parallelism)\n      .groupBy('col){_.toList[(RealVector, Double)](('u_vec, 'value) -> 'u_list).reducers(parallelism).forceToReducers}\n      .crossWithTiny(UU)\n      .mapTo(('col, 'u_list, 'UU) -> ('col, 'v_vec)){\n        x : (C, List[(RealVector, Double)], RealMatrix) =>\n        val col_id = x._1\n        var XX = x._3.scalarMultiply(zero_weight)\n        var 
Xy = x._2.view.map{t => t._1.mapMultiply(t._2)}.reduce{(a,b) => a.add(b)}.mapMultiply(1.0 + zero_weight)\n        x._2.foreach{t => XX = XX.add(t._1.outerProduct(t._1))}\n        val lambda = if(norm_constraint) compute_lambda(XX, Xy) else lambda_col\n        val res = new LUDecomposition(XX.add(MatrixUtils.createRealIdentityMatrix(XX.getRowDimension).scalarMultiply(lambda))).getSolver.getInverse.operate(Xy)\n        (col_id, res)\n      }\n      .write(SequenceFile(output_v_path))\n\n    // -- Finally update U.\n    val VV = V.mapTo('v_vec -> 'VV){u : RealVector => u.outerProduct(u)}\n      .groupAll{_.reduce[RealMatrix]('VV){(a, b) => a.add(b)}}\n\n    S\n      .joinWithSmaller('col -> 'col, V, new InnerJoin(), parallelism)\n      .groupBy('row){_.toList[(RealVector, Double)](('v_vec, 'value) -> 'v_list).reducers(parallelism).forceToReducers}\n      .crossWithTiny(VV)\n      .mapTo(('row, 'v_list, 'VV) -> ('row, 'u_vec)){\n        x : (R, List[(RealVector, Double)], RealMatrix) =>\n        val row_id = x._1\n        var XX = x._3.scalarMultiply(zero_weight)\n        var Xy = x._2.view.map{t => t._1.mapMultiply(t._2)}.reduce{(a,b) => a.add(b)}.mapMultiply(1.0 + zero_weight)\n        x._2.foreach{t => XX = XX.add(t._1.outerProduct(t._1))}\n        val lambda = if(norm_constraint) compute_lambda(XX, Xy) else lambda_row\n        val res = new LUDecomposition(XX.add(MatrixUtils.createRealIdentityMatrix(XX.getRowDimension).scalarMultiply(lambda))).getSolver.getInverse.operate(Xy)\n        (row_id, res)\n      }\n      .write(SequenceFile(output_u_path))\n  }\n\n  // for the norm constrained version, compute the lambda necessary so that the output vector has unit norm.\n  def compute_lambda(XX : RealMatrix, Xy : RealVector) : Double = {\n    val eigen = new EigenDecomposition(XX)\n    val u = eigen.getVT.operate(Xy)\n    // approximate the lagrange multiplier\n    var lambda_max = math.sqrt(u.dotProduct(u))\n    var lambda_min = 
-eigen.getRealEigenvalues.min+0.000000001\n    var norm_max = (0 until u.getDimension).map{i => val ui = u.getEntry(i); val ei = eigen.getRealEigenvalue(i); ui*ui / ((ei+lambda_max)*(ei+lambda_max))}.sum\n    var norm_min = (0 until u.getDimension).map{i => val ui = u.getEntry(i); val ei = eigen.getRealEigenvalue(i); ui*ui / ((ei+lambda_min)*(ei+lambda_min))}.sum\n    while(math.abs(norm_max - norm_min) > 0.0001) {\n      val lambda_mid = (lambda_max + lambda_min) / 2.0\n      val norm_mid = (0 until u.getDimension).map{i => val ui = u.getEntry(i); val ei = eigen.getRealEigenvalue(i); ui*ui / ((ei+lambda_mid)*(ei+lambda_mid))}.sum\n      if(norm_mid < 1) {\n        lambda_max = lambda_mid\n        norm_max = norm_mid\n      } else {\n        lambda_min = lambda_mid\n        norm_min = norm_mid\n      }\n    }\n    val lambda = (lambda_max + lambda_min) / 2\n    lambda\n  }\n\n  override def next : Option[Job] = {\n    val new_args = args + (\"iter\", Some((iter+1).toString))\n    if(iter < max_iter && !incremental) {\n      Some(clone(new_args))\n    } else {\n      None\n    }\n  }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/FastKNN.scala",
    "content": "package com.etsy.conjecture.scalding\n\nimport collection.mutable.PriorityQueue\nimport com.twitter.scalding._\nimport cascading.pipe.Pipe\nimport cascading.pipe.joiner.InnerJoin\nimport org.apache.commons.math3.linear.{MatrixUtils, RealVector}\n\nobject FastKNN extends Serializable {\n\n  import com.twitter.scalding.Dsl._\n\n  // The basic idea is that we do KNN on arbitrary types.\n  // These can be objects containing e.g., identifiers (user_id etc) and also correspond to some point in a metric space.\n  // Examples are objects like (user_id, vector from matrix factorization model).\n  // Typically when we do the KNN procedure, we dont actually care about returning the entire object for all the neighbors,\n  // but only the list of the ids, and their distances.\n  // E.g., we would return the list of (user_id, distance) rather than (user_id, vector, distance).\n  // This allows having larger lists of stuff in ram since the vector etc may be large.\n\n  // The main entry point for knn in a single pipe.\n  // X: Type of the element on which the distance is defined (the thing in the vec_field).\n  // Y: Type of id for the element (thing in the id_field).\n  // p: Pipe of stuff to knn\n  // id_field: Field name for id\n  // vec_field: field name for vec.\n  // neighb_field: field name for result (neighbors).\n  // k: the k from knn.\n  // dist: the distance function for X.  
If the thing you give isnt a real distance function then probably this method will give you garbage results.\n  // init_num_centers: Number of blocks to partition the data into, should be probably around sqrt(n).\n  // bin_per_point: how many blocks to put each point into (increasing quality of the approximation).\n  def knn[X, Y](p : Pipe, id_field : Symbol, vec_field : Symbol, neighb_field : Symbol, k : Int, dist : (X, X) => Double,\n    init_num_centers : Int = 10000, bins_per_point : Int = 5, max_bin_size : Int = 20000) : Pipe = {\n\n    val centers = initialize_bins[X](p, id_field, vec_field, dist, init_num_centers, bins_per_point, max_bin_size)\n\n    // Do knn in each cluster, and aggregate.\n    construct_bins[X, Y](p, id_field, vec_field, 'list, centers, bins_per_point, max_bin_size, dist)\n      .filter('count){c : Int => c <= max_bin_size}\n      .flatMapTo('list -> (id_field, neighb_field)){l : List[(Y, X)] =>\n        println(l.size)\n        l.view.map{t => (t._1, knn_id[X, Y](t._2, l, k+1, dist).filter{_._1 != t._1})}\n      }\n      .groupBy(id_field){_.reduce[List[(Y, Double)]](neighb_field){(a, b) => (a++b).groupBy{_._1}.toList.map{t => (t._1, t._2.map{_._2}.min)}.sortBy{_._2}.take(k)}.reducers(1000).forceToReducers}\n      .project(id_field, neighb_field)\n  }\n\n  // The entry point for the 2 pipe version of knn.\n  // Z is the type for the id field of the candidates.\n  def knn2[X, Y, Z](targets : Pipe, target_id_field : Symbol, target_vec_field : Symbol,\n    candidates : Pipe, candidate_id_field : Symbol, candidate_vec_field : Symbol, neighb_field : Symbol, k : Int, dist : (X, X) => Double,\n    init_num_centers : Int, bins_per_point : Int, max_bin_size : Int) : Pipe = {\n\n    // Tesselate the candidates.\n    val candidate_centers = initialize_bins[X](candidates, candidate_id_field, candidate_vec_field, dist, init_num_centers, 1, max_bin_size)\n\n    val candidate_assignments = construct_bins[X, Y](candidates, candidate_id_field, 
candidate_vec_field, 'candidate_list,\n      candidate_centers, 1, max_bin_size, dist)\n\n    // Assign targets to same bins as candidates.\n    val target_assignments = assign_bins[X](targets, target_id_field, target_vec_field, candidate_centers, bins_per_point, dist)\n\n    // Replicate the candidates, and fragment the targets.\n    val bin_replicates = target_assignments\n      .groupBy('bin){_.size('count)}\n      .map('count -> 'num_fragments){c : Int => 1 + (c / max_bin_size)}\n      .groupAll{_.toList[(Int, Int)](('bin, 'num_fragments) -> 'bin_replicates)}\n      .mapTo('bin_replicates -> 'bin_replicates){l : List[(Int, Int)] => l.toMap}\n\n    val targets_fragmented = target_assignments\n      .crossWithTiny(bin_replicates)\n      .map((target_id_field, 'bin, 'bin_replicates) -> ('rep_bin, 'rep)){x : (Z, Int, Map[Int, Int]) => (x._2, math.abs(x._1.hashCode) % x._3.getOrElse(x._2, 1))}\n      .groupBy('rep_bin, 'rep){_.toList[(Z, X)]((target_id_field, target_vec_field) -> 'target_list).reducers(1000)}\n\n    val candidates_replicated = candidate_assignments\n      .crossWithTiny(bin_replicates)\n      .flatMap(('bin, 'bin_replicates) -> ('rep_bin, 'rep)){x : (Int, Map[Int, Int]) => (0 until x._2.getOrElse(x._1, 1)).map{i => (x._1, i)}}\n      .project('rep_bin, 'rep, 'candidate_list)\n\n    // Do knn in each cluster, and aggregate.\n    candidates_replicated\n      .joinWithSmaller(('rep_bin, 'rep) -> ('rep_bin, 'rep), targets_fragmented, new InnerJoin(), 1000)\n      .flatMapTo(('target_list, 'candidate_list) -> (target_id_field, neighb_field)){x : (List[(Z, X)], List[(Y, X)]) =>\n        println(x._1.size + \" \" + x._2.size)\n        x._1.view.map{t => (t._1, knn_id[X, Y](t._2, x._2, k, dist))}\n      }\n      .groupBy(target_id_field){_.reduce[List[(Y, Double)]](neighb_field){(a, b) => (a++b).groupBy{_._1}.toList.map{t => (t._1, t._2.map{_._2}.min)}.sortBy{_._2}.take(k)}.reducers(1000)}\n      .project(target_id_field, neighb_field)\n  }\n\n  // Return 
the ids of the closest elements to the target.\n  // X is the type of the element on which the distance is defined.\n  // Y is the type of the identifier for each element.\n  def knn_id[X, Y](target : X, candidates : List[(Y, X)], K : Int, dist : (X, X) => Double) : List[(Y, Double)] = {\n    if(K > 250) {\n      candidates.map{s => (s._1, dist(target, s._2))}.sortBy{_._2}.take(K)\n    } else {\n      val q = new PriorityQueue[(Y, Double)]()(Ordering.by[(Y, Double), Double](_._2))\n      var worst = 0.0\n      var size = 0\n      candidates.foreach{s =>\n        val ds = dist(target, s._2)\n        if(size < K || ds < worst) {\n          size += 1\n          q.enqueue((s._1, ds))\n          if(size > K) {\n            q.dequeue\n            size -= 1\n          }\n          worst = q.head._2\n        }\n      }\n      q.toList.sortBy{_._2}\n    }\n  }\n\n  // Return the indices of the closest elements.\n  def knn_idx[X](vec : X, l : List[X], K : Int, dist : (X, X) => Double) : List[Int] = {\n    val q = new PriorityQueue[(Int, Double)]()(Ordering.by[(Int, Double), Double](_._2))\n    var worst = 0.0\n    var size = 0\n    var idx = 0\n    l.foreach{r : X =>\n      val di = dist(vec, r)\n      if(size < K || di < worst) {\n        size += 1\n        q.enqueue((idx, di))\n        if(size > K) {\n          q.dequeue\n          size -= 1\n        }\n        worst = q.head._2\n      }\n      idx += 1\n    }\n    q.toList.sortBy{_._2}.map{_._1}\n  }\n\n  def initialize_bins[X](p : Pipe, id_field : Symbol, vec_field : Symbol, dist : (X, X) => Double,\n    init_num_centers : Int, bins_per_point : Int, max_bin_size : Int) : Pipe = {\n\n    // Choose init_num_centers points at random.\n    val centers = p\n      .map(vec_field -> 'rand){r : X => new scala.util.Random(r.toString.hashCode).nextDouble}\n      .groupRandomly(math.min(1000, init_num_centers)){_.sortWithTake[(X, Double)]((vec_field, 'rand) -> 'centers, 1 + (init_num_centers / 1000)){(a, b) => a._2 > b._2}}\n      
.groupAll{_.reduce[List[(X, Double)]]('centers){(a, b) => a++b}}\n      .mapTo('centers -> 'centers){l : List[(X, Double)] => l.sortBy{-_._2}.take(init_num_centers).map{_._1}}\n\n    val centers_new = p.crossWithTiny(centers)\n      .flatMap(('centers, vec_field) -> 'bin){x : (List[X], X) => knn_idx[X](x._2, x._1, bins_per_point, dist)}\n      .project('bin, vec_field)\n      .map(vec_field -> 'rand){r : X => new scala.util.Random((r.toString+\"foo\").hashCode).nextDouble}\n      .groupBy('bin){_.size('count).sortWithTake[(X, Double)]((vec_field, 'rand) -> 'centers, 1000){(a, b) => a._2 > b._2}}\n      .filter('count){c : Int => c > max_bin_size}\n      .mapTo(('centers, 'count) -> 'centers_new){l : (List[(X, Double)], Int) => \n        val md = l._1.maxBy{_._2}._2\n        l._1.sortBy{t => val d = t._2 / md; -d * (1-d)}.take(l._2 / max_bin_size).map{_._1}\n      }\n      .groupAll{_.reduce[List[(X, Double)]]('centers_new){(a, b) => a++b}}\n\n    centers.crossWithTiny(centers_new)\n      .mapTo(('centers, 'centers_new) -> 'centers){x : (List[X], List[X]) => x._1 ++ x._2}\n  }\n\n  def assign_bins[X](p : Pipe, id_field : Symbol, vec_field : Symbol, centers : Pipe, bins_per_point : Int, dist : (X, X) => Double) : Pipe = {\n    // Make assignments to clusters.\n    p.crossWithTiny(centers)\n      .flatMap((vec_field, 'centers) -> 'bin){x : (X, List[X]) => knn_idx[X](x._1, x._2, bins_per_point, dist)}\n      .discard('centers)\n  }\n\n  def construct_bins[X, Y](p : Pipe, id_field : Symbol, vec_field : Symbol, list_field : Symbol,\n    centers : Pipe, bins_per_point : Int, max_bin_size : Int, dist : (X, X) => Double) : Pipe = {\n    assign_bins[X](p, id_field, vec_field, centers, bins_per_point, dist)\n      .groupBy('bin){\n        //_.toList[(Y, X)]((id_field, vec_field) -> list_field)\n        _.sortWithTake[(Y, X)]((id_field, vec_field) -> list_field, max_bin_size){(a, b) => false}\n        .size('count)\n        .reducers(1000)\n      }\n      .project('bin, 
'count, list_field)\n  }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/LSH.scala",
    "content": "package com.etsy.conjecture.scalding\n\nimport collection.mutable.PriorityQueue\n\nimport cascading.pipe.Pipe\nimport cascading.pipe.joiner.InnerJoin\n\nimport org.apache.commons.math3.linear.RealVector\n\n/**\n * Class provides functions for doing approximate K-nearest neighbors.\n * hashes : The number of times to hash.\n * planes : The number of dividing planes (also bits in the hash).\n * max_bin_size : The max size for a hash bin to be considered (we do exact knn in each bin, so large ones will increase computation time).\n * parallelism : How many reducers to use for critical sections.\n * defaults are sane for most problems.\n * more hashes = more chance for true knn to be in the same hash bin as the target, but also means more computation.\n * more planes = less items in each hash bucket, which improves computation but also could degrade approximation quality.\n */ \nclass LSH(val hashes : Int = 50, val planes : Int = 12, val max_bin_size : Int = 10000, val parallelism : Int = 500) extends Serializable {\n  \n  // import neede to write scalding-like code.\n  import com.twitter.scalding.Dsl._\n\n  /**\n   * Just a class to hold an id and a vector together.\n   */\n  class Point[T](val id : T, val vector : RealVector) extends Serializable {}\n\n  /**\n   * Brute force knn for inside each hash bin.\n   * Works faster than just using obvious scala ways (map/sortBy etc).\n   */\n  def findKnn[T](vec : RealVector, points : Iterable[Point[T]], K : Int) : List[(Point[T], Double)] = { \n    val q = new PriorityQueue[(Point[T], Double)]()(Ordering.by[(Point[T], Double), Double](_._2))\n    var worst = 0.0 \n    var size = 0 \n    points.foreach{p : Point[T] =>\n      val dist = p.vector.getDistance(vec)\n      if(size < K || dist < worst) {\n        size += 1\n        q.enqueue((p, dist))\n        if(size > K) {\n          q.dequeue\n          size -= 1\n        }   \n        worst = q.head._2\n      }   \n    }   \n    q.toList.sortBy{_._2}\n  }\n\n  
/**\n   * Hash repeatedly by dividing the space along origin-containing planes.\n   * v : The vector to hash.\n   * output is the list of hashes, each having its index as part of the value.\n   */ \n  def hash(v : RealVector) : IndexedSeq[Long] = {\n    (0 until hashes).map{h =>\n      (0 until planes).map{i =>\n        val r = new scala.util.Random(i+1000*h) // random suck with lil seeds.\n        val d = v.toArray.map{_*r.nextGaussian}.sum\n        if(d > 0.0)\n          1L << i\n        else\n          0L\n      }.sum + (h.toLong << planes)\n    }\n  }\n\n  /**\n   * Forms hash bins from a single pipe of vectors and ids.\n   */\n  def form_bins[I](p : Pipe, id_field : Symbol, vec_field : Symbol, bin_field : Symbol, hash_field : Symbol) : Pipe = {\n    p\n    .map((id_field, vec_field) -> 'point){x : (I, RealVector) => new Point[I](x._1, x._2)}\n    .flatMap(vec_field -> hash_field){v : RealVector => hash(v)}\n    .project('point, hash_field)\n    .groupBy(hash_field){\n      _.size('count)\n      .sortWithTake[Point[I]]('point -> bin_field, max_bin_size){(a,b) => false}\n      .reducers(parallelism)\n      .forceToReducers\n    }\n    .filter('count){c : Int => c <= max_bin_size}\n    .project(hash_field, bin_field)\n  }\n\n  /**\n   * Single pipe version of knn.\n   * Finds knn of each element in the pipe (i.e., every element is both a target and a candidate neighbor)\n   * A thing isnt its own nearest neighbor.\n   * I is the type of id used.\n   */\n  def knn[I](p : Pipe, id_field : Symbol, vec_field : Symbol, neighbors_field : Symbol, K : Int) : Pipe = {\n    form_bins[I](p, id_field, vec_field, 'bin, 'hash)\n    .flatMapTo('bin -> (id_field, neighbors_field)){\n      bin : List[Point[I]] =>\n      bin.view.map{p =>\n        (p.id, findKnn[I](p.vector, bin, K+1).filter{_._1.id != p.id}.map{t => (t._1.id, t._2)}) // (id, distance)\n      }\n    }\n    // - aggregate knn across hash bins.\n    .groupBy(id_field) {\n      _.reduce[List[(I, 
Double)]](neighbors_field){(a, b) => (a ++ b).groupBy{_._1}.mapValues{_.head._2}.toList.sortBy{_._2}.take(K)}\n      .forceToReducers\n      .reducers(parallelism)\n    }\n    .project(id_field, neighbors_field)\n  }\n\n  /**\n   * Two pipe version of knn.\n   * First pipe is targets (things we find the knn for) ids are of type I\n   * Second pipe is candidates (things that can be the knn) ids are of type J\n   * A thing can be its own neighbor if its in both pipes.\n   */\n  def knn[I,J](targets : Pipe, target_id_field : Symbol, target_vec_field : Symbol,\n    candidates : Pipe, candidate_id_field : Symbol, candidate_vec_field : Symbol,\n    neighbors_field : Symbol, K : Int) : Pipe = {\n    form_bins[I](targets, target_id_field, target_vec_field, 'target_bin, 'hash)\n    .joinWithSmaller('hash -> 'hash, form_bins[J](candidates, candidate_id_field, candidate_vec_field, 'candidate_bin, 'hash), new InnerJoin(), parallelism)\n    .flatMapTo(('target_bin, 'candidate_bin) -> (target_id_field, neighbors_field)){\n      x : (List[Point[I]], List[Point[J]]) =>\n      x._1.view.map{p =>\n        (p.id, findKnn[J](p.vector, x._2, K).map{t => (t._1.id, t._2)}) // (id, distance)\n      }\n    }\n    // - aggregate knn across hash bins.\n    .groupBy(target_id_field) {\n      _.reduce[List[(J, Double)]](neighbors_field){(a, b) => (a ++ b).groupBy{_._1}.mapValues{_.head._2}.toList.sortBy{_._2}.take(K)}\n      .forceToReducers\n      .reducers(parallelism)\n    }\n    .project(target_id_field, neighbors_field)\n  }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/NNMF.scala",
    "content": "package com.etsy.conjecture.scalding\n\nimport org.apache.commons.math3.linear._\n\nimport com.etsy.scalding._\nimport com.twitter.algebird.Operators._\nimport com.twitter.scalding._\n\nimport cascading.flow.FlowDef\nimport cascading.pipe.Pipe\nimport cascading.pipe.joiner.InnerJoin\nimport cascading.tuple.Fields\n\nobject NNMF extends Serializable {\n\n  import com.twitter.scalding.Dsl._\n\n  // based on http://research.microsoft.com/pubs/119077/dnmf.pdf\n\n  /**\n   * input:\n   * A: a sparse matrix in the form ('row, 'col, 'val), with tuples of type (R, C, Double).\n   * k: the dimension of the factorization.\n   */\n  def initGaussian(A : Pipe, k : Int, reducers : Int = 500) : (Pipe, Pipe) = {\n    val H0 = A.groupBy('row){_.size('count).reducers(reducers)}\n      .map(() -> 'vec){_ : Unit => MatrixUtils.createRealVector((0 until k).map{i => math.random}.toArray)}\n      .map(() -> 'bias){_ : Unit => math.random}\n      .project('row, 'vec, 'bias)\n    val W0 = A.groupBy('col){_.size('count).reducers(reducers)}\n      .map(() -> 'vec){_ : Unit => MatrixUtils.createRealVector((0 until k).map{i => math.random}.toArray)}\n      .map(() -> 'bias){_ : Unit => math.random}\n      .project('col, 'vec, 'bias)\n    (H0, W0)\n  }\n\n  /**\n   * These functions embed bias terms for both factors into the original factorization.\n   */\n  def createWVector(v : RealVector, b : Double) : RealVector = {\n    v.append(MatrixUtils.createRealVector(Array(1.0, b)))\n  }\n\n  def createHVector(v : RealVector, b : Double) : RealVector = {\n    v.append(MatrixUtils.createRealVector(Array(b, 1.0)))\n  }\n\n  def explodeWVector(u : RealVector) : (RealVector, Double) = {\n    val d = u.getDimension\n    (u.getSubVector(0, d - 2), u.getEntry(d - 1))\n  }\n\n  def explodeHVector(u : RealVector) : (RealVector, Double) = {\n    val d = u.getDimension\n    (u.getSubVector(0, d - 2), u.getEntry(d - 2))\n  }\n\n  /*\n   * input:\n   * A: a sparse matrix in the form ('row, 'col, 
'val), with tuples of type (R, C, Double).\n   * H: a dense matrix of ('row, 'vec, 'bias)\n   * W: a dense matrix of ('col, 'vec, 'bias)\n   * With W,H from initGaussian or a previous iteration.\n   */\n  def updateGaussian(A : Pipe, H : Pipe, W : Pipe, reducers : Int = 500) : (Pipe, Pipe) = {\n\n    // Note that row and column vectors are both represented as a RealVector which doesnt have an orientation.\n    // Therefore whether it is a row or column will have to be inferred from context.\n\n    // -- First update H.\n    // W'W\n    val WW = W.mapTo(('vec, 'bias) -> 'WW){v : (RealVector, Double) =>\n        val u = createWVector(v._1, v._2)\n        u.outerProduct(u)\n      }\n      .groupAll{_.reduce[RealMatrix]('WW){(a, b) => a.add(b)}}\n\n    // W'WH\n    val WWH = H.crossWithTiny(WW)\n      .map(('WW, 'vec, 'bias) -> 'vec_wwh){x : (RealMatrix, RealVector, Double) => x._1.operate(createHVector(x._2, x._3))}\n      .project('row, 'vec_wwh)\n\n    // W'A\n    val WA = W.joinWithLarger('col -> 'col, A, new InnerJoin(), reducers)\n      .map(('val, 'vec, 'bias) -> 'vec){x : (Double, RealVector, Double) => createWVector(x._2, x._3).mapMultiply(x._1)}\n      .groupBy('row){_.reduce[RealVector]('vec -> 'vec_wa){(a, b) => a.add(b)}.reducers(reducers).forceToReducers}\n\n    // Pointwise multiplier to old H\n    val HM = WA.joinWithSmaller('row -> 'row, WWH, new InnerJoin(), reducers)\n      .map(('vec_wa, 'vec_wwh) -> 'vec_mult){x : (RealVector, RealVector) => x._1.ebeDivide(x._2)}\n      .map('vec_mult -> 'vec_mult){x : RealVector => MatrixUtils.createRealVector(x.toArray.map{i => if(i.isInfinite || i.isNaN) 1.0 else i})}\n      .project('row, 'vec_mult)\n\n    // new H.\n    val H_ = H.joinWithSmaller('row -> 'row, HM, new InnerJoin(), reducers)\n      .map(('vec, 'bias, 'vec_mult) -> ('vec, 'bias)){x : (RealVector, Double, RealVector) => explodeHVector(createHVector(x._1, x._2).ebeMultiply(x._3))}\n      .project('row, 'vec, 'bias)\n\n    // -- Then update W.\n    
// HH'\n    val HH = H_.mapTo(('vec, 'bias) -> 'HH){v : (RealVector, Double) =>\n        val u = createHVector(v._1, v._2)\n        u.outerProduct(u)\n      }\n      .groupAll{_.reduce[RealMatrix]('HH){(a, b) => a.add(b)}}\n\n    // WHH'\n    val WHH = W.crossWithTiny(HH)\n      .map(('HH, 'vec, 'bias) -> 'vec_whh){x : (RealMatrix, RealVector, Double) => x._1.operate(createWVector(x._2, x._3))}\n      .project('col, 'vec_whh)\n\n    // AH'\n    val AH = H_.joinWithLarger('row -> 'row, A, new InnerJoin(), reducers)\n      .map(('val, 'vec, 'bias) -> 'vec){x : (Double, RealVector, Double) => createHVector(x._2, x._3).mapMultiply(x._1)}\n      .groupBy('col){_.reduce[RealVector]('vec -> 'vec_ah){(a, b) => a.add(b)}.reducers(reducers).forceToReducers}\n\n    // Pointwise multiplier to old W\n    val WM = AH.joinWithSmaller('col -> 'col, WHH, new InnerJoin(), reducers)\n      .map(('vec_ah, 'vec_whh) -> 'vec_mult){x : (RealVector, RealVector) => x._1.ebeDivide(x._2)}\n      .map('vec_mult -> 'vec_mult){x : RealVector => MatrixUtils.createRealVector(x.toArray.map{i => if(i.isInfinite || i.isNaN) 1.0 else i})}\n      .project('col, 'vec_mult)\n\n    // new W.\n    val W_ = W.joinWithSmaller('col -> 'col, WM, new InnerJoin(), reducers)\n      .map(('vec, 'bias, 'vec_mult) -> ('vec, 'bias)){x : (RealVector, Double, RealVector) => explodeWVector(createWVector(x._1, x._2).ebeMultiply(x._3))}\n      .project('col, 'vec, 'bias)\n\n    (H_, W_)\n  }\n\n  /**\n   * Possibly faster vec add that doesnt create a new object.\n   */\n  def addTo(acc : RealVector, v : RealVector) : RealVector = {\n    acc.combineToSelf(1.0, 1.0, v)\n    acc\n  }\n\n  /**\n   * Possibly faster matrix  add that doesnt create a new object.\n   */\n  def addTo(acc : RealMatrix, m : RealMatrix) : RealMatrix = {\n    var r = 0\n    while(r < acc.getRowDimension) {\n      var c = 0\n      while(c < acc.getColumnDimension) {\n        acc.addToEntry(r, c, m.getEntry(r, c))\n        c += 1\n      }\n      r += 
1\n    }\n    acc\n  }\n\n  /*\n   * -- weighted version.\n   * Optimizes weighted l2 loss, where zeros have weight 1, non zeros have weight 1+alpha.\n   * input:\n   * A: a sparse matrix in the form ('row, 'col, 'val), with tuples of type (R, C, Double).\n   * H: a dense matrix of ('row, 'vec, 'bias)\n   * W: a dense matrix of ('col, 'vec, 'bias)\n   * alpha: alpha from loss function.\n   * With W,H from initGaussian or a previous iteration.\n   */\n  def updateGaussianWeighted(A : Pipe, H : Pipe, W : Pipe, alpha : Double, reducers : Int = 500) : (Pipe, Pipe) = {\n\n    // -- First update H.\n    // W'W\n    val WW = W.mapTo(('vec, 'bias) -> 'WW){v : (RealVector, Double) =>\n        val u = createWVector(v._1, v._2)\n        u.outerProduct(u)\n      }\n      .groupAll{_.reduce[RealMatrix]('WW){(a, b) => a.add(b)}}\n\n    // W'WH\n    val WWH = H.crossWithTiny(WW)\n      .map(('vec, 'bias) -> 'vec){x : (RealVector, Double) => createHVector(x._1, x._2)}\n      .map(('WW, 'vec) -> 'vec_wwh){x : (RealMatrix, RealVector) => x._1.operate(x._2)}\n      .project('row, 'vec_wwh, 'vec)\n\n    // W'A\n    val WA = W.joinWithLarger('col -> 'col, A, new InnerJoin(), reducers)\n      .map(('val, 'vec, 'bias) -> ('vec_wa, 'denom_vec)){x : (Double, RealVector, Double) =>\n        val v = createWVector(x._2, x._3)\n        (v.mapMultiply(x._1), v)\n      }\n      .groupBy('row){\n        _.reduce[RealVector]('vec_wa){(a, b) => addTo(a, b)}\n        .toList[RealVector]('denom_vec -> 'denom_vec_list)\n        .reducers(reducers)\n        //.forceToReducers\n      }\n      .project('row, 'vec_wa, 'denom_vec_list)\n\n    // Pointwise multiplier to old H\n    val HM = WA.joinWithSmaller('row -> 'row, WWH, new InnerJoin(), reducers)\n      .map(('vec_wa, 'vec_wwh, 'vec, 'denom_vec_list) -> 'vec_mult){x : (RealVector, RealVector, RealVector, List[RealVector]) =>\n        val den_vec = x._4.tail.foldLeft(x._4.head.mapMultiply(x._4.head.dotProduct(x._3))){(a, b) => a.combineToSelf(1.0, 
b.dotProduct(x._3), b); a}\n        val num = x._1.mapMultiply(1.0 + alpha)\n        val den = x._2.add(den_vec.mapMultiply(alpha))\n        num.ebeDivide(den)\n      }\n      .map('vec_mult -> 'vec_mult){x : RealVector => MatrixUtils.createRealVector(x.toArray.map{i => if(i.isInfinite || i.isNaN) 1.0 else i})}\n      .project('row, 'vec_mult)\n\n    // new H.\n    val H_ = H.joinWithSmaller('row -> 'row, HM, new InnerJoin(), reducers)\n      .map(('vec, 'bias, 'vec_mult) -> ('vec, 'bias)){x : (RealVector, Double, RealVector) => explodeHVector(createHVector(x._1, x._2).ebeMultiply(x._3))}\n      .project('row, 'vec, 'bias)\n\n    // -- Then update W.\n    // HH'\n    val HH = H_.mapTo(('vec, 'bias) -> 'HH){v : (RealVector, Double) =>\n        val u = createHVector(v._1, v._2)\n        u.outerProduct(u)\n      }\n      .groupAll{_.reduce[RealMatrix]('HH){(a, b) => a.add(b)}}\n\n    // WHH'\n    val WHH = W.crossWithTiny(HH)\n      .map(('vec, 'bias) -> 'vec){x : (RealVector, Double) => createWVector(x._1, x._2)}\n      .map(('HH, 'vec) -> 'vec_whh){x : (RealMatrix, RealVector) => x._1.operate(x._2)}\n      .project('col, 'vec_whh, 'vec)\n\n    // AH'\n    val AH = H_.joinWithLarger('row -> 'row, A, new InnerJoin(), reducers)\n      .map(('val, 'vec, 'bias) -> ('vec_ah, 'denom_vec)){x : (Double, RealVector, Double) =>\n        val v = createHVector(x._2, x._3)\n        (v.mapMultiply(x._1), v)\n      }\n      .groupBy('col){\n        _.reduce[RealVector]('vec_ah){(a, b) => addTo(a, b)}\n        .toList[RealVector]('denom_vec -> 'denom_vec_list)\n        .reducers(reducers)\n        //.forceToReducers\n      }\n      .project('col, 'vec_ah, 'denom_vec_list)\n\n    // Pointwise multiplier to old W\n    val WM = AH.joinWithSmaller('col -> 'col, WHH, new InnerJoin(), reducers)\n      .map(('vec_ah, 'vec_whh, 'vec, 'denom_vec_list) -> 'vec_mult){x : (RealVector, RealVector, RealVector, List[RealVector]) =>\n        val den_vec = 
x._4.tail.foldLeft(x._4.head.mapMultiply(x._4.head.dotProduct(x._3))){(a, b) => a.combineToSelf(1.0, b.dotProduct(x._3), b); a}\n        val num = x._1.mapMultiply(1.0 + alpha)\n        val den = x._2.add(den_vec.mapMultiply(alpha))\n        num.ebeDivide(den)\n      }\n      .map('vec_mult -> 'vec_mult){x : RealVector => MatrixUtils.createRealVector(x.toArray.map{i => if(i.isInfinite || i.isNaN) 1.0 else i})}\n      .project('col, 'vec_mult)\n\n    // new W.\n    val W_ = W.joinWithSmaller('col -> 'col, WM, new InnerJoin(), reducers)\n      .map(('vec, 'bias, 'vec_mult) -> ('vec, 'bias)){x : (RealVector, Double, RealVector) => explodeWVector(createWVector(x._1, x._2).ebeMultiply(x._3))}\n      .project('col, 'vec, 'bias)\n\n    (H_, W_)\n  }\n}\n\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/SVD.scala",
    "content": "package com.etsy.conjecture.scalding\n\nimport org.apache.commons.math3.linear._\nimport cascading.pipe.Pipe\nimport cascading.pipe.joiner.InnerJoin\nimport cascading.tuple.Fields\nimport scala.util.Random\n\nobject SVD extends Serializable {\n\n  import com.twitter.scalding.Dsl._\n\n  /**\n  * based on http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf\n  * page 121.\n  *\n  * generic parameters:\n  * R: the type of the row name variable.\n  * C: the type of the column name variable.\n  *\n  * input:\n  * X: a sparse matrix in the form ('row, 'col, 'val), with tuples of type (R, C, Double).\n  * d: number of principle components / singular values to compute\n  * extra_power: whether to take the second power of XX' in order to improve the approximation quality.\n  * reducers: how many reducers to use in the map-reduce stages.\n  *\n  * output:\n  * (U, E, V) with\n  * U : pipe of ('row, 'vec) where vec is a RealVector\n  * E : pipe of 'E which is an Array[Double] of singular values.\n  * V : pipe of ('col, 'vec) where vec is a RealVector\n  * note that the vectors are rows of the matrices U and V, not the columns which correspond to the left and right singular vectors.\n  */\n  def apply[R, C](X : Pipe, d : Int, extra_power : Boolean = true, reducers : Int = 500, no_power : Boolean = false) : (Pipe, Pipe, Pipe) = {\n\n    // Sample the columns, into the thin matrix.\n    val XS = X.groupBy('row){_.toList[(C, Double)](('col, 'val) -> 'list).reducers(reducers)}\n      .map('list -> 'vec){l : List[(C, Double)] =>\n        val a = new Array[Double](d+10)\n        l.foreach{i =>\n          val r = new Random(i._1.hashCode.toLong)\n          (0 until (d+10)).foreach{j =>\n            a(j) += r.nextGaussian * i._2\n          }\n        }\n        MatrixUtils.createRealVector(a)\n      }\n      .project('row, 'vec)\n\n    // Multiply by powers of XX'.  
This improves the approximation quality.\n    val Y = if(!no_power) {\n      val XXXS = X\n        .joinWithSmaller('row -> 'row_, XS.rename('row -> 'row_), new InnerJoin(), reducers)\n        .map(('val, 'vec) -> 'vec){x : (Double, RealVector) => x._2.mapMultiply(x._1)}\n        .groupBy('col){_.reduce('vec -> 'vec){(a : RealVector, b : RealVector) => a.add(b)}.forceToReducers.reducers(reducers)}\n        .joinWithSmaller('col -> 'col_, X.rename('col -> 'col_), new InnerJoin(), reducers)\n        .map(('val, 'vec) -> 'vec){x : (Double, RealVector) => x._2.mapMultiply(x._1)}\n        .groupBy('row){_.reduce('vec -> 'vec2){(a : RealVector, b : RealVector) => a.add(b)}.forceToReducers.reducers(reducers)}\n\n      if(extra_power) {\n        val XXXXXS = X\n          .joinWithSmaller('row -> 'row_, XXXS.rename('row -> 'row_), new InnerJoin(), reducers)\n          .map(('val, 'vec2) -> 'vec2){x : (Double, RealVector) => x._2.mapMultiply(x._1)}\n          .groupBy('col){_.reduce('vec2 -> 'vec2){(a : RealVector, b : RealVector) => a.add(b)}.forceToReducers.reducers(reducers)}\n          .joinWithSmaller('col -> 'col_, X.rename('col -> 'col_), new InnerJoin(), reducers)\n          .map(('val, 'vec2) -> 'vec2){x : (Double, RealVector) => x._2.mapMultiply(x._1)}\n          .groupBy('row){_.reduce('vec2 -> 'vec2){(a : RealVector, b : RealVector) => a.add(b)}.forceToReducers.reducers(reducers)}\n\n        XS\n          .joinWithSmaller('row -> 'row, XXXS, new InnerJoin(), reducers)\n          .map(('vec, 'vec2) -> 'vec){x : (RealVector, RealVector) => x._1.append(x._2)}\n          .project('row, 'vec)\n          .joinWithSmaller('row -> 'row, XXXXXS, new InnerJoin(), reducers)\n          .map(('vec, 'vec2) -> 'vec){x : (RealVector, RealVector) => x._1.append(x._2)}\n          .project('row, 'vec)\n      } else {\n        XS\n          .joinWithSmaller('row -> 'row, XXXS, new InnerJoin(), reducers)\n          .map(('vec, 'vec2) -> 'vec){x : (RealVector, RealVector) => 
x._1.append(x._2)}\n          .project('row, 'vec)\n      }\n    } else {\n      XS\n    }\n\n    // What follows is a QR decomposition of Y.\n    // Note: Y = QR means Y'Y = R'R so R = chol(Y'Y)\n    val YY = Y.mapTo('vec -> 'mat){x : RealVector => x.outerProduct(x)}\n      // Could rewrite addition function to act in-place on a or b here.\n      .groupAll{_.reduce('mat -> 'mat){(a : RealMatrix, b : RealMatrix) => a.add(b)}}\n      .mapTo('mat -> 'mat){m : RealMatrix =>\n        val chol = new CholeskyDecomposition(m)\n        new LUDecomposition(chol.getL).getSolver.getInverse\n      }\n\n    // Determine Q = YR^{-1}\n    val Q = Y.crossWithTiny(YY)\n      .map(('vec, 'mat) -> 'vec){x : (RealVector, RealMatrix) => x._2.operate(x._1)}\n      .project('row, 'vec)\n\n    // B = X'Q\n    val B = X.joinWithSmaller('row -> 'row, Q, new InnerJoin(), reducers)\n      .map(('val, 'vec) -> 'vec){x : (Double, RealVector) => x._2.mapMultiply(x._1)}\n      .groupBy('col){_.reduce('vec -> 'vec){(a : RealVector, b : RealVector) => a.combineToSelf(1, 1, b)}.reducers(reducers).forceToReducers}\n\n    // Use eig(B'B) to get at svd(B)  -- B = m * d\n    // RWR' = B'B -- R = d*d\n    // want UEV = B -- U = m*d, V = d*d\n    // so V'E^2V = B'B = RWR' so\n    // V = R'\n    // E = sqrt(W)\n    // U = BE^{-1}V'\n    val EB = B.mapTo('vec -> 'mat){x : RealVector => x.outerProduct(x)}\n      // Same re: optimizing the addition to not create temp objects.\n      .groupAll{_.reduce('mat -> 'mat){(a : RealMatrix, b : RealMatrix) => a.add(b)}}\n      .mapTo('mat -> ('eigs, 'eigmat)){m : RealMatrix =>\n        val e = new EigenDecomposition(m)\n        (e.getRealEigenvalues.map{ei => math.sqrt(ei)},\n         e.getVT)\n      }\n\n    val E = EB.project('eigs).map('eigs -> 'eigs){x : Array[Double] => x.take(d)}\n\n    val U = Q.crossWithTiny(EB.project('eigmat))\n      .map(('vec, 'eigmat) -> 'vec){x : (RealVector, RealMatrix) => x._2.operate(x._1).getSubVector(0,d)}\n      .project('row, 
'vec)\n\n    val V = B.crossWithTiny(EB)\n      .map(('vec, 'eigmat, 'eigs) -> 'vec){x : (RealVector, RealMatrix, Array[Double]) =>\n        MatrixUtils.createRealDiagonalMatrix(x._3.map{1.0/ _}).operate(x._2.operate(x._1)).getSubVector(0,d)\n      }\n      .project('col, 'vec)\n\n    (U, E, V)\n  }\n}\n\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/evaluate/GenericCrossValidator.scala",
    "content": "package com.etsy.conjecture.scalding.evaluate\n\nimport com.twitter.scalding._\n\nimport cascading.pipe.Pipe\nimport cascading.tuple.{ Fields, TupleEntry, Tuple }\n\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.evaluation._\nimport com.etsy.conjecture.model._\nimport com.etsy.conjecture.scalding.train._\n\nclass GenericCrossValidator[L <: Label, M <: UpdateableModel[L, M], E <: ModelEvaluation[L]](val evaluator: GenericEvaluator[L], val builder: AbstractModelTrainer[L, M], val folds: Int, val salt: String = \"\") extends Serializable {\n\n    import Dsl._\n\n    def crossValidateWithPredictions(pipe: Pipe, instanceField: Symbol, predictionField: Symbol, labelField: Symbol = '__actual): (Pipe, Pipe) = {\n        val folded = pipe.map(instanceField -> '__fold) { li: LabeledInstance[L] => (li.getVector.hashCode.toString + salt).hashCode % folds }\n            .forceToDisk\n        val preds = (0 until folds).map { i: Int => predictFold(folded, '__model, instanceField, labelField, predictionField, i) }\n        val eval = preds.map { i: Pipe => evaluator.evaluate(i, predictionField, labelField, '__eval) }.reduce { _ ++ _ }\n            .groupAll {\n                _.foldLeft('__eval -> '__eval)(new EvaluationAggregator[L]()) {\n                    (a: EvaluationAggregator[L], e: E) => a.add(e); a\n                }\n            }\n            .map('__eval -> '__eval) { a: EvaluationAggregator[L] => a.toString }\n        val preds_all = preds.reduce { _ ++ _ }.project(labelField, predictionField)\n        (eval, preds_all)\n    }\n\n    // note that the models on each fold may be calibrated differently, which will mess up the AUC calculation.\n    // this may not be a big problem though.\n    def predictFold(folded: Pipe, modelField: Symbol, instanceField: Symbol, labelField: Symbol, predictionField: Symbol, fold: Int): Pipe = {\n        val train_inst = folded.filter('__fold) { x: Int => x != fold }\n        val test_inst = 
folded.filter('__fold) { x: Int => x == fold }\n        val model = builder.train(train_inst, instanceField, modelField)\n        evaluator.assign_predictions(test_inst, instanceField, labelField, model, modelField, predictionField)\n    }\n\n    def crossValidate(pipe: Pipe, instanceField: Symbol): Pipe = {\n        crossValidateWithPredictions(pipe, instanceField, '__prediction, '__actual)._1\n    }\n\n    def evaluateFold(folded: Pipe, modelField: Symbol, instanceField: Symbol, labelField: Symbol, evalField: Symbol, fold: Int): Pipe = {\n        val train_inst = folded.filter('__fold) { x: Int => x != fold }\n        val test_inst = folded.filter('__fold) { x: Int => x == fold }\n        val model = builder.train(train_inst, instanceField, modelField)\n        evaluator.evaluate(test_inst, instanceField, labelField, model, modelField, evalField)\n    }\n}\n\nclass BinaryCrossValidator(args: Args, folds: Int) extends GenericCrossValidator[BinaryLabel, UpdateableLinearModel[BinaryLabel], BinaryModelEvaluation](\n    new BinaryEvaluator(), new BinaryModelTrainer(args), folds, args.getOrElse(\"salt\", \"\"))\n\nclass RegressionCrossValidator(args: Args, folds: Int) extends GenericCrossValidator[RealValuedLabel, UpdateableLinearModel[RealValuedLabel], RegressionModelEvaluation](\n    new RegressionEvaluator(), new RegressionModelTrainer(args), folds, args.getOrElse(\"salt\", \"\"))\n\nclass MulticlassCrossValidator(args: Args, folds: Int, categories: Array[String]) extends GenericCrossValidator[MulticlassLabel, UpdateableMulticlassLinearModel, MulticlassModelEvaluation](\n    new MulticlassEvaluator(categories), new MulticlassModelTrainer(args, categories), folds, args.getOrElse(\"salt\", \"\"))\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/evaluate/GenericEvaluator.scala",
    "content": "package com.etsy.conjecture.scalding.evaluate\n\nimport com.twitter.scalding._\nimport com.etsy.conjecture._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.evaluation._\nimport com.etsy.conjecture.model._\n\nimport cascading.pipe.Pipe\n\nabstract class GenericEvaluator[L <: Label] extends Serializable {\n\n    import Dsl._\n\n    def build(): ModelEvaluation[L]\n\n    def evaluate(instance_pipe: Pipe, predict_field: Symbol, label_field: Symbol, evaluation_field: Symbol): Pipe = {\n        val partialEval = '__partial_eval\n        instance_pipe\n          .map((label_field, predict_field) -> partialEval){\n            pair : (L, L) =>\n            val eval = build\n            eval.add(pair._1, pair._2)\n            eval\n          }\n          .groupAll{\n            _.reduce(partialEval -> evaluation_field){\n              (eval : ModelEvaluation[L], final_eval : ModelEvaluation[L]) =>\n              final_eval.merge(eval)\n              final_eval\n            }\n          }\n          .project(evaluation_field)\n    }\n\n    def evaluate(instance_pipe: Pipe, instance_field: Symbol, label_field: Symbol, model_pipe: Pipe, model_field: Symbol, evaluation_field: Symbol): Pipe = {\n        val instances_with_predictions = assign_predictions(instance_pipe, instance_field, label_field, model_pipe, model_field, 'prediction)\n        evaluate(instances_with_predictions, label_field, 'prediction, evaluation_field)\n    }\n\n    def assign_predictions(instance_pipe: Pipe, instance_field: Symbol, label_field: Symbol, model_pipe: Pipe, model_field: Symbol, prediction_field: Symbol = 'prediction) = {\n        instance_pipe.crossWithTiny(model_pipe)\n            .map((instance_field, model_field) -> (label_field, prediction_field)) { x: (LabeledInstance[L], Model[L]) => (x._1.getLabel, x._2.predict(x._1.getVector)) }\n            .project(label_field, prediction_field)\n    }\n}\n\nclass BinaryEvaluator extends GenericEvaluator[BinaryLabel] {\n 
   def build() = new BinaryModelEvaluation()\n}\n\nclass MulticlassEvaluator(categories: Array[String]) extends GenericEvaluator[MulticlassLabel] {\n    def build() = new MulticlassModelEvaluation(categories)\n}\n\nclass RegressionEvaluator extends GenericEvaluator[RealValuedLabel] {\n    def build() = new RegressionModelEvaluation()\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/factorize/FactorizationTools.scala",
    "content": "package com.etsy.conjecture.scalding.factorize\n\nimport cascading.pipe.Pipe\nimport org.apache.commons.math3.linear._\nimport cascading.pipe.joiner.InnerJoin\n\n\nobject FactorizationTools {\n\n  def approxLeftFactorsLeastSquaresBinary(rightFactors : Pipe, id_sym : Symbol, right_vec_sym : Symbol,\n                                          designMatrix : Pipe, left_id : Symbol, right_id : Symbol,\n                                          left_vec_symbol : Symbol,\n                                          spill_threshold : Int = 1000000, parallelism : Int = 1000) : Pipe = {\n\n    import com.twitter.scalding.Dsl._\n    approxLeftFactorsLeastSquares(rightFactors, id_sym, right_vec_sym,\n                                  designMatrix.insert('value, 1.0), left_id, right_id,\n                                  'value, left_vec_symbol, spill_threshold, parallelism)\n  }\n\n  def approxLeftFactorsLeastSquares(rightFactors : Pipe, id_sym : Symbol, right_vec_sym : Symbol,\n                                    designMatrix : Pipe, left_id : Symbol, right_id : Symbol,\n                                    value_sym : Symbol, left_vec_symbol : Symbol,\n                                    spill_threshold : Int = 1000000, parallelism : Int = 1000) : Pipe = {\n\n    import com.twitter.scalding.Dsl._\n    val inv_sym = 'inverse\n\n    val inv_self_outer = rightFactors\n      .mapTo(right_vec_sym -> right_vec_sym) {\n        l : RealVector => l.outerProduct(l)\n      }\n      .groupAll{ _.reduce[RealMatrix](right_vec_sym){ (x, y) => x.add(y) } }\n      .mapTo(right_vec_sym -> inv_sym) {\n        ll : RealMatrix => new LUDecomposition(ll).getSolver.getInverse\n      }\n\n    val premultiplied_right_factors = rightFactors\n      .crossWithTiny(inv_self_outer)\n      .map((right_vec_sym, inv_sym) -> right_vec_sym) {\n        x:(RealVector, RealMatrix) => x._2.operate(x._1)\n      }\n      .project(id_sym, right_vec_sym)\n\n\n    designMatrix.joinWithSmaller(right_id -> 
id_sym, premultiplied_right_factors, new InnerJoin(), parallelism)\n      // Save an alloc if we do the binary case.\n      .map((right_vec_sym, value_sym) -> right_vec_sym) { x : (RealVector, Double) => if(x._2 == 1.0) x._1 else x._1.mapMultiply(x._2) }\n      .groupBy(left_id) {\n        _.reduce[RealVector](right_vec_sym -> left_vec_symbol){ (x, y) => x.combineToSelf(1, 1, y) }\n         .reducers(parallelism)\n         .spillThreshold(spill_threshold)\n      }\n      .project(left_id, left_vec_symbol)\n  }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/AbstractModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\n\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\n\nimport com.twitter.scalding._\n\ntrait AbstractModelTrainer[L <: Label, M <: UpdateableModel[L, M]] extends Serializable {\n\n    def train(instances: Pipe, instanceField: Symbol, modelField: Symbol): Pipe\n\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/BinaryModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\n\nimport java.io.File\nimport scala.io.Source\n\nclass BinaryModelTrainer(args: Args) extends AbstractModelTrainer[BinaryLabel, UpdateableLinearModel[BinaryLabel]] with ModelTrainerStrategy[BinaryLabel, UpdateableLinearModel[BinaryLabel]] {\n\n    /** \n     * Number of iterations for sequential gradient descent\n     */\n    var iters = args.getOrElse(\"iters\", \"1\").toInt\n\n    override def getIters: Int = iters\n\n    /** \n     *  What type of linear model should be used?\n     *  Options are:\n     *  1. perceptron\n     *  2. linear_svm\n     *  3. logistic_regression\n     *  4. mira\n     */\n    val modelType = args.getOrElse(\"model\", \"logistic_regression\")\n\n    /**\n     *  What kind of learning rate schedule / regularization\n     *  should we use?\n     *\n     *  Options:\n     *  1. elastic_net\n     *  2. adagrad\n     *  3. passive_aggressive\n     *  4. ftrl\n     */\n    val optimizerType = args.getOrElse(\"optimizer\", \"elastic_net\")\n\n    /** Aggressiveness parameter for passive aggressive classifier **/\n    val aggressiveness = args.getOrElse(\"aggressiveness\", \"2.0\").toDouble\n\n    val finalThresholding = args.getOrElse(\"final_thresholding\", \"0.0\").toDouble\n\n    /**\n     * Initial learning rate used for SGD learning.\n     */\n    val initialLearningRate = args.getOrElse(\"rate\", \"0.1\").toDouble\n\n    /** Base of the exponential learning rate (e.g., 0.99^{# examples seen}). **/\n    val exponentialLearningRateBase = args.getOrElse(\"exponential_learning_rate_base\", \"1.0\").toDouble\n\n    /** Whether to use the exponential learning rate.  If not chosen then the learning rate is like 1.0 / epoch. 
**/\n    val useExponentialLearningRate = args.boolean(\"exponential_learning_rate_base\")\n\n    /** \n     * A fudge factor so that an \"epoch\" for the purpose of learning rate computation can be more than one example,\n     * in which case the \"epoch\" will take a fractional amount equal to {# examples seen} / examples_per_epoch.\n     */\n    val examplesPerEpoch = args.getOrElse(\"examples_per_epoch\", \"10000\").toDouble\n\n    /** How to subsample each class, in the case of imbalanced data. **/\n    val zeroClassProb = args.getOrElse(\"zero_class_prob\", \"1.0\").toDouble\n    val oneClassProb = args.getOrElse(\"one_class_prob\", \"1.0\").toDouble\n\n    /**\n     * Weight on laplace regularization- a laplace prior on the parameters\n     * sparsity inducing ala lasso\n     */\n    val laplace = args.getOrElse(\"laplace\", \"0.0\").toDouble\n\n    /**\n     * Weight on gaussian prior on the parameters\n     * similar to ridge \n     */\n    val gauss = args.getOrElse(\"gauss\", \"0.0\").toDouble\n\n    /**\n     *  Learning rate parameters for FTRL\n     */\n    val ftrlAlpha = args.getOrElse(\"ftrlAlpha\", \"1.0\").toDouble\n    val ftrlBeta = args.getOrElse(\"ftrlBeta\", \"1.0\").toDouble\n\n    /**\n     *  Choose an optimizer to use\n     */\n    val o = optimizerType match {\n            case \"elastic_net\" => new ElasticNetOptimizer()\n            case \"adagrad\" => new AdagradOptimizer()\n            case \"passive_aggressive\" => new PassiveAggressiveOptimizer().setC(aggressiveness).isHinge(true)\n            case \"ftrl\" => new FTRLOptimizer().setAlpha(ftrlAlpha).setBeta(ftrlBeta)\n            case \"control\" => new ControlOptimizer()\n            case \"mira\" => new MIRAOptimizer()\n        }\n\n    val optimizer = o.setGaussianRegularizationWeight(gauss)\n        .setLaplaceRegularizationWeight(laplace)\n        .setExamplesPerEpoch(examplesPerEpoch)\n        .setUseExponentialLearningRate(useExponentialLearningRate)\n        
.setExponentialLearningRateBase(exponentialLearningRateBase)\n        .setInitialLearningRate(initialLearningRate)\n\n    /** Period of gradient truncation updates **/\n    val truncationPeriod = args.getOrElse(\"period\", Int.MaxValue.toString).toInt\n\n    /** \n     * Aggressiveness of gradient truncation updates, how much shrinkage\n     * is applied to the model's parameters\n     */\n    val truncationAlpha = args.getOrElse(\"alpha\", \"0.0\").toDouble\n\n    /**\n     * Threshold for applying gradient truncation updates\n     * parameter values smaller than this in magnitude are truncated\n     */\n    val truncationThresh = args.getOrElse(\"thresh\", \"0.0\").toDouble\n\n    /** Size of minibatch for mini-batch training, defaults to 1 which is just SGD. **/\n    val batchsz = args.getOrElse(\"mini_batch_size\", \"1\").toInt\n    override def miniBatchSize: Int = batchsz\n\n    override def sampleProb(l: BinaryLabel): Double = {\n        if (l.getValue < 0.5)\n            zeroClassProb\n        else\n            oneClassProb\n    }\n\n    override def modelPostProcess(m: UpdateableLinearModel[BinaryLabel]): UpdateableLinearModel[BinaryLabel] = {\n        m.thresholdParameters(finalThresholding)\n        m.setArgString(args.toString)\n        m.teardown()\n        m\n    }\n\n    if(modelType == \"mira\" && optimizerType != \"mira\"){\n        throw new IllegalArgumentException(\"MIRA only uses a MIRAOptimizer\");\n    }\n\n    def getModel: UpdateableLinearModel[BinaryLabel] = {\n        val model = modelType match {\n            case \"perceptron\" => new Hinge(optimizer).setThreshold(0.0)\n            case \"linear_svm\" => new Hinge(optimizer).setThreshold(1.0)\n            case \"logistic_regression\" => new LogisticRegression(optimizer)\n            case \"mira\" => new MIRA()\n        }\n        model.setTruncationPeriod(truncationPeriod)\n             .setTruncationThreshold(truncationThresh)\n             .setTruncationUpdate(truncationAlpha)\n       
 model\n    }\n\n    val bins = args.getOrElse(\"bins\", \"100\").toInt\n\n    val trainer = if (args.boolean(\"large\")) new LargeModelTrainer(this, bins) else new SmallModelTrainer(this)\n\n    def train(instances: Pipe, instanceField: Symbol = 'instance, modelField: Symbol = 'model): Pipe = {\n        trainer.train(instances, instanceField, modelField)\n    }\n\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe = {\n        trainer.reTrain(instances, instanceField, model, modelField)\n    }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/ClusteringModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\nimport scala.collection.JavaConversions._\nimport java.io.File\nimport scala.io.Source\n\nclass ClusteringModelTrainer(args: Args, centers: Map[String, StringKeyedVector]) extends AbstractModelTrainer[ClusterLabel, ClusteringModel[ClusterLabel]] with ModelTrainerStrategy[ClusterLabel, ClusteringModel[ClusterLabel]] {\n\n    // number of iterations for\n    // sequential gradient descent\n    var iters = args.getOrElse(\"iters\", \"1\").toInt\n\n    /*\n     * Number of clusters to build\n     */\n    val num_clusters = args.getOrElse(\"num_clusters\",\"100\").toInt\n\n    /*\n     * Error tolerance for the l1 projection in 'web scale kmeans'\n     */\n    val error_tolerance = args.getOrElse(\"error_tolerance\",\"0.01\").toDouble\n\n    /*\n     * Ball radius for the l1 projection in 'web scale kmeans'\n     */\n    val ball_radius = args.getOrElse(\"ball_radius\",\"1.0\").toDouble\n\n    override def getIters: Int = iters\n\n    def getModel: ClusteringModel[ClusterLabel] = {\n        new KMeans(centers)\n        .setNumClusters(num_clusters)\n        .setL1ProjectionErrorTolerance(error_tolerance)\n        .setL1ProjectionBallRadius(ball_radius)\n    }\n\n    val bins = args.getOrElse(\"bins\", \"100\").toInt\n\n    val trainer = if (args.boolean(\"large\")) new LargeModelTrainer(this, bins) else new SmallModelTrainer(this)\n\n    def train(instances: Pipe, instanceField: Symbol = 'instance, modelField: Symbol = 'model): Pipe = {\n        trainer.train(instances, instanceField, modelField)\n    }\n\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe = {\n        trainer.reTrain(instances, instanceField, model, modelField)\n    }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/LargeModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.flow._\nimport cascading.operation._\nimport cascading.pipe._\nimport cascading.pipe.joiner.InnerJoin\n\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\n\nimport com.twitter.scalding._\n\nclass LargeModelTrainer[L <: Label, M <: UpdateableModel[L, M]](strategy: ModelTrainerStrategy[L, M], training_bins: Int) extends AbstractModelTrainer[L, M] {\n    import Dsl._\n\n    def train(instances: Pipe, instanceField: Symbol = 'instance, modelField: Symbol = 'model): Pipe = {\n        trainRecursively(None, modelField, binTrainingData(instances, instanceField), instanceField, strategy.getIters)\n    }\n\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe = {\n        throw new UnsupportedOperationException(\"not implemented due to expensiveness of model duplication\")\n    }\n\n    def binTrainingData(instances: Pipe, instanceField: Symbol): Pipe = {\n        instances\n            .project(instanceField)\n            .map(instanceField -> 'bin) { b: LabeledInstance[L] => b.hashCode % training_bins }\n    }\n\n    // This implements a full iteration of training, ending with a pipe with a model.\n    protected def trainIteration(modelPipe: Option[Pipe], modelField: Symbol, instancePipe: Pipe, instanceField: Symbol): Pipe = {\n        val iterationField = '__iteration__\n        val modelCountField = '__model_count__\n        // Subsample instances.\n        val subsampled = instancePipe.filter(instanceField) { i: LabeledInstance[L] => math.random < strategy.sampleProb(i.getLabel) }\n        // Get models on each mapper.\n        (modelPipe match {\n            case Some(pipe) => subsampled.joinWithSmaller('bin -> 'bin, pipe, new InnerJoin(), training_bins)\n            case _ => subsampled.map(instanceField -> (instanceField, modelField)) { x: LabeledInstance[L] => (x, strategy.getModel) }\n        })\n            // Count 
iteration numbers.\n            .insert(iterationField, 0)\n            .insert(modelCountField, 1)\n            // Convert instances to instance list.\n            .map(instanceField -> instanceField) { i: LabeledInstance[L] => List(i) }\n            // Perform map-side aggregation of models, which are then sent to a single reduce node for merging.\n            .groupBy('bin) {\n                _.reduce[(M, List[LabeledInstance[L]], Int, Int)](\n                    (modelField, instanceField, iterationField, modelCountField) -> (modelField, instanceField, iterationField, modelCountField))(strategy.modelReduceFunction)\n                    .reducers(training_bins)\n            }\n            .mapTo((modelField, iterationField) -> modelField) { x: (M, Int) => strategy.endIteration(x._1, x._2, training_bins) }\n            // flatten submodels and aggregate on different reducers.\n            .flatMapTo(modelField -> ('param, 'value)) { m: M =>\n                println(\"epoch: \" + m.getEpoch)\n                m.setParameter(\"__epoch__\", m.getEpoch)\n                new Iterable[(String, Double)]() {\n                    def iterator() = {\n                        new Iterator[(String, Double)]() {\n                            val it = m.decompose\n                            def hasNext: Boolean = { it.hasNext }\n                            def next: (String, Double) = { val e = it.next(); (e.getKey, e.getValue) }\n                        }\n                    }\n                }\n            }\n            .groupBy('param) { _.sum[Double]('value).forceToReducers }\n            // Duplicate the summed parameters rather than duplicating the reconstructed model, for speed reasons.\n            .flatMapTo(('param, 'value) -> ('bin, 'param, 'value)) {\n                b: (String, Double) =>\n                    (0 until training_bins).map { i => (i, b._1, b._2) }\n            }\n            // Reconstruct the model for each bin.  
Uses a hacked on Scalding operator due to kryo serialization not supporting copy().\n            .groupBy('bin) {\n                _.every {\n                    pipe =>\n                        new Every(\n                            pipe,\n                            ('param, 'value),\n                            new FoldAggregator[(String, Double), M](\n                                {\n                                    (model: M, param: (String, Double)) =>\n                                        if (param._1 == \"__epoch__\") {\n                                            val epoch = (param._2 / training_bins).toLong\n                                            println(\"epoch: \" + epoch)\n                                            model.setEpoch(epoch)\n                                        } else {\n                                            model.setParameter(param._1, param._2)\n                                        }\n                                        model\n                                },\n                                strategy.getModel,\n                                modelField,\n                                implicitly[TupleConverter[(String, Double)]],\n                                implicitly[TupleSetter[M]]) {\n                                override def start(flowProcess: FlowProcess[_], call: AggregatorCall[M]) = call.setContext(strategy.getModel)\n                            })\n                }\n                    .reducers(training_bins)\n            }\n            .project('bin, modelField)\n    }\n\n    protected def trainRecursively(modelPipe: Option[Pipe], modelField: Symbol, instancePipe: Pipe, instanceField: Symbol, iterations: Int): Pipe = {\n        val updatedPipe = trainIteration(modelPipe, modelField, instancePipe, instanceField)\n        if (iterations == 1) {\n            updatedPipe.filter('bin) { b: Int => b == 0 }.mapTo(modelField -> modelField) { strategy.modelPostProcess }.groupAll { _.pass }\n   
     } else {\n            trainRecursively(Some(updatedPipe), modelField, instancePipe, instanceField, iterations - 1)\n        }\n    }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/ModelTrainerStrategy.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\n\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\n\nimport com.twitter.scalding._\n\ntrait ModelTrainerStrategy[L <: Label, M <: UpdateableModel[L, M]] extends Serializable {\n    import Dsl._\n\n    // How many iterations of training to perform.\n    def getIters: Int = 1\n\n    // The subclass just implements a thing that creates an initial model, from a set of\n    // initial parameters.\n    def getModel: M\n\n    // Optionally subsample depending on the lable.\n    def sampleProb(label: L): Double = 1.0\n\n    // Optionally perform some post-processing on the model after training.\n    def modelPostProcess(m: M): M = m\n\n    def miniBatchSize: Int = 1\n\n    // Function to merge two sub-models.\n    // Can be overriden to change default behavior.\n    def mergeModels(model1: M, model2: M, iteration1: Int, iteration2: Int): M = {\n        model1.merge(model2, 1.0)\n        model1\n    }\n\n    // Do something at the end of the iteration.\n    def endIteration(model: M, iteration: Int, models: Int): M = {\n        model.reScale(1.0 / models)\n        model\n    }\n\n    // Train the model on a mini batch.\n    // Returns the trained model, remining instances from the list, and the iteration number.\n    def updateModelOnMiniBatch(model: M, instances: List[LabeledInstance[L]], start_iteration: Int): (M, List[LabeledInstance[L]], Int) = {\n        val batch_sz = miniBatchSize // might be something that needs computing who knows.   
\n        val mini_batch = new java.util.ArrayList[LabeledInstance[L]](batch_sz)\n        val n_batches = instances.size / batch_sz\n        var batch = 0\n        val iterator = instances.iterator\n        while (batch < n_batches) { // only full batches are processed; leftovers are returned as the remainder.\n            mini_batch.clear() // drop the previous batch so each update sees exactly batch_sz instances.\n            (0 until batch_sz).foreach { i => mini_batch.add(i, iterator.next) }\n            model.update(mini_batch)\n            batch += 1\n        }\n        val remainder = iterator.toList\n        (model, remainder, start_iteration + n_batches)\n    }\n\n    // This implements the associative operation of the model training.\n    // Note that each tuple contains the instances not yet used for training.\n    def modelReduceFunction(a: (M, List[LabeledInstance[L]], Int, Int), b: (M, List[LabeledInstance[L]], Int, Int)): (M, List[LabeledInstance[L]], Int, Int) = {\n        if (a._3 > 0 && b._3 > 0) {\n            // Both models have some prior training.\n            val ua = updateModelOnMiniBatch(a._1, a._2, a._3)\n            val ub = updateModelOnMiniBatch(b._1, b._2, b._3)\n            // Merge together\n            (mergeModels(ua._1, ub._1, ua._3, ub._3), ua._2 ++ ub._2, ua._3 + ub._3, a._4 + b._4)\n        } else if (b._3 > 0) {\n            // Only model b has some prior training.\n            // Update model b using all instances.\n            val uba = updateModelOnMiniBatch(b._1, a._2 ++ b._2, b._3)\n            (uba._1, uba._2, uba._3, b._4)\n        } else {\n            // Either no model is trained, or only a is.\n            // Update model a using whatever instances are available.\n            val uab = updateModelOnMiniBatch(a._1, a._2 ++ b._2, a._3)\n            (uab._1, uab._2, uab._3, a._4)\n        }\n    }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/MulticlassModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\nimport scala.io.Source\nimport scala.collection.JavaConversions._\n\nclass MulticlassModelTrainer(args: Args, categories: Array[String]) extends AbstractModelTrainer[MulticlassLabel, UpdateableMulticlassLinearModel] with ModelTrainerStrategy[MulticlassLabel, UpdateableMulticlassLinearModel] {\n\n    /** \n     * Number of iterations for sequential gradient descent\n     */\n    val iters = args.getOrElse(\"iters\", \"1\").toInt\n\n    /** \n     *  What type of linear model should be used?\n     *  Options are:\n     *  1. perceptron\n     *  2. linear_svm\n     *  3. logistic_regression\n     *  4. mira\n     */\n    val modelType = args.getOrElse(\"model\", \"logistic_regression\").toString\n\n    /**\n     *  What kind of learning rate schedule / regularization\n     *  should we use?\n     *\n     *  Options:\n     *  1. elastic_net\n     *  2. adagrad\n     *  3. passive_aggressive\n     *  4. ftrl\n     */\n    val optimizerType = args.getOrElse(\"optimizer\", \"elastic_net\")\n\n    /** Aggressiveness parameter for passive aggressive classifier **/\n    val aggressiveness = args.getOrElse(\"aggressiveness\", \"2.0\").toDouble\n\n    val finalThresholding = args.getOrElse(\"final_thresholding\", \"0.0\").toDouble\n\n    /**\n     * Initial learning rate used for SGD learning.\n     */\n    val initialLearningRate = args.getOrElse(\"rate\", \"0.1\").toDouble\n\n   /** Base of the exponential learning rate (e.g., 0.99^{# examples seen}). **/\n    val exponentialLearningRateBase = args.getOrElse(\"exponential_learning_rate_base\", \"1.0\").toDouble\n\n    /** Whether to use the exponential learning rate.  If not chosen then the learning rate is like 1.0 / epoch. 
**/\n    val useExponentialLearningRate = args.boolean(\"exponential_learning_rate_base\")\n\n    /** \n     * A fudge factor so that an \"epoch\" for the purpose of learning rate computation can be more than one example,\n     * in which case the \"epoch\" will take a fractional amount equal to {# examples seen} / examples_per_epoch.\n     */\n    val examplesPerEpoch = args.getOrElse(\"examples_per_epoch\", \"10000\").toDouble\n\n    /**\n     * Weight on laplace regularization- a laplace prior on the parameters\n     * sparsity inducing ala lasso\n     */\n    val laplace = args.getOrElse(\"laplace\", \"0.0\").toDouble\n\n    /**\n     * Weight on gaussian prior on the parameters\n     * similar to ridge \n     */\n    val gauss = args.getOrElse(\"gauss\", \"0.0\").toDouble\n\n    /** Period of gradient truncation updates **/\n    val truncationPeriod = args.getOrElse(\"period\", Int.MaxValue.toString).toInt\n\n    /** \n     * Aggressiveness of gradient truncation updates, how much shrinkage\n     * is applied to the model's parameters\n     */\n    val truncationAlpha = args.getOrElse(\"alpha\", \"0.0\").toDouble\n\n    /**\n     * Threshold for applying gradient truncation updates\n     * parameter values smaller than this in magnitude are truncated\n     */\n    val truncationThresh = args.getOrElse(\"thresh\", \"0.0\").toDouble\n\n    /**\n     *  Learning rate parameters for FTRL\n     */\n    val ftrlAlpha = args.getOrElse(\"ftrlAlpha\", \"1.0\").toDouble\n    val ftrlBeta = args.getOrElse(\"ftrlBeta\", \"1.0\").toDouble\n\n    val classSampleProbabilities = args.optional(\"class_probs\")\n      .map { entries : String =>\n        entries.split(\",\").map {\n          s:String =>\n            val p = s.split(\":\")\n          (p(0), p(1).toDouble)\n        }.toMap\n      }\n      .getOrElse(Map[String, Double]())\n\n    val classSampleProbabilityFile = args.optional(\"class_prob_file\")\n\n    // stores sampling rates for different classes\n    lazy val 
probabilityMap : Map[String, Double] = {\n      val probs = categories.map{ c:String => (c, classSampleProbabilities.getOrElse(c, 1.0)) }.toMap\n\n      classSampleProbabilityFile match {\n        case Some(f) => probs ++ Source.fromFile(f).getLines().map{\n          s:String =>\n            val p = s.split(\":\")\n            (p(0), p(1).toDouble)\n        }.toMap\n        case None => probs\n      }\n    }\n\n    override def getIters: Int = iters\n\n    override def sampleProb(l : MulticlassLabel) : Double = {\n      probabilityMap.getOrElse(l.getLabel(), 1.0)\n    }\n\n    override def modelPostProcess(m: UpdateableMulticlassLinearModel) : UpdateableMulticlassLinearModel = {\n        m.thresholdParameters(finalThresholding)\n        m.setArgString(args.toString)\n        m.teardown()\n        m\n    }\n\n    /**\n     *  Choose an optimizer to use\n     */\n    val o = optimizerType match {\n            case \"elastic_net\" => new ElasticNetOptimizer()\n            case \"adagrad\" => new AdagradOptimizer()\n            case \"passive_aggressive\" => new PassiveAggressiveOptimizer().setC(aggressiveness).isHinge(true)\n            case \"ftrl\" => new FTRLOptimizer().setAlpha(ftrlAlpha).setBeta(ftrlBeta)\n            case \"mira\" => new MIRAOptimizer()\n        }\n\n    val optimizer = o.setGaussianRegularizationWeight(gauss)\n        .setLaplaceRegularizationWeight(laplace)\n        .setExamplesPerEpoch(examplesPerEpoch)\n        .setUseExponentialLearningRate(useExponentialLearningRate)\n        .setExponentialLearningRateBase(exponentialLearningRateBase)\n        .setInitialLearningRate(initialLearningRate)\n\n    def buildMultiClassModel(buildSubModel : () => UpdateableLinearModel[BinaryLabel], categories : Array[String]) : UpdateableMulticlassLinearModel = {\n        val param = categories.map{ i : String => \n            (i, buildSubModel().setTruncationPeriod(truncationPeriod)\n                .setTruncationThreshold(truncationThresh)\n                
.setTruncationUpdate(truncationAlpha))\n            }.toMap\n        new UpdateableMulticlassLinearModel(new java.util.HashMap[String,UpdateableLinearModel[BinaryLabel]](param) )\n    }\n\n    if(modelType == \"mira\" && optimizerType != \"mira\"){\n        throw new IllegalArgumentException(\"MIRA only uses a MIRAOptimizer\");\n    }\n\n    def getModel: UpdateableMulticlassLinearModel = {\n      val model = modelType match {\n        case \"perceptron\" => buildMultiClassModel({() => new Hinge(optimizer).setThreshold(0.0)}, categories)\n        case \"linear_svm\" => buildMultiClassModel({() => new Hinge(optimizer).setThreshold(1.0)}, categories)\n        // TODO: re-make proper multiclass logistic regression instead of this one vs all thing.\n        case \"logistic_regression\" => buildMultiClassModel({() => new LogisticRegression(optimizer)}, categories)\n        // TODO: re-make multiclass mira.\n        case \"mira\" => buildMultiClassModel({() => new MIRA()}, categories)\n      }\n      model.setModelType(modelType)\n      model\n    }\n\n    val bins = args.getOrElse(\"bins\", \"100\").toInt\n\n    val trainer = if (args.boolean(\"large\")) new LargeModelTrainer(this, bins) else new SmallModelTrainer(this)\n\n    /** Size of minibatch for mini-batch training, defaults to 1 which is just SGD. **/\n    val batchsz = args.getOrElse(\"mini_batch_size\", \"1\").toInt\n    override def miniBatchSize: Int = batchsz\n\n    def train(instances: Pipe, instanceField: Symbol = 'instance, modelField: Symbol = 'model): Pipe = {\n        trainer.train(instances, instanceField, modelField)\n    }\n\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe = {\n        trainer.reTrain(instances, instanceField, model, modelField)\n    }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/RegressionModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\n\nclass RegressionModelTrainer(args: Args) extends AbstractModelTrainer[RealValuedLabel, UpdateableLinearModel[RealValuedLabel]]\n    with ModelTrainerStrategy[RealValuedLabel, UpdateableLinearModel[RealValuedLabel]] {\n\n    // number of iterations for\n    // sequential gradient descent\n    val iters = args.getOrElse(\"iters\", \"1\").toInt\n\n    override def getIters: Int = iters\n\n    // weight on laplace regularization- a laplace prior on the parameters\n    // sparsity inducing ala lasso\n    val laplace = args.getOrElse(\"laplace\", \"0.5\").toDouble\n\n    // weight on gaussian prior on the parameters\n    // similar to ridge \n    val gauss = args.getOrElse(\"gauss\", \"0.5\").toDouble\n\n    val modelType = \"least_squares\" // just one model type for regression at the moment\n\n    /**\n     *  What kind of learning rate schedule / regularization\n     *  should we use?\n     *\n     *  Options:\n     *  1. elastic_net\n     *  2. adagrad\n     *  3. passive_aggressive\n     *  4. ftrl\n     */\n    val optimizerType = args.getOrElse(\"optimizer\", \"elastic_net\")\n\n    // aggressiveness parameter for passive aggressive classifier\n    val aggressiveness = args.getOrElse(\"aggressiveness\", \"2.0\").toDouble\n\n    val ftrlAlpha = args.getOrElse(\"ftrlAlpha\", \"1.0\").toDouble\n\n    val ftrlBeta = args.getOrElse(\"ftrlBeta\", \"1.0\").toDouble\n\n    // initial learning rate used for SGD learning. 
this decays according to the\n    // inverse of the epoch\n    val initialLearningRate = args.getOrElse(\"rate\", \"0.1\").toDouble\n\n    // Base of the exponential learning rate (e.g., 0.99^{# examples seen}).\n    val exponentialLearningRateBase = args.getOrElse(\"exponential_learning_rate_base\", \"1.0\").toDouble\n\n    // Whether to use the exponential learning rate.  If not chosen then the learning rate is like 1.0 / epoch.\n    val useExponentialLearningRate = args.boolean(\"exponential_learning_rate_base\")\n\n    // A fudge factor so that an \"epoch\" for the purpose of learning rate computation can be more than one example,\n    // in which case the \"epoch\" will take a fractional amount equal to {# examples seen} / examples_per_epoch.\n    val examplesPerEpoch = args.getOrElse(\"examples_per_epoch\", \"10000\").toDouble\n\n    /**\n     *  Choose an optimizer to use\n     */\n    val o = optimizerType match {\n        case \"elastic_net\" => new ElasticNetOptimizer()\n        case \"adagrad\" => new AdagradOptimizer()\n        case \"passive_aggressive\" => new PassiveAggressiveOptimizer().setC(aggressiveness).isHinge(false)\n        case \"ftrl\" => new FTRLOptimizer().setAlpha(ftrlAlpha).setBeta(ftrlBeta)\n    }\n    val optimizer = o.setExamplesPerEpoch(examplesPerEpoch)\n                     .setUseExponentialLearningRate(useExponentialLearningRate)\n                     .setExponentialLearningRateBase(exponentialLearningRateBase)\n                     .setInitialLearningRate(initialLearningRate)\n\n    def getModel: UpdateableLinearModel[RealValuedLabel] = {\n        val model = modelType match {\n            case \"least_squares\" => new LeastSquaresRegressionModel(optimizer)\n        }\n        model\n    }\n\n    val bins = args.getOrElse(\"bins\", \"100\").toInt\n\n    val trainer = if (args.boolean(\"large\")) new LargeModelTrainer(this, bins) else new SmallModelTrainer(this)\n\n    def train(instances: Pipe, instanceField: Symbol = 
'instance, modelField: Symbol = 'model): Pipe = {\n        trainer.train(instances, instanceField, modelField)\n    }\n\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe = {\n        trainer.reTrain(instances, instanceField, model, modelField)\n    }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/train/SmallModelTrainer.scala",
    "content": "package com.etsy.conjecture.scalding.train\n\nimport cascading.pipe.Pipe\n\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\n\nimport com.twitter.scalding._\n\nclass SmallModelTrainer[L <: Label, M <: UpdateableModel[L, M]](strategy: ModelTrainerStrategy[L, M]) extends AbstractModelTrainer[L, M] {\n    import Dsl._\n\n    // Functionality to train a small model (hundreds of thousands of features, arbitrarily many instances)\n    // Trains a model on each mapper, then aggregates them on one reducer.\n    // The last step is expensive if the dimensionality is great, since the reducer has to deserialize large StringKeyedVectors.\n    def train(instances: Pipe, instanceField: Symbol = 'instance, modelField: Symbol = 'model): Pipe = {\n        // Begin training.\n        trainRecursively(None, modelField, instances, instanceField, strategy.getIters)\n    }\n\n    // Additional training for a small model.\n    def reTrain(instances: Pipe, instanceField: Symbol, model: Pipe, modelField: Symbol): Pipe = {\n        // Begin training.\n        trainRecursively(Some(model), modelField, instances, instanceField, strategy.getIters)\n    }\n\n    // This implements a full iteration of training, ending with a pipe with a model.\n    protected def trainIteration(modelPipe: Option[Pipe], modelField: Symbol, instancePipe: Pipe, instanceField: Symbol): Pipe = {\n        val iterationField: Symbol = '__iteration__\n        val modelCountField: Symbol = '__model_count__\n        // Subsample instances.\n        val subsampled = instancePipe.filter(instanceField) { i: LabeledInstance[L] => math.random < strategy.sampleProb(i.getLabel) }\n        // Get models on each mapper.\n        (modelPipe match {\n            case Some(pipe) => subsampled.project(instanceField).crossWithTiny(pipe.project(modelField))\n            case _ => subsampled.mapTo(instanceField -> (instanceField, modelField)) { x: LabeledInstance[L] => (x, strategy.getModel) }\n        
})\n            // Count iteration numbers.\n            .insert(iterationField, 0)\n            .insert(modelCountField, 1)\n            // Convert instances to instance list.\n            .map(instanceField -> instanceField) { i: LabeledInstance[L] => List(i) }\n            // Perform map-side aggregation of models, which are then sent to a single reduce node for merging.\n            .groupAll {\n                _.reduce[(M, List[LabeledInstance[L]], Int, Int)](\n                    (modelField, instanceField, iterationField, modelCountField) -> (modelField, instanceField, iterationField, modelCountField))(strategy.modelReduceFunction)\n            }\n            .mapTo((modelField, iterationField, modelCountField) -> modelField) { x: (M, Int, Int) => strategy.endIteration(x._1, x._2, x._3) }\n    }\n\n    protected def trainRecursively(modelPipe: Option[Pipe], modelField: Symbol, instancePipe: Pipe, instanceField: Symbol, iterations: Int): Pipe = {\n        val updatedPipe = trainIteration(modelPipe, modelField, instancePipe, instanceField)\n        if (iterations == 1) {\n            updatedPipe.map(modelField -> modelField) { strategy.modelPostProcess }\n        } else {\n            trainRecursively(Some(updatedPipe), modelField, instancePipe, instanceField, iterations - 1)\n        }\n    }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/util/BaseGridSearcher.scala",
    "content": "package com.etsy.conjecture.scalding.util\n\n\nimport scala.util.Random\nimport com.etsy.conjecture.scalding.util._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\nimport com.twitter.scalding._\nimport cascading.tuple.Fields\n\n/**\n  * Interface for using Conjecture's hyperparameter tuner.\n  * See the DefaultGridSearcher job for an example of how to extend this class.\n  */\nabstract class BaseGridSearcher(args : Args) extends Job(args) {\n  val input = args.getOrElse(\"input\", \"specify_an_input_dir\")\n  val out_dir = args.getOrElse(\"out_dir\", \"hypertuned\")\n  val folds = args.getOrElse(\"folds\", \"0\").toInt\n  val problemName = args.getOrElse(\"name\", \"demo_problem\")\n  val xmx = args.getOrElse(\"xmx\", \"3\").toInt\n  val containerMemory = (xmx * 1024 * 1.16).toInt\n\n  // Let the user configure the field names on the command line.\n  val data_field_names = args.getOrElse(\"data_fields\", \"instance\").split(\",\")\n  val data_fields = data_field_names.tail.foldLeft(new Fields(data_field_names.head)) { (x,y) => x.append(new Fields(y)) }\n  val instance_field = Symbol(args.getOrElse(\"instance_field\", \"instance\"))\n\n  val salt = args.getOrElse(\"salt\", \"\")\n  val numTrials = args.getOrElse(\"num_trials\", \"10\").toInt\n  val instances = SequenceFile(input, data_fields).project(instance_field)\n\n  //Define your parameters to optimize\n  val opts: DynamicOptions\n  val parameters: Seq[HyperParameter[_]]\n  //Define the searcher to use based on the classifier\n  val searcher: HyperparameterSearcher[_,_,_]\n  override def config: Map[AnyRef, AnyRef] = super.config ++\n    Map(\"mapred.child.java.opts\" -> \"-Xmx%dG\".format(xmx),\n        \"mapreduce.map.memory.mb\" -> containerMemory.toString,\n        \"mapreduce.reduce.memory.mb\" -> containerMemory.toString\n       )\n  \n}\n\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/util/DynamicOptions.scala",
    "content": "package com.etsy.conjecture.scalding.util\n\nimport java.io.Serializable\nimport com.twitter.scalding.Args\nimport scala.collection.mutable.{ArrayBuffer, HashMap}\nimport scala.reflect.runtime.universe._\n\nclass DynamicOptions(args: Args) extends Serializable {\n  private val opts = new HashMap[String, DynamicOption[_]]\n  parse(args)\n\n  def get(key: String) = opts.get(key)\n\n  def size = opts.size\n\n  var strict = true\n\n  def values = opts.values\n\n\n  def +=(c: DynamicOption[_]): this.type = {\n    if (opts.contains(c.name)) throw new Error(\"DynamicOption \" + c.name + \" already exists.\")\n    opts(c.name) = c\n    this\n  }\n\n  def -=(c: DynamicOption[_]): this.type = {\n    opts -= c.name\n    this\n  }\n\n  /** The arguments that were unqualified by dashed options.\n    * Currently unused but held for future work.\n    */\n  private val _remaining = new ArrayBuffer[String]\n  def remaining: Seq[String] = _remaining\n\n  /** Parse sequence of command-line arguments. 
*/\n  def parse(args: Args): Unit = {\n    args.m.filter(!_._2.isEmpty).foreach {\n      case (key, listValues) =>\n      //Only take first value from the values\n      val value = listValues.head\n      key match {\n        case k: String if (opts.contains(k)) => opts(k).parseValue(value)\n        case \"\" =>  _remaining += value\n        case _ => {\n          opts.+=((key, new DynamicOption(key, value)))\n          opts(key).parseValue(value)\n        }\n      }\n    }\n  }\n\n  def unParse: Args = {\n      val newM = opts.map{cmd => cmd._1->List(cmd._2.value.toString())}.toMap\n      new Args(newM)\n  }\n\n\n  class DynamicOption[T](val name:String, val defaultValue:T)(implicit m: Manifest[T]) extends com.etsy.conjecture.scalding.util.DynamicOption[T] with Serializable {\n    def this(name: String)(implicit m: Manifest[T]) = this(name, null.asInstanceOf[T])\n\n    DynamicOptions.this += this\n    private def valueClass: Class[_] = m.runtimeClass\n    private def valueType = m.runtimeClass\n\n    private def matches(str: String): Boolean = str == (\"--\" + name)\n\n    var _value = defaultValue\n\n    def value: T = {\n      _value\n    }\n\n    def setValue(v: T) {\n      _value = v\n    }\n\n    def hasValue = !(valueType eq classOf[Nothing])\n\n    var setCount = 0\n\n    /** Parses each of the supported value, increments set counter, and sets the value */\n    def parseValue(args: String) = {\n      setCount += 1\n      setValue(valueType match {\n        case t if t eq classOf[List[String]] => args.asInstanceOf[T]\n        case t if t eq classOf[List[Int]] => args.map(_.toInt).asInstanceOf[T]\n        case t if t eq classOf[List[Double]] => args.map(_.toDouble).asInstanceOf[T]\n        case t if t eq classOf[Char] => args.head.asInstanceOf[T]\n        case t if t eq classOf[String] => args.asInstanceOf[T]\n        case t if t eq classOf[Short] => args.toShort.asInstanceOf[T]\n        case t if t eq classOf[Int] => args.toInt.asInstanceOf[T]\n        case t 
if t eq classOf[Long] => args.toLong.asInstanceOf[T]\n        case t if t eq classOf[Double] => args.toDouble.asInstanceOf[T]\n        case t if t eq classOf[Float] => args.toFloat.asInstanceOf[T]\n        case t if t eq classOf[Boolean] => args.toBoolean.asInstanceOf[T]\n        case otw => throw new Error(\"DynamicOption does not handle values of type \" + otw)\n      })\n    }\n  }\n\n  override def toString: String = values.map(_.toString).mkString(\"\\t\")\n}\n\ntrait DynamicOption[T] {\n  def name: String\n  def defaultValue: T\n  def value: T\n  def setValue(v: T): Unit\n  def hasValue: Boolean\n  def setCount: Int\n  def wasSet = setCount > 0\n  override def toString: String = {\n    if (hasValue)\n      value match {\n        case a: Seq[_] => Seq(\"--\" + name + \" \") ++ a.map(_.toString)\n        case \"\" => Seq()\n        case a: Any => Seq(\"--\" + name + \" \" + value.toString)\n      }\n    else\n      Seq()\n  }.mkString(\"; \")\n  override def hashCode = name.hashCode\n  // Two options are equal when their names match, consistent with hashCode.\n  override def equals(other: Any) = other match {\n    case o: DynamicOption[_] => name == o.name\n    case _ => false\n  }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/scalding/util/HyperparameterSearcher.scala",
    "content": "package com.etsy.conjecture.scalding.util\n\nimport java.io.Serializable\n\nimport cascading.pipe.Pipe\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.evaluation._\nimport com.etsy.conjecture.model._\nimport com.etsy.conjecture.scalding.evaluate.{BinaryEvaluator, GenericEvaluator, MulticlassEvaluator, RegressionEvaluator}\nimport com.etsy.conjecture.scalding.train._\nimport com.twitter.scalding.Dsl._\nimport com.twitter.scalding._\n\nimport scala.util.Random\n\n/**\n  * Samples random parameter values to perform a fast efficient hyperparameter search\n  */\nabstract class HyperparameterSearcher[L <: Label, M <: UpdateableModel[L, M], E <: ModelEvaluation[L]]\n                                      (val options: DynamicOptions, val parameters: Seq[HyperParameter[_]],\n                                       val numTrials: Int, rng: Random = new Random(0)) extends Serializable {\n\n  //Map of trial id to a randomly sampled set of parameters\n  val settings = (0 until numTrials).map { trial : Int => trial -> sampledParameters(rng) }.toMap\n\n  def getModelTrainer(args: Args): ModelTrainerStrategy[L, M]\n  val evaluator: GenericEvaluator[L]\n\n  //draw parameter values from given sample method\n  //save values in a new Arg instance\n  def sampledParameters(rng: Random): Args = {\n    parameters.foreach(_.set(rng))\n    options.unParse\n  }\n\n  def search (instances: Pipe, instance_field: Symbol): (Pipe, Pipe) = {\n    //Split train test by ratio\n    //TODO Should make this into a parameter\n    val splitSet = instances.map(instance_field -> '__fold) { li: LabeledInstance[L] => rng.nextInt(10) <= 7 }\n    val trainSet = splitSet.filter('__fold) { foldId: Boolean => foldId }\n    val testSet = splitSet.filter('__fold) { foldId: Boolean => !foldId }\n\n    //Restructure data pipes into trainable format and assign an ID that defines the random setting to use\n    val train = generateTrials(trainSet, instance_field)\n    val test = 
generateTrials(testSet, instance_field)\n\n    val models = trainTrials(train)\n\n    val rawResults = evaluate(models, test)\n\n    val runResults = rawResults\n      .mapTo(('settings, 'eval) -> 'result) {\n        x: (Args, Double) =>\n          x._2 + \"\\t\" + x._1.toString\n      }\n    //Tally up test accuracies to find metrics on tested param values\n    val paramReport = createParameterReport(rawResults)\n\n    (runResults, paramReport)\n  }\n\n  def generateTrials(instances: Pipe, instance_field: Symbol): Pipe = {\n    instances\n      .flatMapTo(instance_field -> ('instance, 'trial)) {\n         instance: LabeledInstance[L] =>\n             settings.keySet.map {\n               trial: Int =>\n                  (instance, trial)\n             }\n       }\n       .groupBy('trial) {\n          _.toList[LabeledInstance[L]]('instance -> 'instances).reducers(1000)\n       }.project('trial, 'instances)\n  }\n\n  def trainTrials(instances: Pipe): Pipe = {\n       instances\n         .mapTo(('trial, 'instances) -> ('trial, 'model)) {\n           x: (Int, List[LabeledInstance[L]]) =>\n           //In the unlikely case that a setting does not exist, use the default value\n           val args: Args = settings.getOrElse(x._1, options.unParse)\n           val instanceSet = x._2\n           val modelTrainer = getModelTrainer(args)\n           val model = modelTrainer.getModel\n           //Train model\n           instanceSet.foreach(model.update)\n           (x._1, model)\n        }\n  }\n\n  def evaluate (models: Pipe, testSet: Pipe): Pipe = {\n     val eval = models\n       .joinWithSmaller('trial -> 'trial, testSet)\n       .mapTo(('model, 'instances, 'trial) -> ('eval, 'settings)) {\n          x: (M, List[LabeledInstance[L]], Int) =>\n          val model = x._1\n          val testList = x._2\n          val args = settings.getOrElse(x._3, options.unParse)\n          val acc = evaluateAccuracy(testList, model)\n          (acc, args)\n       }\n     eval\n  }\n\n  def 
evaluateAccuracy(instances: List[LabeledInstance[L]], model: M): Double = {\n      val eval = evaluator.build\n      instances.map {\n        instance: LabeledInstance[L] =>\n              val realLabel = instance.getLabel\n              val prediction = model.predict(instance.getVector)\n              eval.add(realLabel, prediction)\n      }\n      val agg = new EvaluationAggregator[L]()\n      agg.add(eval)\n      agg.getValue(\"Acc (avg)\")\n  }\n\n\n  //Tally up evaluation scores of random parameter values and create a report of the mean/stdDev/max/count_of_runs\n  def createParameterReport(rawResults: Pipe): Pipe = {\n    rawResults\n      .groupAll{\n        _.toList[(Args, Double)](('settings, 'eval) -> 'results)\n      }\n      .mapTo('results -> 'report) {\n         results: List[(Args,Double)] =>\n           results.map{ x =>\n             options.parse(x._1)\n             parameters.foreach(_.accumulate(x._2))\n            }\n            parameters.map(_.report).mkString(\"\\n\")\n      }\n  }\n}\n\n\nclass BinaryHyperparameterSearcher(option: DynamicOptions, parameters: Seq[HyperParameter[_]],\n                                 numTrials: Int) extends HyperparameterSearcher[BinaryLabel, UpdateableLinearModel[BinaryLabel], BinaryModelEvaluation](option, parameters, numTrials) {\n    val evaluator = new BinaryEvaluator()\n    def getModelTrainer(args: Args) = new BinaryModelTrainer(args)\n}\n\nclass RegressionHyperparameterSearcher(option: DynamicOptions, parameters: Seq[HyperParameter[_]],\n                                 numTrials: Int) extends HyperparameterSearcher[RealValuedLabel, UpdateableLinearModel[RealValuedLabel], RegressionModelEvaluation](option, parameters, numTrials) {\n    val evaluator = new RegressionEvaluator()\n    def getModelTrainer(args: Args) = new RegressionModelTrainer(args)\n}\n\nclass MulticlassHyperparameterSearcher(option: DynamicOptions, parameters: Seq[HyperParameter[_]], numTrials: Int, categories: Array[String]) extends 
HyperparameterSearcher[MulticlassLabel, UpdateableMulticlassLinearModel, MulticlassModelEvaluation](option, parameters, numTrials) {\n    val evaluator = new MulticlassEvaluator(categories)\n    def getModelTrainer(args: Args) = new MulticlassModelTrainer(args, categories)\n}\n\n/**\n * Sampling method for hyperparameters\n * Also defines how to bucket parameter values and accuracies for hyperparameter reports\n */\ntrait ParameterSampler[T] extends Serializable {\n  def sample(rng: scala.util.Random): T\n  //Array corresponds to bucketed parameter value, then the sum, sumSq, max, count of times that param bucket was run\n  val buckets: Array[(T, Double, Double, Double, Int)]\n  def valueToBucket(v: T): Int\n  def accumulate(value: T, d: Double) {\n    val v = math.max(0, math.min(valueToBucket(value), buckets.length-1))\n    val (_, sum, sumSq, max, count) = this.buckets(v)\n    this.buckets(v) = (value, sum+d, sumSq+d*d, math.max(max, d), count+1)\n  }\n}\n\n/**\n * Samples uniformly one value from the sequence.\n */\nclass SampleFromSeq[T](seq: Seq[T]) extends ParameterSampler[T] {\n  val buckets = seq.map(s => (s, 0.0, 0.0, 0.0, 0)).toArray\n  def valueToBucket(v: T) = buckets.toSeq.map(_._1).indexOf(v)\n  def sample(rng: Random) = seq(rng.nextInt(seq.length))\n}\n\n/**\n * Samples uniformly a double that falls within the range\n */\nclass UniformDoubleSampler(lower: Double, upper: Double, numBuckets: Int = 10) extends ParameterSampler[Double] {\n  val dif = upper - lower\n  val buckets = (0 to numBuckets).map(i => (0.0, 0.0, 0.0, 0.0, 0)).toArray\n  def valueToBucket(d: Double) = (numBuckets*(d - lower)/dif).toInt\n  def sample(rng: Random) = rng.nextDouble()*dif + lower\n}\n/**\n * Samples Doubles in the range such that their logarithm is uniform.\n * Useful for learning rates, variances, alphas, and other things which\n * vary in order of magnitude.\n */\nclass LogUniformDoubleSampler(lower: Double, upper: Double, numBuckets: Int = 10) extends 
ParameterSampler[Double] {\n  val inner = new UniformDoubleSampler(math.log(lower), math.log(upper), numBuckets)\n  def valueToBucket(v: Double) = inner.valueToBucket(math.log(v))\n  val buckets = (0 to numBuckets).map(i => (0.0, 0.0, 0.0, 0.0, 0)).toArray\n  def sample(rng: Random) = math.exp(inner.sample(rng))\n}\n\n/**\n * A container for a hyperparameter\n * @param option The DynamicOption wrapper for the parameter\n * @param sampler Sampler to use to return values for the parameter\n */\nclass HyperParameter[T](option: DynamicOption[T], val sampler: ParameterSampler[T]) {\n  val buckets = sampler.buckets\n  def set(rng: Random) { option.setValue(sampler.sample(rng)) }\n  def accumulate(objective: Double) { sampler.accumulate(option.value, objective) }\n  def report(): String = {\n    val buff = new StringBuilder(\"Parameter: \"+option.name+\"\\tMean\\tStdDev\\tMax\\tCount\\n\")\n    for ((value, sum, sumSq, max, count) <- buckets) {\n      val mean = sum/count\n      val stdDev = math.sqrt(sumSq/count - mean*mean)\n      val metrics = value match {\n        case v: Double => Vector(f\"${v.toDouble}%2.15f\",  f\"$mean%2.4f\", f\"$stdDev%2.4f\",f\"$max%1.4f\", count).mkString(\"\\t\")\n        case _ => Vector(f\"${value.toString}%20s\",  f\"$mean%2.4f\", f\"$stdDev%2.4f\",f\"$max%1.4f\", count).mkString(\"\\t\")\n      }\n      buff.append(metrics + \"\\n\")\n    }\n    buff.toString()\n  }\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/text/FeatureHelper.scala",
    "content": "package com.etsy.conjecture.text\n\nimport com.etsy.conjecture.data.{ AbstractInstance, BinaryLabeledInstance, LabeledInstance, StringKeyedVector }\n\nimport com.twitter.algebird.Operators._\n\nimport cascading.tuple.Fields\nimport cascading.pipe.Pipe\n\nimport scala.collection.JavaConverters._\n\nimport spray.json._\nimport DefaultJsonProtocol._\n\n\nobject FeatureHelper {\n\n    import com.twitter.scalding.Dsl._\n\n    def keepFeaturesWithCountGreaterThan(pipe: Pipe, instance_field: Fields, n: Int): Pipe = {\n        val counts = pipe\n            .flatMapTo(instance_field -> ('term, '__count)) {\n                v: AnyRef =>\n                    val vector = v match {\n                        case skv: StringKeyedVector => skv\n                        case ins: AbstractInstance[_] => ins.getVector\n                        case lin: LabeledInstance[_] => lin.getVector\n                        case _ => throw new IllegalArgumentException(\"keepFeaturesWithCountGreaterThan does not expect class: \" + v.getClass.getName)\n                    }\n                    vector.keySet.asScala.map { k => k -> 1 }\n            }\n            .groupBy('term) { _.sum[Long]('__count) }\n            .filter('__count) { c: Long => c > n }\n            .mapTo('term -> 'set) { t: String => Set(t) }\n            .groupAll { _.sum[Set[String]]('set) }\n\n        pipe\n            .crossWithTiny(counts)\n            .map(instance_field.append('set) -> instance_field) { x: (AnyRef, Set[String]) =>\n                val skv = x._1 match {\n                    case s: StringKeyedVector => s\n                    case i: AbstractInstance[_] => i.getVector\n                    case l: LabeledInstance[_] => l.getVector\n                    case _ => throw new IllegalArgumentException(\"keepFeaturesWithCountGreaterThan does not expect class: \" + x._1.getClass.getName)\n                }\n                val it = skv.iterator\n                while (it.hasNext) {\n              
      val e = it.next\n                    if (!x._2.contains(e.getKey)) {\n                        it.remove\n                    }\n                }\n                x._1\n            }\n    }\n\n    def nGramsUpTo(string: String, n: Int = 2, prefix: String = \"\"): List[String] = {\n        val toks = Text(string.toLowerCase).standardTextFilter.toString.split(\" \").toList\n        val toks_pad = \"\" +: toks :+ \"\"\n        val grams = (1 to n).map { m => toks_pad.sliding(m).toList.map { p => p.mkString(\"::\") } }.foldLeft(List[String]()) { _ ++ _ }\n        grams.filter { g => g != \"\" }.map { g => prefix + g }\n    }\n\n    def stringListToSKV(list: List[String], weight: Double = 1.0): StringKeyedVector = {\n        val skv = new StringKeyedVector();\n        list.foreach { f => skv.setCoordinate(f, weight) }\n        skv\n    }\n\n    def getEmailBody(body: String): Option[String] = {\n        val p = parseEmailBodyToTextAndType(body)\n        if (p._1 != null)\n            Some(p._1)\n        else\n            None\n    }\n\n    def parseEmailBodyToTextAndType(body: String): (String, String) = {\n        try {\n            val email = JsonParser(body).convertTo[List[Map[String, String]]]\n            val textParts = email.filter(part => part(\"type\") == \"text/plain\")\n            if (textParts.length > 0)\n                (textParts.map(part => part(\"body\")).mkString(\" \"), \"text/plain\")\n            else {\n                val htmlParts = email.filter(part => part(\"type\") == \"text/html\")\n                if (htmlParts.length > 0)\n                    (htmlParts.map(part => part(\"body\")).mkString(\" \"), \"text/html\")\n                else\n                    (null, \"filter\") // Filter this email\n            }\n        } catch {\n            case _ : Exception => (null, \"filter\")\n        }\n    }\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/text/Text.scala",
    "content": "package com.etsy.conjecture.text\n\ncase class Text(val input: String) {\n\n    private implicit def text2str(txt: Text): String = txt.input\n    private implicit def str2text(str: String): Text = new Text(str)\n\n    override def toString = input.toString\n\n    def replaceNumbers(replacement: String = \"_num_\") = Text(input.replaceAll(\"[0-9]+\", replacement).replaceAll(replacement + \"\\\\s+\" + replacement, replacement))\n\n    def replaceHTMLEscapes(replacement: String = \" \") = Text(input.replaceAll(\"&[^;]+;\", replacement))\n\n    def removeHTMLTags() = Text(input.replaceAll(\"<.*?>\", \" \")) //Text(XML.loadString(input).text)\n\n    def replaceHTMLTags(replacement: String = \" \") = Text(input.replaceAll(\"<[^>]+>\", \" \"))\n\n    def replaceNonAlphaNumeric(replacement: String = \" \") = Text(input.replaceAll(\"[^a-zA-Z0-9\\\\.\\\\s\\\\-]+\", replacement))\n\n    def replaceNonAlphaNumericUnderscore(replacement: String = \" \") = Text(input.replaceAll(\"[^a-zA-Z0-9\\\\.\\\\s\\\\-_]+\", replacement))\n\n    def replaceNonAlpha(replacement: String = \" \") = Text(input.replaceAll(\"[^a-zA-Z]+\", replacement))\n\n    def collapseHyphens() = Text(input.replaceAll(\"--+\", \"--\"))\n\n    def collapseUnderscores() = Text(input.replaceAll(\"__+\", \"__\"))\n\n    def collapsePeriods() = Text(input.replaceAll(\"\\\\.\\\\.+\", \"..\"))\n\n    def toLowerCase() = Text(input.toLowerCase)\n\n    def toUpperCase() = Text(input.toUpperCase)\n\n    def stripPunctuation() = Text(input.replaceAll(\"^[^A-Za-z0-0]+\", \"\").replaceAll(\"[^A-Za-z0-9]+$\", \"\"))\n\n    // compact any white space\n    def collapse() = Text(input.replaceAll(\"\\\\s+\", \" \"))\n\n    // remove any whitespace from the right of a string\n    def rstrip() = Text(input.replaceAll(\"\\\\s+$\", \"\"))\n\n    // remove any whitespace from the left of a string\n    def lstrip() = Text(input.replaceAll(\"^\\\\s+\", \"\"))\n\n    // remove any leading or trailing whitespace\n    def 
strip() = Text(input.trim)\n\n    // clean up any whitespace\n    def wsclean() = strip().collapse()\n\n    // remove any unprintable non-ASCII characters\n    def removeUnprintables(input: String) = Text(input.replaceAll(\"[^\\\\x20-\\\\x7E]\", \"\"))\n\n    def collapseWhitespaceAndPunc = Text(input.replaceAll(\"\\\\s+\", \" \")\n        .replaceAll(\"[\\\\-]+\", \"-\")\n        .replaceAll(\"[\\\\.]+\", \".\"))\n\n    def standardTextFilter = Text(removeHTMLTags()\n        .replaceHTMLEscapes()\n        .replaceNumbers()\n        .replaceNonAlphaNumericUnderscore()\n        .collapseHyphens()\n        .collapseUnderscores()\n        .wsclean())\n\n    def toListFromShingles(n: Int, ns: Int*): List[String] = (List(n) ++ ns.toList).flatMap{ i: Int => input.sliding(i) }.toList\n\n    def toSequenceFromShingles(n: Int, ns: Int*): TextSequence = new TextSequence(toListFromShingles(n, ns: _*))\n\n    def toList(sep: String = \" \"): List[String] = input.split(sep).toList\n\n    def toSequence(sep: String = \" \"): TextSequence = new TextSequence(toList(sep))\n\n    def isEmpty(): Boolean = input.isEmpty()\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/conjecture/text/TextSequence.scala",
    "content": "package com.etsy.conjecture.text\n\nimport com.etsy.conjecture.data.{BinaryLabeledInstance,BinaryLabel,MulticlassLabel,MulticlassLabeledInstance}\n\ncase class TextSequence(tokens: Seq[String]) {\n\n    def ++(that: TextSequence): TextSequence = TextSequence(tokens ++ that.tokens)\n\n    def mkString(glue: String = \" \"): String = tokens.mkString(glue)\n\n    override def toString = mkString(\" \")\n\n    def intersect(that: TextSequence): TextSequence = TextSequence(tokens.intersect(that.tokens))\n\n    def filterBlank = TextSequence(tokens.filter { x => x.isEmpty })\n\n    def filterStopwords = TextSequence(tokens.filter { x => !Stopwords(x.toLowerCase) })\n\n    def stopwords = TextSequence(tokens.filter { x => Stopwords(x.toLowerCase) })\n\n    def filterBadwords = TextSequence(tokens.filter { x => !BadWords(x.toLowerCase) })\n\n    def badwords = TextSequence(tokens.filter { x => BadWords(x.toLowerCase) })\n\n    def filterAllCaps = TextSequence(tokens.filter { x => !x.matches(\"^[A-Z]+$\") })\n\n    def allCaps = TextSequence(tokens.filter { x => x.matches(\"^[A-Z]+$\") })\n\n    def filterCapitalized = TextSequence(tokens.filter { x => !x.matches(\"^[A-Z][^A-Z]*\") })\n\n    def capitalized = TextSequence(tokens.filter { x => x.matches(\"^[A-Z][^A-Z]*\") })\n\n    def filterLowercase = TextSequence(tokens.filter { x => !x.matches(\"^[a-z]+$\") })\n\n    def allLowercase = TextSequence(tokens.filter { x => x.matches(\"^[a-z]+$\") })\n\n    def filterURLs = TextSequence(tokens.filter { x => !x.matches(\"^https?://.+\") })\n\n    def allURLs = TextSequence(tokens.filter { x => x.matches(\"^https?://.+\") })\n\n    def filterListings = TextSequence(tokens.filter { x => !x.matches(\"^https?://.+etsy.+/listing/[0-9]+.*\") })\n\n    def allListings = TextSequence(tokens.filter { x => x.matches(\"^https?://.+etsy.+/listing/[0-9]+.*\") })\n\n    def size: Int = tokens.size\n\n    def stopWordCount: Int = stopwords.size\n\n    def stopWordFraq(bins: 
Int = 10): Int = (math.round(bins * stopWordCount / size) / bins.toDouble).toInt\n\n    def badWordCount: Int = badwords.size\n\n    def badWordFraq(bins: Int = 10): Int = (math.round(bins * badWordCount / size) / bins.toDouble).toInt\n\n    def capsCount: Int = allCaps.size\n\n    def capFraq(bins: Int = 10): Int = (math.round(bins * capsCount / size) / bins.toDouble).toInt\n\n    def urlCount: Int = allURLs.size\n\n    def urlFraq(bins: Int = 10): Int = (math.round(bins * urlCount / size) / bins.toDouble).toInt\n\n    def listingsCount: Int = allListings.size\n\n    def listingsFraq(bins: Int = 10): Int = (math.round(bins * listingsCount / size) / bins.toDouble).toInt\n\n    def sizeBin = math.floor(math.log(size)).toInt\n\n    // filtering methods\n\n    def replaceNumbers(replacement: String = \"_num_\") = TextSequence(tokens.map { input => input.replaceAll(\"[0-9]+\", replacement).replaceAll(replacement + \"\\\\s+\" + replacement, replacement) })\n\n    def replaceHTMLEscapes(replacement: String = \" \") = TextSequence(tokens.map { input => input.replaceAll(\"&[^;]+;\", replacement) })\n\n    def removeHTMLTags() = TextSequence(tokens.map { input => input.replaceAll(\"<.*?>\", \" \") })\n\n    def replaceHTMLTags(replacement: String = \" \") = TextSequence(tokens.map { input => input.replaceAll(\"<[^>]+>\", replacement) })\n\n    def replaceNonAlphaNumeric(replacement: String = \" \") = TextSequence(tokens.map { input => input.replaceAll(\"[^a-zA-Z0-9\\\\.\\\\s\\\\-]+\", replacement) })\n\n    def replaceNonAlphaNumericUnderscore(replacement: String = \" \") = TextSequence(tokens.map { input => input.replaceAll(\"[^a-zA-Z0-9\\\\.\\\\s\\\\-_]+\", replacement) })\n\n    def replaceNonAlpha(replacement: String = \" \") = TextSequence(tokens.map { input => input.replaceAll(\"[^a-zA-Z]+\", replacement) })\n\n    def collapseHyphens() = TextSequence(tokens.map { input => input.replaceAll(\"--+\", \"--\") })\n\n    def collapseUnderscores() = TextSequence(tokens.map { 
input => input.replaceAll(\"__+\", \"__\") })\n\n    def collapsePeriods() = TextSequence(tokens.map { input => input.replaceAll(\"\\\\.\\\\.+\", \"..\") })\n\n    def stripPunctuation() = TextSequence(tokens.map { input => input.replaceAll(\"^[^A-Za-z0-9]+\", \"\").replaceAll(\"[^A-Za-z0-9]+$\", \"\") })\n\n    // compact any white space\n    def collapse() = TextSequence(tokens.map { input => input.replaceAll(\"\\\\s+\", \" \") })\n\n    // remove any whitespace from the right of a string\n    def rstrip() = TextSequence(tokens.map { input => input.replaceAll(\"\\\\s+$\", \"\") })\n\n    // remove any whitespace from the left of a string\n    def lstrip() = TextSequence(tokens.map { input => input.replaceAll(\"^\\\\s+\", \"\") })\n\n    // remove any leading or trailing whitespace\n    def strip() = TextSequence(tokens.map { input => (input.trim) })\n\n    // clean up any whitespace\n    def wsclean() = strip().collapse()\n\n    // remove any characters outside the printable ASCII range (0x20-0x7E)\n    def removeUnprintables(input: String) = TextSequence(tokens.map { input => input.replaceAll(\"[^\\\\x20-\\\\x7E]\", \"\") })\n\n    def collapseWhitespaceAndPunc = TextSequence(tokens.map { input =>\n        input.replaceAll(\"\\\\s+\", \" \")\n            .replaceAll(\"[\\\\-]+\", \"-\")\n            .replaceAll(\"[\\\\.]+\", \".\")\n    })\n\n    def ngrams(n: Int, glue: String = \" \") = new TextSequence(tokens.sliding(n).map { x => x.mkString(glue) }.toList)\n\n    def shingles(n: Int, whitespace: String = \"_\"): TextSequence = {\n        val str = tokens.mkString(whitespace)\n        TextSequence(str.sliding(n).toList)\n    }\n\n    def prependNameSpace(namespace: String) = new TextSequence(tokens.map { x => namespace + x })\n\n    def toList = tokens.toList\n\n    def toBinaryLabeledInstance(label: Double): BinaryLabeledInstance = {\n      toBinaryLabeledInstance(new BinaryLabel(label))\n    }\n\n    def toBinaryLabeledInstance(label: BinaryLabel): BinaryLabeledInstance = {\n      
  val instance = new BinaryLabeledInstance(label)\n\n        tokens.foreach {\n            x => instance.addTerm(x)\n        }\n\n        instance\n    }\n\n\n    def toMulticlassLabeledInstance(label: MulticlassLabel): MulticlassLabeledInstance = {\n        val instance = new MulticlassLabeledInstance(label)\n\n        tokens.foreach {\n            x => instance.addTerm(x)\n        }\n\n        instance\n    }\n\n\n}\n\nobject Stopwords {\n    def apply(input: String): Boolean = stopwords.contains(input)\n\n    val stopwords = Set(\"a\", \"as\", \"able\", \"about\", \"above\", \"according\", \"accordingly\", \"across\", \"actually\", \"after\", \"afterwards\", \"again\", \"against\", \"aint\", \"all\", \"allow\", \"allows\", \"almost\", \"alone\", \"along\", \"already\", \"also\", \"although\", \"always\", \"am\", \"among\", \"amongst\", \"amoungst\", \"amount\", \"an\", \"and\", \"another\", \"any\", \"anybody\", \"anyhow\", \"anyone\", \"anything\", \"anyway\", \"anyways\", \"anywhere\", \"apart\", \"appear\", \"appreciate\", \"appropriate\", \"are\", \"arent\", \"around\", \"as\", \"aside\", \"ask\", \"asking\", \"associated\", \"at\", \"available\", \"away\", \"awfully\", \"b\", \"back\", \"be\", \"became\", \"because\", \"become\", \"becomes\", \"becoming\", \"been\", \"before\", \"beforehand\", \"behind\", \"being\", \"believe\", \"below\", \"beside\", \"besides\", \"best\", \"better\", \"between\", \"beyond\", \"bill\", \"both\", \"bottom\", \"brief\", \"but\", \"by\", \"c\", \"cmon\", \"cs\", \"call\", \"came\", \"can\", \"cant\", \"cannot\", \"cant\", \"cause\", \"causes\", \"certain\", \"certainly\", \"changes\", \"clearly\", \"co\", \"com\", \"come\", \"comes\", \"con\", \"concerning\", \"consequently\", \"consider\", \"considering\", \"contain\", \"containing\", \"contains\", \"corresponding\", \"could\", \"couldnt\", \"couldnt\", \"course\", \"cry\", \"currently\", \"d\", \"de\", \"definitely\", \"describe\", \"described\", \"despite\", \"detail\", 
\"did\", \"didnt\", \"different\", \"do\", \"does\", \"doesnt\", \"doing\", \"dont\", \"done\", \"down\", \"downwards\", \"due\", \"during\", \"e\", \"each\", \"edu\", \"eg\", \"eight\", \"either\", \"eleven\", \"else\", \"elsewhere\", \"empty\", \"enough\", \"entirely\", \"especially\", \"et\", \"etc\", \"even\", \"ever\", \"every\", \"everybody\", \"everyone\", \"everything\", \"everywhere\", \"ex\", \"exactly\", \"example\", \"except\", \"f\", \"far\", \"few\", \"fifteen\", \"fifth\", \"fify\", \"fill\", \"find\", \"fire\", \"first\", \"five\", \"followed\", \"following\", \"follows\", \"for\", \"former\", \"formerly\", \"forth\", \"forty\", \"found\", \"four\", \"from\", \"front\", \"full\", \"further\", \"furthermore\", \"g\", \"get\", \"gets\", \"getting\", \"give\", \"given\", \"gives\", \"go\", \"goes\", \"going\", \"gone\", \"got\", \"gotten\", \"greetings\", \"h\", \"had\", \"hadnt\", \"happens\", \"hardly\", \"has\", \"hasnt\", \"hasnt\", \"have\", \"havent\", \"having\", \"he\", \"hes\", \"hello\", \"help\", \"hence\", \"her\", \"here\", \"heres\", \"hereafter\", \"hereby\", \"herein\", \"hereupon\", \"hers\", \"herself\", \"hi\", \"him\", \"himself\", \"his\", \"hither\", \"hopefully\", \"how\", \"howbeit\", \"however\", \"hundred\", \"i\", \"id\", \"ill\", \"im\", \"ive\", \"ie\", \"if\", \"ignored\", \"immediate\", \"in\", \"inasmuch\", \"inc\", \"indeed\", \"indicate\", \"indicated\", \"indicates\", \"inner\", \"insofar\", \"instead\", \"interest\", \"into\", \"inward\", \"is\", \"isnt\", \"it\", \"itd\", \"itll\", \"its\", \"its\", \"itself\", \"j\", \"just\", \"k\", \"keep\", \"keeps\", \"kept\", \"know\", \"known\", \"knows\", \"l\", \"last\", \"lately\", \"later\", \"latter\", \"latterly\", \"least\", \"less\", \"lest\", \"let\", \"lets\", \"like\", \"liked\", \"likely\", \"little\", \"look\", \"looking\", \"looks\", \"ltd\", \"m\", \"made\", \"mainly\", \"many\", \"may\", \"maybe\", \"me\", \"mean\", \"meanwhile\", \"merely\", \"might\", 
\"mill\", \"mine\", \"more\", \"moreover\", \"most\", \"mostly\", \"move\", \"much\", \"must\", \"my\", \"myself\", \"n\", \"name\", \"namely\", \"nd\", \"near\", \"nearly\", \"necessary\", \"need\", \"needs\", \"neither\", \"never\", \"nevertheless\", \"new\", \"next\", \"nine\", \"no\", \"nobody\", \"non\", \"none\", \"noone\", \"nor\", \"normally\", \"not\", \"nothing\", \"novel\", \"now\", \"nowhere\", \"o\", \"obviously\", \"of\", \"off\", \"often\", \"oh\", \"ok\", \"okay\", \"old\", \"on\", \"once\", \"one\", \"ones\", \"only\", \"onto\", \"or\", \"other\", \"others\", \"otherwise\", \"ought\", \"our\", \"ours\", \"ourselves\", \"out\", \"outside\", \"over\", \"overall\", \"own\", \"p\", \"part\", \"particular\", \"particularly\", \"per\", \"perhaps\", \"placed\", \"please\", \"plus\", \"possible\", \"presumably\", \"probably\", \"provides\", \"put\", \"q\", \"que\", \"quite\", \"qv\", \"r\", \"rather\", \"rd\", \"re\", \"really\", \"reasonably\", \"regarding\", \"regardless\", \"regards\", \"relatively\", \"respectively\", \"right\", \"s\", \"said\", \"same\", \"saw\", \"say\", \"saying\", \"says\", \"second\", \"secondly\", \"see\", \"seeing\", \"seem\", \"seemed\", \"seeming\", \"seems\", \"seen\", \"self\", \"selves\", \"sensible\", \"sent\", \"serious\", \"seriously\", \"seven\", \"several\", \"shall\", \"she\", \"should\", \"shouldnt\", \"show\", \"side\", \"since\", \"sincere\", \"six\", \"sixty\", \"so\", \"some\", \"somebody\", \"somehow\", \"someone\", \"something\", \"sometime\", \"sometimes\", \"somewhat\", \"somewhere\", \"soon\", \"sorry\", \"specified\", \"specify\", \"specifying\", \"still\", \"sub\", \"such\", \"sup\", \"sure\", \"system\", \"t\", \"ts\", \"take\", \"taken\", \"tell\", \"ten\", \"tends\", \"th\", \"than\", \"thank\", \"thanks\", \"thanx\", \"that\", \"thats\", \"thats\", \"the\", \"thea\", \"their\", \"theirs\", \"them\", \"themselves\", \"then\", \"thence\", \"there\", \"theres\", \"thereafter\", \"thereby\", \"therefore\", 
\"therein\", \"theres\", \"thereupon\", \"these\", \"they\", \"theyd\", \"theyll\", \"theyre\", \"theyve\", \"thickv\", \"thin\", \"think\", \"third\", \"this\", \"thorough\", \"thoroughly\", \"those\", \"though\", \"three\", \"through\", \"throughout\", \"thru\", \"thus\", \"to\", \"together\", \"too\", \"took\", \"top\", \"toward\", \"towards\", \"tried\", \"tries\", \"truly\", \"try\", \"trying\", \"twelve\", \"twenty\", \"twice\", \"two\", \"u\", \"un\", \"under\", \"unfortunately\", \"unless\", \"unlikely\", \"until\", \"unto\", \"up\", \"upon\", \"us\", \"use\", \"used\", \"useful\", \"uses\", \"using\", \"usually\", \"uucp\", \"v\", \"value\", \"various\", \"very\", \"via\", \"viz\", \"vs\", \"w\", \"want\", \"wants\", \"was\", \"wasnt\", \"way\", \"we\", \"wed\", \"well\", \"were\", \"weve\", \"welcome\", \"well\", \"went\", \"were\", \"werent\", \"what\", \"whats\", \"whatever\", \"when\", \"whence\", \"whenever\", \"where\", \"wheres\", \"whereafter\", \"whereas\", \"whereby\", \"wherein\", \"whereupon\", \"wherever\", \"whether\", \"which\", \"while\", \"whither\", \"who\", \"whos\", \"whoever\", \"whole\", \"whom\", \"whose\", \"why\", \"will\", \"willing\", \"wish\", \"with\", \"within\", \"without\", \"wont\", \"wonder\", \"would\", \"wouldnt\", \"x\", \"y\", \"yes\", \"yet\", \"you\", \"youd\", \"youll\", \"youre\", \"youve\", \"your\", \"yours\", \"yourself\", \"yourselves\", \"z\", \"zero\")\n}\n\nobject BadWords {\n\n    def apply(input: String): Boolean = badwords.contains(input)\n\n    val badwords = Set(\"ahole\", \"arse\", \"ass\", \"asshole\", \"asswipe\", \"bastard\", \"batty\", \"bender\", \"bitch\", \"bloody\", \"bollocks\", \"boner\", \"bumboy\", \"bugger\", \"coon\", \"cock\", \"cocksucker\", \"cracker\", \"crap\", \"cumsucker\", \"cunt\", \"damn\", \"dick\", \"dildo\", \"douchebag\", \"faggot\", \"fistfucker\", \"fuck\", \"fucker\", \"fuckwit\", \"fucktwat\", \"gaylord\", \"ho\", \"honky\", \"jackass\", \"jism\", \"joey\", 
\"knobcheese\", \"minge\", \"minger\", \"mong\", \"motherfucker\", \"munter\", \"pickle\", \"piss\", \"piss\", \"prick\", \"pussy\", \"rimmer\", \"schmuck\", \"shit\", \"slut\", \"spakka\", \"spaz\", \"skank\", \"taint\", \"tit\", \"tool\", \"tosser\", \"twat\", \"whore\", \"wanker\")\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/scalding/jobs/conjecture/AdHocClassifier.scala",
    "content": "package com.etsy.scalding.jobs.conjecture\n\nimport com.twitter.scalding.{Args, Job, Mode, SequenceFile, Tsv}\nimport com.etsy.conjecture.scalding.evaluate.BinaryCrossValidator\nimport com.etsy.conjecture.scalding.train.BinaryModelTrainer\nimport com.etsy.conjecture.data.{BinaryLabel,BinaryLabeledInstance,StringKeyedVector}\nimport com.etsy.conjecture.model.UpdateableLinearModel\n\nimport com.google.gson.Gson\n\nimport cascading.tuple.Fields\n\nclass AdHocClassifier(args : Args) extends Job(args) {\n\n  val input = args.getOrElse(\"input\", \"specify_an_input_dir\")\n  val out_dir = args.getOrElse(\"out_dir\", \"adhoc_classifier\")\n  val folds = args.getOrElse(\"folds\", \"0\").toInt\n  val problemName = args.getOrElse(\"name\", \"demo_problem\")\n  val xmx = args.getOrElse(\"xmx\", \"3\").toInt\n  val containerMemory = (xmx * 1024 * 1.16).toInt\n\n  // Let the user configure the field names on the command line.\n  val data_field_names = args.getOrElse(\"data_fields\", \"instance\").split(\",\")\n  val data_fields = data_field_names.tail.foldLeft(new Fields(data_field_names.head)) { (x,y) => x.append(new Fields(y)) }\n  val instance_field = Symbol(args.getOrElse(\"instance_field\", \"instance\"))\n\n  // assumes input instances are a sequence file\n  val instances = SequenceFile(input, data_fields).project(instance_field)\n\n  val model_pipe = new BinaryModelTrainer(args)\n    .train(instances, instance_field, 'model)\n\n  model_pipe\n    .write(SequenceFile(out_dir + \"/model\"))\n    .mapTo('model -> 'json) { x : UpdateableLinearModel[BinaryLabel] => new Gson().toJson(x) }\n    .write(Tsv(out_dir + \"/model_json\"))\n\n  if(folds > 0) {\n    val eval_pred = new BinaryCrossValidator(args, folds)\n      .crossValidateWithPredictions(instances, instance_field, 'pred)\n    eval_pred._1\n      .write(Tsv(out_dir + \"/xval\"))\n    eval_pred._2\n      .write(SequenceFile(out_dir + \"/pred\"))\n  }\n\n  override def config = super.config ++\n    
Map(\"mapred.child.java.opts\" -> \"-Xmx%dG\".format(xmx),\n        \"mapreduce.map.memory.mb\" -> containerMemory.toString,\n        \"mapreduce.reduce.memory.mb\" -> containerMemory.toString\n    )\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/scalding/jobs/conjecture/AdHocClusterer.scala",
    "content": "package com.etsy.scalding.jobs.conjecture\n\nimport com.twitter.scalding.{Args, Job, Mode, SequenceFile, Tsv}\nimport com.etsy.conjecture.data.StringKeyedVector\nimport cascading.pipe.Pipe\nimport com.twitter.scalding._\nimport com.etsy.conjecture.data._\nimport com.etsy.conjecture.model._\nimport scala.collection.JavaConversions._\nimport java.io.File\nimport scala.io.Source\n\n/**\n *  Implements kmeans|| as described here: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf\n *  Also includes fast L1 projection step to find sparse cluster centers as described here:\n *  http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf\n *\n *  Usage:\n *    --curr_iter : Set the current iteration.\n *    --num_starting_centers : Number of starting points to select at random to initialize C.\n *    --init_iters : The number of initial iterations to do to find C oversampled centers.\n *    --finish_iters : The number of iterations to cluster the C oversampled centers into\n *                     K starting centers.\n *    --oversampling_factor : The number of points to oversample on each iteration of the\n *                            parallel kmeans initialization, described as a fraction\n *                            of the number of centers.\n *    --kmeans_iters : The number of iterations to cluster the original dataset.\n *    --input : Path on hdfs to the dataset to be clustered. Dataset should be a pipe\n *              of (id_field : String, instance_field : StringKeyedVector).\n *    --out_dir : Path where intermediate data, final cluster centers, and assignments\n *                will be written.\n *    --id_field : Symbol for the id of the point being clustered, (e.g. doc_id).\n *    --instance_field : Symbol for the point being clustered, (e.g. 
document).\n *    --sparsify : Whether or not to enforce cluster center sparsity.\n *    --ball_radius : Radius of ball to project cluster centers on to in l1 projection.\n *                    E.g. 10^-1 == more sparse, 10^2 == less sparse.\n *    --error_tolerance : Error tolerance in the e-accurate l1 projection.\n */\nclass AdHocClustererTest(args: Args) extends Job(args) {\n\n    val curr_iter = args.getOrElse(\"curr_iter\",\"0\").toInt\n    val num_starting_centers = args.getOrElse(\"num_starting_centers\",\"10\").toInt\n    val init_iters = args.getOrElse(\"init_iters\",\"5\").toInt\n    val finish_iters = args.getOrElse(\"finish_iters\",\"5\").toInt\n    val oversampling_factor = args.getOrElse(\"oversampling_factor\",\"1.0\").toDouble\n    val kmeans_iters = args.getOrElse(\"kmeans_iters\",\"5\").toInt\n\n    val input = args.getOrElse(\"input\", \"specify_an_input_dir\")\n    val out_dir = args.getOrElse(\"out_dir\", \"specify_an_output_dir\")+\"/\"\n    val id_field = Symbol(args.getOrElse(\"id_field\", \"id\"))\n    val instance_field = Symbol(args.getOrElse(\"instance_field\", \"instance\"))\n    val xmx = args.getOrElse(\"xmx\", \"3\").toInt\n    val containerMemory = (xmx * 1024 * 1.16).toInt\n    \n    val max_finish_iters = init_iters + finish_iters\n    val total_iter = init_iters + finish_iters + kmeans_iters\n\n    /*\n     * Number of clusters to build\n     */\n    val num_clusters = args.getOrElse(\"num_clusters\",\"100\").toInt\n\n    /*\n     * Number of centers to oversample in the initialization phase\n     */\n    val take_per_round = math.floor(num_clusters * oversampling_factor).toInt\n\n    /*\n     * Whether or not to enforce cluster center sparsity via l1 projection\n     */\n    val sparsify = args.getOrElse(\"sparsify\",\"true\").toBoolean\n\n    /*\n     * Error tolerance for the l1 projection\n     */\n    val error_tolerance = args.getOrElse(\"error_tolerance\",\"0.01\").toDouble\n\n    /*\n     * Ball radius for the l1 
projection\n     */\n    val ball_radius = args.getOrElse(\"ball_radius\",\"10.0\").toDouble\n\n    /**\n     * Read in the pipe of data to be clustered\n     */\n    lazy val instances = SequenceFile(input, (id_field, instance_field)).read\n\n    /**\n     * Define centers based on the current iteration\n     */\n    var centers : Pipe = if(curr_iter == 0){\n      /**\n       * First iteration: Select some starting centers at random from the dataset\n       */\n      instances\n        .map(instance_field -> 'rand){ i : StringKeyedVector => math.random}\n        .groupAll{_.sortWithTake[(Double, StringKeyedVector)](('rand, instance_field) -> 'list, num_starting_centers){(a, b) => a._1 > b._1}}\n        .map('list -> 'centers){ l : List[(Double,StringKeyedVector)] => l.map(i => i._2) }\n        .project('centers)\n    } else {\n      /**\n       *  If curr_iter <= init_iters, do kmeans|| iterations.\n       *\n       *  If init_iters < curr_iter <= max_finish_iters, cluster the oversampled set of initial centers \n       *  into num_clusters true initial centers.\n       *\n       *  If max_finish_iter < curr_iter <= total_iter, cluster the initial dataset using \n       *  the clusters obtained from previous steps as initial centers.\n       */\n      SequenceFile(out_dir+\"iter_\"+(curr_iter - 1)+\"/centers\", ('centers)).read\n    }\n\n    lazy val oversampled_cluster_centers = SequenceFile(out_dir+\"iter_\"+(init_iters - 1)+\"/centers\", ('centers)).read\n      .flattenTo[StringKeyedVector]('centers -> 'center)\n      .rename('center -> instance_field)\n\n    val new_centers = if (curr_iter < init_iters) {\n      /** Over sample (oversampling_factor * num_clusters) factors **/\n      kmeansPlusPlusIter(instances, centers, take_per_round)\n    } else if (curr_iter == init_iters) {\n      /** Get ready to cluster oversampled factors by doing a kmeans++ pass over them **/\n      val init_final_centers = kmeansPlusPlusReclusterInit(centers, num_clusters)\n      
kmeansIter(oversampled_cluster_centers, init_final_centers, num_clusters, instance_field, curr_iter)\n    } else if (curr_iter <= max_finish_iters) {\n      /** Recluster oversampled centers into num_clusters final centers **/\n      kmeansIter(oversampled_cluster_centers, centers, num_clusters, instance_field, curr_iter)\n    } else {\n      /** Cluster the original dataset **/\n      kmeansIter(instances, centers, num_clusters, instance_field, curr_iter)\n    }\n\n    /** \n     *  If it's the last iteration, flatten the centers and \n     *  write them out; optionally generate cluster assignments\n     *  for each instance. Else, write the centers map out at \n     *  the end of each iteration.\n     */\n    if(curr_iter == total_iter){\n      new_centers\n      .flattenTo[StringKeyedVector]('centers -> 'center)\n      .project('center)\n      .write(SequenceFile(out_dir+\"centers\"))\n\n      if(args.boolean(\"generate_assignments\")) {\n        instances\n        .crossWithTiny(new_centers)\n        .map((instance_field, 'centers) -> 'cluster_assignment){ i : (StringKeyedVector, Map[String, StringKeyedVector]) => assignCluster(i._1, i._2) }\n        .project(id_field, 'cluster_assignment)\n        .write(SequenceFile(out_dir+\"assignments\"))\n      }\n    } else {\n      new_centers.write(SequenceFile(out_dir+\"iter_\"+curr_iter+\"/centers\"))\n    }\n\n    /**\n     *  Kmeans|| initialization:\n     *    C <- sample some points uniformly at random from instances\n     *    For init_iters:\n     *      C' <- top take_per_round points in instances by distance to current centers C\n     *      C  <- union(C,C')\n     */\n    def kmeansPlusPlusIter(instances : Pipe, centers : Pipe, take_per_round : Int) : Pipe = {\n      /** Get each point's distance to its nearest cluster center **/\n      val closest_distances = instances\n        .crossWithTiny(centers)\n        .map((instance_field, 'centers) -> 'closest_distance){ in : (StringKeyedVector, 
List[StringKeyedVector]) => distanceToClosestCenter(in._1, in._2) }\n\n      /** Sum all closest distances into a normalizer **/\n      val normalizer = closest_distances\n        .groupAll{ _.sum[Double]('closest_distance -> 'denominator) }\n    \n      /** \n       * Normalize each point's distance to its nearest cluster center to a probability. \n       * Take the top take_per_round points, in descending order of distance, as new centers.\n       */\n      val top_by_distance = closest_distances\n        .crossWithTiny(normalizer)\n        .map(('closest_distance, 'denominator) -> 'normalized_distance){ i : (Double, Double) => i._1 / i._2 }\n        .groupAll{ _.sortWithTake[(Double, StringKeyedVector)](('normalized_distance, instance_field) -> 'top_by_distance, take_per_round){(a, b) => a._1 > b._1} }\n        .flattenTo[(Double, StringKeyedVector)]('top_by_distance -> ('distance, instance_field))\n        .project(instance_field)\n    \n      /** Union the set of new centers and old centers **/\n      val new_centers = ((top_by_distance.rename(instance_field -> 'center)) ++ (centers.flattenTo[StringKeyedVector]('centers -> 'center)))\n        .groupAll{ _.toList[StringKeyedVector]('center -> 'centers) }\n      new_centers\n    }\n\n    /**\n     *  Recluster the points in C into the final num_clusters kmeans|| centers.\n     */\n    def kmeansPlusPlusReclusterInit(centers : Pipe, num_clusters : Int) : Pipe = {  \n      centers\n        .map('centers -> 'centers){ data : List[StringKeyedVector] =>\n          val rand_idx = scala.util.Random.nextInt(data.size)\n          val starting_C = data(rand_idx)\n          var starting_centers = List(starting_C)\n          val init_centers = kmeansPlusPlusInit(num_clusters, data, starting_centers)     \n          init_centers.zipWithIndex.map(i=> (i._2.toString,i._1)).toMap\n        }\n    }\n\n    /**\n     *  Runs one kmeans iteration: takes a pipe of points to cluster and a pipe of\n     *  grouped cluster centers, and returns a pipe holding the updated centers.\n     */\n    def kmeansIter(data : Pipe, centers : Pipe, K : Int, 
point_sym : Symbol, iter : Int) : Pipe = {\n      val data_with_centers = data.crossWithTiny(centers)\n  \n      val cluster_assignments = data_with_centers\n        .map((point_sym, 'centers) -> 'assignment){\n          fields : (StringKeyedVector, Map[String,StringKeyedVector]) =>\n          val (point, centers) = fields\n          assignCluster(point, centers) \n        }\n        .project(point_sym, 'assignment)\n\n      val grouped = cluster_assignments\n        .groupBy('assignment){ _.size('denom).reduce[StringKeyedVector](point_sym){(a, b) => a.add(b); a} }\n        .map((point_sym, 'denom) -> 'cluster){\n          fields : (StringKeyedVector, Double) =>\n          var (centroid, denom) = fields\n          centroid.mul(1.0/denom)\n          // l1Projection returns a new vector when it projects, so use its result.\n          if(sparsify){\n            l1Projection(centroid, error_tolerance, ball_radius)\n          } else {\n            centroid\n          }\n        }\n        .project('assignment, 'cluster)\n\n      val debug = grouped\n        .map('cluster -> 'top){ i : StringKeyedVector => i.getMap().toList.sortBy(_._2).reverse.take(100).map(i => i._1).mkString(\" \") }\n        .project('assignment, 'top)\n        .write(SequenceFile(out_dir+\"debug/iter_\"+iter+\"_top_terms\"))\n  \n      grouped\n      .groupAll{ _.toList[(String,StringKeyedVector)](('assignment, 'cluster) -> 'centers) }\n      .map('centers -> 'centers){ l : List[(String, StringKeyedVector)] => l.toMap }\n    }\n\n    /**\n     *  Generates initial centers for kmeans clustering to speed up convergence.\n     *  See more here: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf\n     */\n    def kmeansPlusPlusInit(iters : Int, data : List[StringKeyedVector], centers : List[StringKeyedVector]) : List[StringKeyedVector] = {\n      var new_centers = centers\n      var temp_data = data\n      (0 until iters).foreach{ iter =>\n        // Measure distance to the centers chosen so far, not just the initial ones.\n        val dists = temp_data.map(i => (i, distanceToClosestCenter(i, new_centers)))\n        val norm = dists.map(i => i._2).sum\n        val x = dists.map(i => (i._1, 
i._2/norm)).sortBy(_._2).reverse.map(i=>i._1).take(1)(0)\n        new_centers = (new_centers ++ List(x)).toSet.toList\n        temp_data = temp_data.filter(i => i != x)\n      }\n      new_centers\n    }\n\n    /**\n     *  Returns the cosine distance between a point and its closest center.\n     */\n    def distanceToClosestCenter(point : StringKeyedVector, centers : List[StringKeyedVector]) : Double = {\n      centers.map(center => computeDistance(point, center)).min\n    }\n\n    /**\n     *  Computes the cosine distance between a point and a center.\n     */\n    def computeDistance(point : StringKeyedVector, center : StringKeyedVector) : Double = {\n      val dot_product = point.dot(center)\n      val point_magnitude = point.LPNorm(2.0)\n      val center_magnitude = center.LPNorm(2.0)\n      1.0 - (dot_product/(point_magnitude*center_magnitude))\n    }\n\n    /**\n     *  Assign a point to its nearest cluster center by cosine distance.\n     */\n    def assignCluster(point : StringKeyedVector, centers : Map[String,StringKeyedVector]) : String = {\n      val distances = centers.toList.map(i => (i._1, computeDistance(point, i._2)))\n      distances.minBy{_._2}._1\n    }\n\n    /**\n     *  e-Accurate Projection to L1 ball for sparse cluster centers\n     */\n    def l1Projection(center : StringKeyedVector, e : Double = 0.01, lambda : Double = 1.0) : StringKeyedVector = {\n      val l1Norm = center.LPNorm(1.0)\n      if (l1Norm <= lambda + e) {\n        center\n      } else {\n        var upper = center.max()\n        var lower = 0.0\n        var current = l1Norm\n        var theta = 0.0\n        while (current > lambda*(1+e) || current < lambda) {\n          theta = (upper + lower) / 2.0\n          current = center.values().map(i => math.max(0.0, math.abs(i)-theta)).sum\n          if(current <= lambda){\n            upper = theta\n          } else {\n            lower = theta\n          }\n        }\n        var sparse_center = new StringKeyedVector()\n        
center.getMap()\n        .map(i => (i._1, math.signum(i._2) * math.max(0.0, math.abs(i._2) - theta)))\n        .filter(i => i._2 != 0.0)\n        .foreach{ i => sparse_center.setCoordinate(i._1, i._2)}\n        sparse_center\n      }\n    }\n\n    override def next : Option[Job] = { \n      val new_args = args + (\"curr_iter\", Some((curr_iter+1).toString))\n      if(curr_iter < total_iter) {\n        Some(clone(new_args))\n      } else {\n        None\n      }   \n    }\n\n    override def config = super.config ++\n      Map(\"mapred.child.java.opts\" -> \"-Xmx%dG\".format(xmx),\n        \"mapreduce.map.memory.mb\" -> containerMemory.toString,\n        \"mapreduce.reduce.memory.mb\" -> containerMemory.toString\n      )\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/scalding/jobs/conjecture/AdHocMulticlassClassifier.scala",
    "content": "package com.etsy.scalding.jobs.conjecture\n\nimport com.twitter.scalding.{Args, Job, Mode, SequenceFile, Tsv}\nimport com.etsy.conjecture.scalding.evaluate.MulticlassCrossValidator\nimport com.etsy.conjecture.scalding.train.MulticlassModelTrainer\nimport com.etsy.conjecture.data.{MulticlassLabeledInstance, StringKeyedVector}\nimport com.etsy.conjecture.model.UpdateableMulticlassLinearModel\n\nimport com.google.gson.Gson\n\nimport cascading.tuple.Fields\n\nclass AdHocMulticlassClassifier(args : Args) extends Job(args) {\n\n  val input = args(\"input\")\n  val out_dir = args(\"out_dir\")\n  val folds = args.getOrElse(\"folds\", \"0\").toInt\n  val categories = args(\"categories\").split(\",\").toArray\n\n  val xmx = args.getOrElse(\"xmx\", \"3\").toInt\n  val containerMemory = (xmx * 1024 * 1.16).toInt\n\n  // Let the user configure the field names on the command line.\n  val data_field_names = args.getOrElse(\"data_fields\", \"instance\").split(\",\")\n  val data_fields = data_field_names.tail.foldLeft(new Fields(data_field_names.head)) { (x,y) => x.append(new Fields(y)) }\n  val instance_field = Symbol(args.getOrElse(\"instance_field\", \"instance\"))\n\n  // assumes input instances are a sequence file\n  val instances = SequenceFile(input, data_fields).project(instance_field)\n\n  val model_pipe = new MulticlassModelTrainer(args, categories)\n    .train(instances, instance_field, 'model)\n\n  model_pipe\n    .write(SequenceFile(out_dir + \"/model\"))\n    .mapTo('model -> 'json) { x : UpdateableMulticlassLinearModel => new Gson().toJson(x) }\n    .write(Tsv(out_dir + \"/model_json\"))\n\n  if(folds > 0) {\n    val eval_pred = new MulticlassCrossValidator(args, folds, categories)\n      .crossValidateWithPredictions(instances, instance_field, 'pred)\n    eval_pred._1\n      .write(Tsv(out_dir + \"/xval\"))\n    eval_pred._2\n      .write(SequenceFile(out_dir + \"/pred\"))\n  }\n\n  override def config = super.config ++\n    
Map(\"mapred.child.java.opts\" -> \"-Xmx%dG\".format(xmx),\n        \"mapreduce.map.memory.mb\" -> containerMemory.toString,\n        \"mapreduce.reduce.memory.mb\" -> containerMemory.toString)\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/scalding/jobs/conjecture/AdHocPredictor.scala",
    "content": "package com.etsy.scalding.jobs.conjecture\n\nimport com.twitter.scalding.{Args, Job, Mode, SequenceFile, Tsv}\nimport com.etsy.conjecture.scalding.evaluate.BinaryEvaluator\nimport com.etsy.conjecture.data.{BinaryLabeledInstance, BinaryLabel}\nimport com.etsy.conjecture.model.UpdateableLinearModel\n\nimport com.google.gson.Gson\n\nimport cascading.tuple.Fields\n\nclass AdHocPredictor(args : Args) extends Job(args) {\n\n  val input = args.getOrElse(\"input\", \"specify_an_input_dir\")\n  val out_dir = args.getOrElse(\"out_dir\", \"adhoc_classifier\")\n  val model = args.getOrElse(\"model\", \"specify a model\")\n  val problemName = args.getOrElse(\"name\", \"demo_problem\")\n  val xmx = args.getOrElse(\"xmx\", \"3\").toInt\n  val skipFinalSort = args.boolean(\"skip_final_sort\")\n  val containerMemory = (xmx * 1024 * 1.16).toInt\n\n  // Let the user configure the field names on the command line.\n  val data_field_names = args.getOrElse(\"data_fields\", \"instance\").split(\",\")\n  val data_fields = data_field_names.tail.foldLeft(new Fields(data_field_names.head)) { (x,y) => x.append(new Fields(y)) }\n  val model_field = new Fields(args.getOrElse(\"model_field\", \"model\"))\n  val instance_field = new Fields(args.getOrElse(\"instance_field\", \"instance\"))\n\n  val instances = SequenceFile(input, data_fields).read.project(instance_field)\n\n  val model_pipe = SequenceFile(model, model_field).read\n\n  val predictions = instances.crossWithTiny(model_pipe)\n    .map((model_field, instance_field) -> ('pred, 'explain)) {\n        x : (UpdateableLinearModel[BinaryLabel], BinaryLabeledInstance) =>\n        (x._1.predict(x._2.getVector), x._1.explainPrediction(x._2.getVector))\n    }\n    .discard(model_field)\n    .map(instance_field -> 'supporting_data) { x : BinaryLabeledInstance => x.getSupportingData() }\n    .project('supporting_data, 'pred)\n    .map('pred -> 'pred) { in : BinaryLabel => in.getValue() }\n\n  val output = if (skipFinalSort)\n    
predictions\n  else\n    predictions.groupAll { _.sortBy('pred).reverse }\n\n  output.write(SequenceFile(out_dir + \"/pred\"))\n\n  override def config = super.config ++\n    Map(\"mapred.child.java.opts\" -> \"-Xmx%dG\".format(xmx),\n        \"mapreduce.map.memory.mb\" -> containerMemory.toString,\n        \"mapreduce.reduce.memory.mb\" -> containerMemory.toString\n    )\n\n}\n"
  },
  {
    "path": "src/main/scala/com/etsy/scalding/jobs/conjecture/NNMFTest.scala",
    "content": "package com.etsy.scalding.jobs.conjecture\n\nimport com.etsy.conjecture.scalding.NNMF\nimport com.twitter.scalding.{Args, Job, Tsv, SequenceFile}\nimport org.apache.commons.math3.linear.RealVector\n\n/*\n * Job to do NNMF of the supplied matrix, given via the arg \"A\"\n * \"alpha\" is the extra weight given to non-zero entries.\n */\nclass NNMFTest(args : Args) extends Job(args) {\n\n  val iter = args.getOrElse(\"iter\", \"0\").toInt\n  val iters = args.getOrElse(\"iters\", \"20\").toInt\n  val base_dir = args.getOrElse(\"base_dir\", \"nnmf_test\")\n  val A_path = args.getOrElse(\"A\", \"critics.tsv\")\n  val alpha = args.getOrElse(\"alpha\", \"0.0\").toDouble\n  \n  val A = Tsv(A_path, ('row, 'col, 'val))\n    .map('val -> 'val){v : String => v.toDouble}\n\n  val HW = if(iter == 0) {\n    // just initialize\n    NNMF.initGaussian(A, 10)\n  } else {\n    // Last iterations output.\n    (SequenceFile(base_dir + \"/H/\" + (iter-1), ('row, 'vec, 'bias)).read,\n     SequenceFile(base_dir + \"/W/\" + (iter-1), ('col, 'vec, 'bias)).read)\n  }\n  \n  val HW_ = NNMF.updateGaussianWeighted(A, HW._1, HW._2, alpha)\n  \n  HW_._1.write(SequenceFile(base_dir + \"/H/\" + iter))\n  HW_._2.write(SequenceFile(base_dir + \"/W/\" + iter))\n\n  HW._1.crossWithSmaller(HW._2.rename('vec -> 'vec2).rename('bias -> 'bias2))\n    .map(('vec, 'vec2, 'bias, 'bias2) -> 'pred){x : (RealVector, RealVector, Double, Double) => x._1.dotProduct(x._2) + x._3 + x._4}\n    .project('row, 'col, 'pred)\n    .joinWithSmaller(('row, 'col) -> ('row_, 'col_), A.rename(('row, 'col) -> ('row_, 'col_)), new cascading.pipe.joiner.OuterJoin())\n    .mapTo(('val, 'pred) -> 'err){x : (Double, Double) => val d = x._1 - x._2; (if(x._1 == 0.0) 1.0 else (1.0 + alpha)) * d * d}\n    .groupAll{_.average('err)}\n    .write(Tsv(base_dir+\"/err/\"+iter))\n\n  // Start more iterations possibly.\n  override def next : Option[Job] = {\n    val new_args = args + ((\"iter\", Some((iter+1).toString)))\n    
if(iter < iters - 1) {\n      Some(clone(new_args))\n    } else {\n      None\n    }\n  }\n\n}\n"
  },
  {
    "path": "src/test/java/com/etsy/conjecture/data/LazyVectorTest.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.io.ByteArrayInputStream;\nimport java.io.ByteArrayOutputStream;\nimport java.io.ObjectInputStream;\nimport java.io.ObjectOutputStream;\n\nimport com.esotericsoftware.kryo.Kryo;\nimport com.esotericsoftware.kryo.io.Input;\nimport com.esotericsoftware.kryo.io.Output;\n\nimport com.google.gson.Gson;\n\nimport static org.junit.Assert.assertEquals;\nimport static org.junit.Assert.assertTrue;\nimport static org.junit.Assert.assertFalse;\n\nimport org.junit.Test;\n\npublic class LazyVectorTest {\n\n    final double eps = 0.000001;\n\n    // Update function to use for testing.\n    // Decay the parameters over time.\n    final static LazyVector.UpdateFunction uf = new LazyVector.UpdateFunction() {\n\n        private static final long serialVersionUID = 1019666879466468375L;\n        public double lazyUpdate(String k, double p, long a, long b) {\n            return p * Math.pow(0.9, b - a);\n        }\n    };\n\n    // Build an SKV in a way which exercises a bunch of different code.\n    public LazyVector buildLV() {\n        LazyVector lv = new LazyVector(uf);\n        lv.setCoordinate(\"foo\", 1.0);\n        lv.addToCoordinate(\"bar\", -2.0);\n        lv.addToCoordinate(\"baz\", 0.0);\n        lv.setCoordinate(\"dave\", 5.0);\n        lv.deleteCoordinate(\"dave\");\n        return lv;\n    }\n\n    /**\n     * Basic testing of coordinate getting and setting.\n     */\n    @Test\n    public void testCoordinates() {\n        LazyVector lv = buildLV();\n        assertEquals(2, lv.size());\n        assertEquals(1.0, lv.getCoordinate(\"foo\"), eps);\n        assertEquals(-2.0, lv.getCoordinate(\"bar\"), eps);\n        assertEquals(0.0, lv.getCoordinate(\"baz\"), eps);\n        assertEquals(0.0, lv.getCoordinate(\"dave\"), eps);\n        assertEquals(0.0, lv.getCoordinate(\"test\"), eps);\n    }\n\n    /**\n     * Basic testing of lazy updating.\n     */\n    @Test\n    public void testCoordinatesLazy() {\n        
LazyVector lv = buildLV();\n        lv.incrementIteration();\n        assertEquals(2, lv.size());\n        assertEquals(0.9, lv.getCoordinate(\"foo\"), eps);\n        assertEquals(-1.8, lv.getCoordinate(\"bar\"), eps);\n        assertEquals(0.0, lv.getCoordinate(\"baz\"), eps);\n        assertEquals(0.0, lv.getCoordinate(\"dave\"), eps);\n        assertEquals(0.0, lv.getCoordinate(\"test\"), eps);\n        lv.setCoordinate(\"bar\", 2.0);\n        lv.incrementIteration();\n        assertEquals(2, lv.size());\n        assertEquals(0.81, lv.getCoordinate(\"foo\"), eps);\n        assertEquals(1.8, lv.getCoordinate(\"bar\"), eps);\n    }\n\n    /**\n     * Test addScaled.\n     */\n    @Test\n    public void testAddScaledToSKV() {\n        LazyVector lv = buildLV();\n        StringKeyedVector accum = new StringKeyedVector();\n        accum.addScaled(lv, 2.0);\n        assertEquals(2, accum.size());\n        assertEquals(2.0, accum.getCoordinate(\"foo\"), eps);\n        assertEquals(-4.0, accum.getCoordinate(\"bar\"), eps);\n        lv.incrementIteration();\n        accum.addScaled(lv, -2.0);\n        assertEquals(2, accum.size());\n        assertEquals(0.2, accum.getCoordinate(\"foo\"), eps);\n        assertEquals(-0.4, accum.getCoordinate(\"bar\"), eps);\n    }\n\n    /**\n     * Test addScaled.\n     */\n    @Test\n    public void testAddScaledToLV() {\n        LazyVector lv = buildLV();\n        LazyVector accum = new LazyVector(uf);\n        accum.setCoordinate(\"foo\", 10.0);\n        accum.incrementIteration();\n        accum.incrementIteration();\n        accum.incrementIteration(); // foo is now 7.29\n        accum.addScaled(lv, 2.0);\n        assertEquals(2, accum.size());\n        assertEquals(9.29, accum.getCoordinate(\"foo\"), eps);\n        assertEquals(-4.0, accum.getCoordinate(\"bar\"), eps);\n        lv.incrementIteration(); // foo is now 0.9\n        accum.incrementIteration(); // foo is now 8.361\n        accum.addScaled(lv, -2.0);\n        
assertEquals(1, accum.size());\n        assertEquals(6.561, accum.getCoordinate(\"foo\"), eps);\n    }\n\n    /**\n     * Test addScaled.\n     */\n    @Test\n    public void testAddScaledToSelf() {\n        LazyVector lv = buildLV();\n        lv.incrementIteration();\n        lv.incrementIteration();\n        lv.addScaled(lv, 1.0);\n        assertEquals(2, lv.size());\n        assertEquals(1.0 * 0.81 * 2, lv.getCoordinate(\"foo\"), eps);\n        assertEquals(-2.0 * 0.81 * 2, lv.getCoordinate(\"bar\"), eps);\n    }\n\n    /**\n     * Test addScaled.\n     */\n    @Test\n    public void testAddScaledSKVToLV() {\n        LazyVector accum = new LazyVector(uf);\n        StringKeyedVector skv = new StringKeyedVector();\n        skv.setCoordinate(\"foo\", 1.0);\n        skv.setCoordinate(\"bar\", 5.0);\n        accum.addScaled(skv, 2.0);\n        assertEquals(2, accum.size());\n        assertEquals(2.0, accum.getCoordinate(\"foo\"), eps);\n        assertEquals(10.0, accum.getCoordinate(\"bar\"), eps);\n        accum.incrementIteration();\n        accum.incrementIteration(); // foo: 1.62, bar: 8.10\n        accum.addScaled(skv, -1.0);\n        assertEquals(2, accum.size());\n        assertEquals(0.62, accum.getCoordinate(\"foo\"), eps);\n        assertEquals(3.10, accum.getCoordinate(\"bar\"), eps);\n    }\n\n    /**\n     * Test the dot product.\n     */\n    @Test\n    public void testDotProduct() {\n        LazyVector skv = buildLV();\n        StringKeyedVector skv2 = new StringKeyedVector(skv);\n        assertEquals(5.0, skv.dot(skv), eps);\n        skv.incrementIteration();\n        assertEquals(5.0 * 0.81, skv.dot(skv), eps);\n        skv2.addToCoordinate(\"baz\", -10.0);\n        assertEquals(5.0 * 0.9, skv.dot(skv2), eps);\n    }\n\n    /**\n     * Test freezing the keys.\n     */\n    @Test\n    public void testFreezing() {\n        LazyVector skv = buildLV();\n        skv.setFreezeKeySet(true);\n        skv.addToCoordinate(\"fake\", 1.0);\n        
assertEquals(2, skv.size());\n        skv.setCoordinate(\"fake2\", 2.0);\n        assertEquals(2, skv.size());\n        skv.setFreezeKeySet(false);\n        skv.setCoordinate(\"fake2\", 2.0);\n        assertEquals(3, skv.size());\n    }\n\n    /**\n     * Test java serialization.\n     */\n    @Test\n    public void testJavaSerialization() throws Exception {\n        LazyVector skv = buildLV();\n        skv.incrementIteration();\n        // Serialize to a byte array in ram.\n        ByteArrayOutputStream bos = new ByteArrayOutputStream();\n        ObjectOutputStream oos = new ObjectOutputStream(bos);\n        oos.writeObject(skv);\n        oos.flush();\n        // Deserialize.\n        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());\n        ObjectInputStream ois = new ObjectInputStream(bis);\n        LazyVector des = (LazyVector)ois.readObject();\n        assertFalse(des.getFreezeKeySet());\n        assertEquals(2, des.size());\n        assertEquals(0.9, des.getCoordinate(\"foo\"), eps);\n        assertEquals(-1.8, des.getCoordinate(\"bar\"), eps);\n        des.incrementIteration();\n        assertEquals(2, des.size());\n        assertEquals(0.81, des.getCoordinate(\"foo\"), eps);\n        assertEquals(-1.62, des.getCoordinate(\"bar\"), eps);\n    }\n\n    /**\n     * Test kryo serialization.\n     */\n    @Test\n    public void testKryoSerialization() throws Exception {\n        LazyVector skv = buildLV();\n        skv.incrementIteration();\n        // Serialize to a byte array in ram.\n        ByteArrayOutputStream bos = new ByteArrayOutputStream();\n        Output ko = new Output(bos);\n        Kryo kry = new Kryo();\n        kry.writeObject(ko, skv);\n        ko.flush();\n        // Deserialize.\n        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());\n        Input ki = new Input(bis);\n        LazyVector des = (LazyVector)kry.readObject(ki, LazyVector.class);\n        assertFalse(des.getFreezeKeySet());\n       
 assertEquals(2, des.size());\n        assertEquals(0.9, des.getCoordinate(\"foo\"), eps);\n        assertEquals(-1.8, des.getCoordinate(\"bar\"), eps);\n        des.incrementIteration();\n        assertEquals(2, des.size());\n        assertEquals(0.81, des.getCoordinate(\"foo\"), eps);\n        assertEquals(-1.62, des.getCoordinate(\"bar\"), eps);\n    }\n\n    /**\n     * Make sure Gson serializes this thing properly.\n     */\n    @Test\n    public void testGson() {\n        Gson gson = new Gson();\n        String json = gson.toJson(buildLV());\n        String vector1 = \"\\\"vector\\\":{\\\"foo\\\":1.0,\\\"bar\\\":-2.0}\";\n        String vector2 = \"\\\"vector\\\":{\\\"bar\\\":-2.0,\\\"foo\\\":1.0}\";\n        String fks = \"\\\"freezeKeySet\\\":false\";\n        assertTrue(json.contains(vector1) || json.contains(vector2));\n        assertTrue(json.contains(fks));\n        assertFalse(json.contains(\"iterations\"));\n    }\n\n}\n"
  },
  {
    "path": "src/test/java/com/etsy/conjecture/data/StringKeyedVectorTest.java",
    "content": "package com.etsy.conjecture.data;\n\nimport java.io.ByteArrayInputStream;\nimport java.io.ByteArrayOutputStream;\nimport java.io.ObjectInputStream;\nimport java.io.ObjectOutputStream;\n\nimport com.esotericsoftware.kryo.Kryo;\nimport com.esotericsoftware.kryo.io.Input;\nimport com.esotericsoftware.kryo.io.Output;\n\nimport com.google.gson.Gson;\n\nimport static org.junit.Assert.assertEquals;\nimport static org.junit.Assert.assertTrue;\nimport static org.junit.Assert.assertFalse;\n\nimport org.junit.Test;\n\npublic class StringKeyedVectorTest {\n\n    final double eps = 0.000001;\n\n    // Build an SKV in a way which exercises a bunch of different code.\n    public StringKeyedVector buildSKV() {\n        StringKeyedVector skv = new StringKeyedVector();\n        skv.setCoordinate(\"foo\", 1.0);\n        skv.addToCoordinate(\"bar\", -2.0);\n        skv.addToCoordinate(\"baz\", 0.0);\n        skv.setCoordinate(\"dave\", 5.0);\n        skv.deleteCoordinate(\"dave\");\n        return skv;\n    }\n\n    /**\n     * Basic testing of coordinate getting and setting.\n     */\n    @Test\n    public void testCoordinates() {\n        StringKeyedVector skv = buildSKV();\n        assertEquals(2, skv.size());\n        assertEquals(1.0, skv.getCoordinate(\"foo\"), eps);\n        assertEquals(-2.0, skv.getCoordinate(\"bar\"), eps);\n        assertEquals(0.0, skv.getCoordinate(\"baz\"), eps);\n        assertEquals(0.0, skv.getCoordinate(\"dave\"), eps);\n        assertEquals(0.0, skv.getCoordinate(\"test\"), eps);\n    }\n\n    /**\n     * Test addScaled.\n     */\n    @Test\n    public void testAddScaled() {\n        StringKeyedVector skv = buildSKV();\n        StringKeyedVector accum = new StringKeyedVector();\n        skv.addScaled(accum, 1.0);\n        accum.addScaled(skv, 2.0);\n        assertEquals(2, accum.size());\n        assertEquals(2.0, accum.getCoordinate(\"foo\"), eps);\n        assertEquals(-4.0, accum.getCoordinate(\"bar\"), eps);\n        
accum.addScaled(skv, -2.0);\n        assertEquals(0, accum.size());\n    }\n\n    /**\n     * Test the dot product.\n     */\n    @Test\n    public void testDotProduct() {\n        StringKeyedVector skv = buildSKV();\n        assertEquals(5.0, skv.dot(skv), eps);\n        StringKeyedVector skv2 = new StringKeyedVector(skv);\n        skv2.addToCoordinate(\"baz\", -10.0);\n        assertEquals(5.0, skv.dot(skv2), eps);\n        assertEquals(105.0, skv2.dot(skv2), eps);\n    }\n\n    /**\n     * Test freezing the keys.\n     */\n    @Test\n    public void testFreezing() {\n        StringKeyedVector skv = buildSKV();\n        skv.setFreezeKeySet(true);\n        skv.addToCoordinate(\"fake\", 1.0);\n        assertEquals(2, skv.size());\n        skv.setCoordinate(\"fake2\", 2.0);\n        assertEquals(2, skv.size());\n        skv.setFreezeKeySet(false);\n        skv.setCoordinate(\"fake2\", 2.0);\n        assertEquals(3, skv.size());\n    }\n\n    /**\n     * Test java serialization.\n     */\n    @Test\n    public void testJavaSerialization() throws Exception {\n        StringKeyedVector skv = buildSKV();\n        // Serialize to a byte array in ram.\n        ByteArrayOutputStream bos = new ByteArrayOutputStream();\n        ObjectOutputStream oos = new ObjectOutputStream(bos);\n        oos.writeObject(skv);\n        oos.flush();\n        // Deserialize.\n        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());\n        ObjectInputStream ois = new ObjectInputStream(bis);\n        StringKeyedVector des = (StringKeyedVector)ois.readObject();\n        assertFalse(des.getFreezeKeySet());\n        assertEquals(2, des.size());\n        assertEquals(1.0, des.getCoordinate(\"foo\"), eps);\n        assertEquals(-2.0, des.getCoordinate(\"bar\"), eps);\n        assertEquals(0.0, des.getCoordinate(\"baz\"), eps);\n        assertEquals(0.0, des.getCoordinate(\"dave\"), eps);\n        assertEquals(0.0, des.getCoordinate(\"test\"), eps);\n    }\n\n    /**\n     * 
Test kryo serialization.\n     */\n    @Test\n    public void testKryoSerialization() throws Exception {\n        StringKeyedVector skv = buildSKV();\n        // Serialize to a byte array in ram.\n        ByteArrayOutputStream bos = new ByteArrayOutputStream();\n        Output ko = new Output(bos);\n        Kryo kry = new Kryo();\n        kry.writeObject(ko, skv);\n        ko.flush();\n        // Deserialize.\n        ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());\n        Input ki = new Input(bis);\n        StringKeyedVector des = (StringKeyedVector)kry.readObject(ki,\n                StringKeyedVector.class);\n        assertFalse(des.getFreezeKeySet());\n        assertEquals(2, des.size());\n        assertEquals(1.0, des.getCoordinate(\"foo\"), eps);\n        assertEquals(-2.0, des.getCoordinate(\"bar\"), eps);\n        assertEquals(0.0, des.getCoordinate(\"baz\"), eps);\n        assertEquals(0.0, des.getCoordinate(\"dave\"), eps);\n        assertEquals(0.0, des.getCoordinate(\"test\"), eps);\n    }\n\n    /**\n     * Make sure Gson serializes this thing properly.\n     */\n    @Test\n    public void testGson() {\n        Gson gson = new Gson();\n        String json = gson.toJson(buildSKV());\n        String vector1 = \"\\\"vector\\\":{\\\"foo\\\":1.0,\\\"bar\\\":-2.0}\";\n        String vector2 = \"\\\"vector\\\":{\\\"bar\\\":-2.0,\\\"foo\\\":1.0}\";\n        String fks = \"\\\"freezeKeySet\\\":false\";\n        assertTrue(json.contains(vector1) || json.contains(vector2));\n        assertTrue(json.contains(fks));\n    }\n\n}\n"
  },
  {
    "path": "src/test/java/com/etsy/conjecture/evaluation/TestReceiverOperatingCharacteristic.java",
    "content": "package com.etsy.conjecture.evaluation;\n\nimport static org.junit.Assert.assertEquals;\n\nimport org.junit.Test;\n\npublic class TestReceiverOperatingCharacteristic {\n\n    static double[] labels = { 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,\n            0, 0 };\n    static double[] predictions = { 0.80962, 0.48458, 0.65812, 0.16117,\n            0.47375, 0.26587, 0.71517, 0.63866, 0.36296, 0.89639, 0.35936,\n            0.22413, 0.36402, 0.41459, 0.83148, 0.23271, 0.23271, 0.23271 };\n\n    // from scikit learn\n    static double AUC = 0.97402597402597402;\n\n    @Test\n    public void testAUC() {\n        ReceiverOperatingCharacteristic roc = new ReceiverOperatingCharacteristic();\n        for (int i = 0; i < labels.length; i++) {\n            roc.add(labels[i], predictions[i]);\n        }\n\n        assertEquals(AUC, roc.binaryAUC(), 0.0000001);\n    }\n\n}"
  },
  {
    "path": "src/test/java/com/etsy/conjecture/model/UpdateableLinearModelTest.java",
    "content": "package com.etsy.conjecture.model;\n\nimport static org.junit.Assert.assertEquals;\nimport static org.junit.Assert.assertTrue;\nimport com.etsy.conjecture.data.StringKeyedVector;\n\nimport org.junit.Test;\n\nimport com.etsy.conjecture.data.BinaryLabeledInstance;\n\npublic class UpdateableLinearModelTest {\n\n    final double eps = 0.000001;\n    final SGDOptimizer optimizer = new ElasticNetOptimizer();\n\n    BinaryLabeledInstance getPositiveInstance() {\n        BinaryLabeledInstance bli = new BinaryLabeledInstance(1.0);\n        bli.setCoordinate(\"foo\", 1.0);\n        bli.setCoordinate(\"bar\", 2.0);\n        return bli;\n    }\n\n    BinaryLabeledInstance getNegativeInstance() {\n        BinaryLabeledInstance bli = new BinaryLabeledInstance(0.0);\n        bli.setCoordinate(\"foo\", 1.0);\n        bli.setCoordinate(\"baz\", -1.0);\n        return bli;\n    }\n\n    @Test\n    public void testLogisticRegressionBasic() {\n        LogisticRegression slr = new LogisticRegression(optimizer);\n        // perform one update and check parameter values.\n        double eta = slr.optimizer.getDecreasingLearningRate(slr.epoch);\n        slr.update(getPositiveInstance());\n        assertEquals(eta * 0.5, slr.getParam().getCoordinate(\"foo\"), eps);\n        assertEquals(eta * 1.0, slr.getParam().getCoordinate(\"bar\"), eps);\n        assertTrue(slr.predict(getPositiveInstance().getVector()).getValue() > 0.5);\n        // perform a second update.\n        slr.update(getNegativeInstance());\n        assertTrue(slr.predict(getPositiveInstance().getVector()).getValue() > 0.5);\n        assertTrue(slr.predict(getNegativeInstance().getVector()).getValue() < 0.5);\n    }\n\n    @Test\n    public void testLogisticRegressionLaplaceRegularization() {\n        SGDOptimizer laplaceOptimizer = optimizer.setLaplaceRegularizationWeight(0.1);\n        LogisticRegression slr = new LogisticRegression(laplaceOptimizer);\n        // perform one update and check parameter 
values.\n        double eta = slr.optimizer.getDecreasingLearningRate(slr.epoch);\n        slr.update(getPositiveInstance());\n        assertEquals(eta * 0.5, slr.getParam().getCoordinate(\"foo\"), eps);\n        assertEquals(eta * 1.0, slr.getParam().getCoordinate(\"bar\"), eps);\n        double eta2 = slr.optimizer.getDecreasingLearningRate(slr.epoch);\n        slr.update(getNegativeInstance());\n        assertEquals(eta * 1.0 - eta2 * 0.1, slr.getParam()\n                .getCoordinate(\"bar\"), eps);\n        // update with a different example enough times to make bar -> 0.\n        for (int i = 0; i < 10; i++) {\n            slr.update(getNegativeInstance());\n        }\n        assertEquals(2, slr.getParam().size());\n        assertEquals(0.0, slr.getParam().getCoordinate(\"bar\"), eps);\n    }\n\n    @Test\n    public void testLogisticRegressionGaussianRegularization() {\n        SGDOptimizer gaussianOptimizer = optimizer.setGaussianRegularizationWeight(0.2);\n        LogisticRegression slr = new LogisticRegression(gaussianOptimizer);\n        // perform one update and check parameter values.\n        double eta = slr.optimizer.getDecreasingLearningRate(slr.epoch);\n        slr.update(getPositiveInstance());\n        assertEquals(eta * 0.5, slr.getParam().getCoordinate(\"foo\"), eps);\n        assertEquals(eta * 1.0, slr.getParam().getCoordinate(\"bar\"), eps);\n        double eta2 = slr.optimizer.getDecreasingLearningRate(slr.epoch);\n        slr.update(getNegativeInstance());\n        assertEquals(eta * 1.0 * (1.0 - eta2 * 0.2), slr.getParam()\n                .getCoordinate(\"bar\"), eps);\n    }\n\n    @Test\n    public void testPerceptronBasic() {\n        Hinge p = new Hinge(optimizer).setThreshold(0.0);\n        // perform one update and check parameter values.\n        double eta = p.optimizer.getDecreasingLearningRate(p.epoch);\n        p.update(getPositiveInstance());\n        assertEquals(eta * 1.0, p.getParam().getCoordinate(\"foo\"), eps);\n     
   assertEquals(eta * 2.0, p.getParam().getCoordinate(\"bar\"), eps);\n        assertTrue(p.predict(getPositiveInstance().getVector()).getValue() > 0.5);\n        // perform a second update.\n        p.update(getNegativeInstance());\n        assertTrue(p.predict(getPositiveInstance().getVector()).getValue() > 0.5);\n        assertTrue(p.predict(getNegativeInstance().getVector()).getValue() < 0.5);\n    }\n\n    public void testInstanceNotModified(UpdateableLinearModel model) {\n        BinaryLabeledInstance instance = getPositiveInstance();\n        StringKeyedVector instanceCopy = instance.getVector().copy();\n        model.update(instance);\n        assertEquals(instance.getVector().getCoordinate(\"foo\"), instanceCopy.getCoordinate(\"foo\"), 0.0);\n        assertEquals(instance.getVector().getCoordinate(\"bar\"), instanceCopy.getCoordinate(\"bar\"), 0.0);\n    }\n\n    @Test\n    public void testInstanceNotModifiedByOptimizer() {\n        ElasticNetOptimizer eOptimizer = new ElasticNetOptimizer();\n        LogisticRegression eModel = new LogisticRegression(eOptimizer);\n        testInstanceNotModified(eModel);\n\n        FTRLOptimizer ftrlOptimizer = new FTRLOptimizer();\n        LogisticRegression fModel = new LogisticRegression(ftrlOptimizer);\n        testInstanceNotModified(fModel);\n\n        AdagradOptimizer adagradOptimizer = new AdagradOptimizer();\n        LogisticRegression aModel = new LogisticRegression(adagradOptimizer);\n        testInstanceNotModified(aModel);\n\n        MIRA mModel = new MIRA();\n        testInstanceNotModified(mModel);\n    }\n\n    @Test\n    public void testInstanceNotModifiedByModel() {\n        LogisticRegression lrModel = new LogisticRegression(optimizer);\n        testInstanceNotModified(lrModel);\n\n        LeastSquaresRegressionModel lsModel = new LeastSquaresRegressionModel(optimizer);\n        testInstanceNotModified(lsModel);\n\n        Hinge hModel = new Hinge(optimizer);\n        testInstanceNotModified(hModel);\n   
 }\n\n}\n"
  }
]