[
  {
    "path": ".gitignore",
    "content": ".rvmrc\ncoverage/\n.DS_Store\n*.gem\nutils.rb\nGemfile.lock\n"
  },
  {
    "path": "Gemfile",
    "content": "source \"http://rubygems.org\"\n\ngemspec\n"
  },
  {
    "path": "LICENSE.txt",
    "content": "Copyright (c) 2012 Alexandru Nedelcu\n\nPermission is hereby granted, free of charge, to any person obtaining\na copy of this software and associated documentation files (the\n\"Software\"), to deal in the Software without restriction, including\nwithout limitation the rights to use, copy, modify, merge, publish,\ndistribute, sublicense, and/or sell copies of the Software, and to\npermit persons to whom the Software is furnished to do so, subject to\nthe following conditions:\n\nThe above copyright notice and this permission notice shall be\nincluded in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND,\nEXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF\nMERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND\nNONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE\nLIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION\nOF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION\nWITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# stuff-classifier\n\n## No longer maintained\n\nThis repository is no longer maintained for some time. If you're interested in maintaining a fork, contact the author such that I can place a link here.\n\n## Description\n\nA library for classifying text into multiple categories.\n\nCurrently provided classifiers:\n\n- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)\n- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)\n\nRan a benchmark of 1345 items that I have previously manually\nclassified with multiple categories. Here's the rate over which the 2\nalgorithms have correctly detected one of those categories:\n\n- Bayes: 79.26%\n- Tf-Idf: 81.34%\n\nI prefer the Naive Bayes approach, because while having lower stats on\nthis benchmark, it seems to make better decisions than I did in many\ncases. For example, an item with title *\"Paintball Session, 100 Balls\nand Equipment\"* was classified as *\"Activities\"* by me, but the bayes\nclassifier identified it as *\"Sports\"*, at which point I had an\nintellectual orgasm. Also, the Tf-Idf classifier seems to do better on\nclear-cut cases, but doesn't seem to handle uncertainty so well. Of\ncourse, these are just quick tests I made and I have no idea which is\nreally better.\n\n## Install\n\n```bash\ngem install stuff-classifier\n```\n\n## Usage\n\nYou either instantiate one class or the other. 
Both have the same\nsignature:\n\n```ruby\nrequire 'stuff-classifier'\n\n# for the naive bayes implementation\ncls = StuffClassifier::Bayes.new(\"Cats or Dogs\")\n\n# for the Tf-Idf based implementation\ncls = StuffClassifier::TfIdf.new(\"Cats or Dogs\")\n\n# these classifiers use word stemming by default, but if it has weird\n# behavior, then you can disable it on init:\ncls = StuffClassifier::TfIdf.new(\"Cats or Dogs\", :stemming => false)\n\n# also by default, the parsing phase filters out stop words; to\n# disable that or to supply your own list of stop words, set it on\n# the classifier's tokenizer:\ncls.tokenizer.ignore_words = [ 'the', 'my', 'i', 'dont' ]\n```\n\nTraining the classifier:\n\n```ruby\ncls.train(:dog, \"Dogs are awesome, cats too. I love my dog\")\ncls.train(:cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\")\ncls.train(:dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\")\ncls.train(:cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\")\ncls.train(:dog, \"So which one should you choose? 
A dog, definitely.\")\ncls.train(:cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\")\ncls.train(:dog, \"A dog will eat anything, including birds or whatever meat\")\ncls.train(:cat, \"My cat's favorite place to purr is on my keyboard\")\ncls.train(:dog, \"My dog's favorite place to take a leak is the tree in front of our house\")\n```\n\nAnd finally, classifying stuff:\n\n```ruby\ncls.classify(\"This test is about cats.\")\n#=> :cat\ncls.classify(\"I hate ...\")\n#=> :cat\ncls.classify(\"The most annoying animal on earth.\")\n#=> :cat\ncls.classify(\"The preferred company of software developers.\")\n#=> :cat\ncls.classify(\"My precious, my favorite!\")\n#=> :cat\ncls.classify(\"Get off my keyboard!\")\n#=> :cat\ncls.classify(\"Kill that bird!\")\n#=> :cat\n\ncls.classify(\"This test is about dogs.\")\n#=> :dog\ncls.classify(\"Cats or Dogs?\") \n#=> :dog\ncls.classify(\"What pet will I love more?\")    \n#=> :dog\ncls.classify(\"Willy, where the heck are you?\")\n#=> :dog\ncls.classify(\"I like big buts and I cannot lie.\") \n#=> :dog\ncls.classify(\"Why is the front door of our house open?\")\n#=> :dog\ncls.classify(\"Who is eating my meat?\")\n#=> :dog\n```\n\n## Persistency\n\nThe following layers for saving the training data between sessions are\nimplemented:\n\n- in memory (by default)\n- on disk\n- Redis\n- (coming soon) in a RDBMS\n\nTo persist the data in Redis, you can do this:\n```ruby\n# defaults to redis running on localhost on default port\nstore = StuffClassifier::RedisStorage.new(@key)\n\n# pass in connection args\nstore = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})\n```\n\nTo persist the data on disk, you can do this:\n\n```ruby\nstore = StuffClassifier::FileStorage.new(@storage_path)\n\n# global setting\nStuffClassifier::Base.storage = store\n\n# or alternative local setting on instantiation, by means of an\n# optional param ...\ncls = 
StuffClassifier::Bayes.new(\"Cats or Dogs\", :storage => store)\n\n# after training is done, to persist the data ...\ncls.save_state\n\n# or you could just do this:\nStuffClassifier::Bayes.open(\"Cats or Dogs\") do |cls|\n  # when the block returns, save_state is called\nend\n\n# to start fresh, deleting the saved training data for this classifier\nStuffClassifier::Bayes.new(\"Cats or Dogs\", :purge_state => true)\n```\n\nThe name you give your classifier is important, because the\ndata is loaded and saved based on it. For instance, the following 3 classifiers\nare stored in different buckets, independent of each other.\n\n```ruby\ncls1 = StuffClassifier::Bayes.new(\"Cats or Dogs\")\ncls2 = StuffClassifier::Bayes.new(\"True or False\")\ncls3 = StuffClassifier::Bayes.new(\"Spam or Ham\")\n```\n\n## License\n\nMIT Licensed. See LICENSE.txt for details.\n\n\n"
  },
  {
    "path": "Rakefile",
    "content": "require 'bundler/setup'\nrequire 'rake/testtask'\nrequire 'stuff-classifier'\n\nRake::TestTask.new(:test) do |test|\n  test.libs << 'lib' << 'test'\n  test.pattern = 'test/**/test_*.rb'\n  test.verbose = true\nend\n\ntask :default => :test\n\n"
  },
  {
    "path": "lib/stuff-classifier/base.rb",
    "content": "# -*- encoding : utf-8 -*-\n\nclass StuffClassifier::Base\n  extend StuffClassifier::Storage::ActAsStorable\n  attr_reader :name\n  attr_reader :word_list\n  attr_reader :category_list\n  attr_reader :training_count\n\n  attr_accessor :tokenizer\n  attr_accessor :language\n\n  attr_accessor :thresholds\n  attr_accessor :min_prob\n\n\n  storable :version,:word_list,:category_list,:training_count,:thresholds,:min_prob\n\n  # opts :\n  # language\n  # stemming : true | false\n  # weight\n  # assumed_prob\n  # storage\n  # purge_state ?\n\n  def initialize(name, opts={})\n    @version = StuffClassifier::VERSION\n\n    @name = name\n\n    # This values are nil or are loaded from storage\n    @word_list = {}\n    @category_list = {}\n    @training_count=0\n\n    # storage\n    purge_state = opts[:purge_state]\n    @storage = opts[:storage] || StuffClassifier::Base.storage\n    unless purge_state\n      @storage.load_state(self)\n    else\n      @storage.purge_state(self)\n    end\n\n    # This value can be set during initialization or overrided after load_state\n    @thresholds = opts[:thresholds] || {}\n    @min_prob = opts[:min_prob] || 0.0\n\n\n    @ignore_words = nil\n    @tokenizer = StuffClassifier::Tokenizer.new(opts)\n\n  end\n\n  def incr_word(word, category)\n    @word_list[word] ||= {}\n\n    @word_list[word][:categories] ||= {}\n    @word_list[word][:categories][category] ||= 0\n    @word_list[word][:categories][category] += 1\n\n    @word_list[word][:_total_word] ||= 0\n    @word_list[word][:_total_word] += 1\n\n\n    # words count by categroy\n    @category_list[category] ||= {}\n    @category_list[category][:_total_word] ||= 0\n    @category_list[category][:_total_word] += 1\n\n  end\n\n  def incr_cat(category)\n    @category_list[category] ||= {}\n    @category_list[category][:_count] ||= 0\n    @category_list[category][:_count] += 1\n\n    @training_count ||= 0\n    @training_count += 1\n\n  end\n\n  # return number of times the word 
appears in a category\n  def word_count(word, category)\n    return 0.0 unless @word_list[word] && @word_list[word][:categories] && @word_list[word][:categories][category]\n    @word_list[word][:categories][category].to_f\n  end\n\n  # return the number of times the word appears in all categories\n  def total_word_count(word)\n    return 0.0 unless @word_list[word] && @word_list[word][:_total_word]\n    @word_list[word][:_total_word].to_f\n  end\n\n  # return the number of words in a category\n  def total_word_count_in_cat(cat)\n    return 0.0 unless @category_list[cat] && @category_list[cat][:_total_word]\n    @category_list[cat][:_total_word].to_f\n  end\n\n  # return the number of training items\n  def total_cat_count\n    @training_count\n  end\n\n  # return the number of training documents for a category\n  # (0.0 for a category that was never trained)\n  def cat_count(category)\n    return 0.0 unless @category_list[category] && @category_list[category][:_count]\n    @category_list[category][:_count].to_f\n  end\n\n  # return the number of categories in which a word appears\n  def categories_with_word_count(word)\n    return 0 unless @word_list[word] && @word_list[word][:categories]\n    @word_list[word][:categories].length\n  end\n\n  # return the number of categories\n  def total_categories\n    categories.length\n  end\n\n  # return categories list\n  def categories\n    @category_list.keys\n  end\n\n  # train the classifier\n  def train(category, text)\n    @tokenizer.each_word(text) {|w| incr_word(w, category) }\n    incr_cat(category)\n  end\n\n  # classify a text\n  def classify(text, default=nil)\n    # Find the category with the highest probability\n    max_prob = @min_prob\n    best = nil\n\n    scores = cat_scores(text)\n    scores.each do |score|\n      cat, prob = score\n      if prob > max_prob\n        max_prob = prob\n        best = cat\n      end\n    end\n\n    # Return the default category in case the threshold condition was\n    # not met. 
For example, if the threshold for :spam is 1.2\n    #\n    #    :spam => 0.73, :ham => 0.40  (OK)\n    #    :spam => 0.80, :ham => 0.70  (Fail, :ham is too close)\n\n    return default unless best\n\n    threshold = @thresholds[best] || 1.0\n\n    scores.each do |score|\n      cat, prob = score\n      next if cat == best\n      return default if prob * threshold > max_prob\n    end\n\n    return best\n  end\n\n  def save_state\n    @storage.save_state(self)\n  end\n\n  class << self\n    attr_writer :storage\n\n    def storage\n      @storage = StuffClassifier::InMemoryStorage.new unless defined? @storage\n      @storage\n    end\n\n    def open(name)\n      inst = self.new(name)\n      if block_given?\n        yield inst\n        inst.save_state\n      else\n        inst\n      end\n    end\n  end\nend\n"
  },
  {
    "path": "lib/stuff-classifier/bayes.rb",
    "content": "# -*- encoding : utf-8 -*-\n\nclass StuffClassifier::Bayes < StuffClassifier::Base\n  attr_accessor :weight\n  attr_accessor :assumed_prob\n\n\n  # http://en.wikipedia.org/wiki/Naive_Bayes_classifier\n  extend StuffClassifier::Storage::ActAsStorable\n  storable :weight,:assumed_prob\n\n  def initialize(name, opts={})\n    super(name, opts)\n    @weight = opts[:weight] || 1.0\n    @assumed_prob = opts[:assumed_prob] || 0.1\n  end\n\n  def word_prob(word, cat)\n    total_words_in_cat = total_word_count_in_cat(cat)\n    return 0.0 if total_words_in_cat == 0\n    word_count(word, cat).to_f / total_words_in_cat\n  end\n\n\n  def word_weighted_average(word, cat, opts={})\n    func = opts[:func]\n\n    # calculate current probability\n    basic_prob = func ? func.call(word, cat) : word_prob(word, cat)\n\n    # count the number of times this word has appeared in all\n    # categories\n    totals = total_word_count(word)\n\n    # the final weighted average\n    (@weight * @assumed_prob + totals * basic_prob) / (@weight + totals)\n  end\n\n  def doc_prob(text, category)\n    @tokenizer.each_word(text).map {|w|\n      word_weighted_average(w, category)\n    }.inject(1) {|p,c| p * c}\n  end\n\n  def text_prob(text, category)\n    cat_prob = cat_count(category) / total_cat_count\n    doc_prob = doc_prob(text, category)\n    cat_prob * doc_prob\n  end\n\n  def cat_scores(text)\n    probs = {}\n    categories.each do |cat|\n      probs[cat] = text_prob(text, cat)\n    end\n    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}\n  end\n\n\n  def word_classification_detail(word)\n\n    p \"word_prob\"\n    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end\n    p result\n\n    p \"word_weighted_average\"\n    result=categories.inject({}) do |h,cat| h[cat]=word_weighted_average(word,cat);h end\n    p result\n\n    p \"doc_prob\"\n    result=categories.inject({}) do |h,cat| h[cat]=doc_prob(word,cat);h end\n    p result\n\n    p 
\"text_prob\"\n    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end\n    p result\n\n\n  end\n\nend\n"
  },
  {
    "path": "lib/stuff-classifier/storage.rb",
    "content": "# -*- encoding : utf-8 -*-\nmodule StuffClassifier\n\n  class Storage\n    module ActAsStorable\n        def storable(*to_store)\n          @to_store = to_store\n        end\n        def to_store\n          @to_store || []\n        end\n    end\n\n    attr_accessor :storage\n\n    def initialize(*opts)\n      @storage = {}\n    end\n\n    def storage_to_classifier(classifier)\n      if @storage.key? classifier.name\n        @storage[classifier.name].each do |var,value|\n          classifier.instance_variable_set \"@#{var}\",value\n        end\n      end\n    end\n\n    def classifier_to_storage(classifier)\n      to_store = classifier.class.to_store + classifier.class.superclass.to_store\n      @storage[classifier.name] =  to_store.inject({}) {|h,var| h[var] = classifier.instance_variable_get(\"@#{var}\");h}\n    end\n\n    def clear_storage(classifier)\n      @storage.delete(classifier.name)\n    end\n\n  end\n\n  class InMemoryStorage < Storage\n    def initialize\n      super\n    end\n\n    def load_state(classifier)\n      storage_to_classifier(classifier)\n    end\n\n    def save_state(classifier)\n      classifier_to_storage(classifier)\n    end\n\n    def purge_state(classifier)\n      clear_storage(classifier)\n    end\n\n  end\n\n  class FileStorage < Storage\n    def initialize(path)\n      super\n      @path = path\n    end\n\n    def load_state(classifier)\n      if @storage.length == 0 && File.exists?(@path)\n        data = File.open(@path, 'rb') { |f| f.read }\n        @storage = Marshal.load(data)\n      end\n      storage_to_classifier(classifier)\n    end\n\n    def save_state(classifier)\n      classifier_to_storage(classifier)\n      _write_to_file\n    end\n\n    def purge_state(classifier)\n      clear_storage(classifier)\n      _write_to_file\n    end\n\n    def _write_to_file\n      File.open(@path, 'wb') do |fh|\n        fh.flock(File::LOCK_EX)\n        fh.write(Marshal.dump(@storage))\n      end\n    end\n\n  end\n\n  class 
RedisStorage < Storage\n    def initialize(key, redis_options=nil)\n      super\n      # load the redis gem lazily, only when this storage is used\n      require 'redis'\n      @key = key\n      @redis = Redis.new(redis_options || {})\n    end\n\n    def load_state(classifier)\n      if @storage.length == 0\n        # GET returns nil for a missing key, so no separate EXISTS call is needed\n        data = @redis.get(@key)\n        @storage = Marshal.load(data) if data\n      end\n      storage_to_classifier(classifier)\n    end\n\n    def save_state(classifier)\n      classifier_to_storage(classifier)\n      _write_to_redis\n    end\n\n    def purge_state(classifier)\n      clear_storage(classifier)\n      _write_to_redis\n    end\n\n    private\n    def _write_to_redis\n      data = Marshal.dump(@storage)\n      @redis.set(@key, data)\n    end\n  end\nend\n"
  },
  {
    "path": "lib/stuff-classifier/tf-idf.rb",
    "content": "# -*- encoding : utf-8 -*-\nclass StuffClassifier::TfIdf < StuffClassifier::Base\n  extend StuffClassifier::Storage::ActAsStorable\n\n  def initialize(name, opts={})\n    super(name, opts)\n  end\n\n\n  def word_prob(word, cat)\n    word_cat_nr = word_count(word, cat)\n    cat_nr = cat_count(cat)\n\n    tf = 1.0 * word_cat_nr / cat_nr\n\n    idf = Math.log10((total_categories + 2) / (categories_with_word_count(word) + 1.0))\n    tf * idf\n  end\n\n  def text_prob(text, cat)\n    @tokenizer.each_word(text).map{|w| word_prob(w, cat)}.inject(0){|s,p| s + p}\n  end\n\n  def cat_scores(text)\n    probs = {}\n    categories.each do |cat|\n      p = text_prob(text, cat)\n      probs[cat] = p\n    end\n    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}\n  end\n\n  def word_classification_detail(word)\n\n    p \"tf_idf\"\n    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end\n    ap result\n\n    p \"text_prob\"\n    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end\n    ap result\n\n  end\n\nend\n"
  },
  {
    "path": "lib/stuff-classifier/tokenizer/tokenizer_properties.rb",
    "content": "# -*- encoding : utf-8 -*-\nrequire 'set'\nStuffClassifier::Tokenizer::TOKENIZER_PROPERTIES = {\n  \"en\" => {\n    :preprocessing_regexps => {/['`]/ => '',/[_]/ => ' '},\n    :stop_word => Set.new([\n                            '的','个','得',\n                            'a', 'about', 'above', 'across', 'after', 'afterwards',\n                            'again', 'against', 'all', 'almost', 'alone', 'along',\n                            'already', 'also', 'although', 'always', 'am', 'among',\n                            'amongst', 'amoungst', 'amount', 'an', 'and', 'another',\n                            'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere',\n                            'are', 'around', 'as', 'at', 'back', 'be',\n                            'became', 'because', 'become', 'becomes', 'becoming', 'been',\n                            'before', 'beforehand', 'behind', 'being', 'below', 'beside',\n                            'besides', 'between', 'beyond', 'bill', 'both', 'bottom',\n                            'but', 'by', 'call', 'can', 'cannot', 'cant', 'dont',\n                            'co', 'computer', 'con', 'could', 'couldnt', 'cry',\n                            'de', 'describe', 'detail', 'do', 'done', 'down',\n                            'due', 'during', 'each', 'eg', 'eight', 'either',\n                            'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every',\n                            'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen',\n                            'fify', 'fill', 'find', 'fire', 'first', 'five',\n                            'for', 'former', 'formerly', 'forty', 'found', 'four',\n                            'from', 'front', 'full', 'further', 'get', 'give',\n                            'go', 'had', 'has', 'hasnt', 'have', 'he',\n                            'hence', 'her', 'here', 'hereafter', 'hereby', 'herein',\n                            'hereupon', 
'hers', 'herself', 'him', 'himself', 'his',\n                            'how', 'however', 'hundred', 'i', 'ie', 'if',\n                            'in', 'inc', 'indeed', 'interest', 'into', 'is',\n                            'it', 'its', 'itself', 'keep', 'last', 'latter',\n                            'latterly', 'least', 'less', 'ltd', 'made', 'many',\n                            'may', 'me', 'meanwhile', 'might', 'mill', 'mine',\n                            'more', 'moreover', 'most', 'mostly', 'move', 'much',\n                            'must', 'my', 'myself', 'name', 'namely', 'neither',\n                            'never', 'nevertheless', 'next', 'nine', 'no', 'nobody',\n                            'none', 'noone', 'nor', 'not', 'nothing', 'now',\n                            'nowhere', 'of', 'off', 'often', 'on', 'once',\n                            'one', 'only', 'onto', 'or', 'other', 'others',\n                            'otherwise', 'our', 'ours', 'ourselves', 'out', 'over',\n                            'own', 'part', 'per', 'perhaps', 'please', 'put',\n                            'rather', 're', 'same', 'see', 'seem', 'seemed',\n                            'seeming', 'seems', 'serious', 'several', 'she', 'should',\n                            'show', 'side', 'since', 'sincere', 'six', 'sixty',\n                            'so', 'some', 'somehow', 'someone', 'something', 'sometime',\n                            'sometimes', 'somewhere', 'still', 'such', 'system', 'take',\n                            'ten', 'than', 'that', 'the', 'their', 'them',\n                            'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',\n                            'therefore', 'therein', 'thereupon', 'these', 'they', 'thick',\n                            'thin', 'third', 'this', 'those', 'though', 'three',\n                            'through', 'throughout', 'thru', 'thus', 'to', 'together',\n                            'too', 'top', 'toward', 
'towards', 'twelve', 'twenty',\n                            'two', 'un', 'under', 'until', 'up', 'upon',\n                            'us', 'very', 'via', 'was', 'we', 'well',\n                            'were', 'what', 'whatever', 'when', 'whence', 'whenever',\n                            'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',\n                            'wherever', 'whether', 'which', 'while', 'whither', 'who',\n                            'whoever', 'whole', 'whom', 'whose', 'why', 'will',\n                            'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours',\n                            'yourself', 'yourselves'\n    ])\n  },\n  \"fr\" => {\n    :stop_word => Set.new([\n                            'au',  'aux',  'avec',  'ce',  'ces',  'dans',  'de',  'des',  'du',  'elle',  'en',  'et',  'eux',\n                            'il',  'je',  'la',  'le',  'leur',  'lui',  'ma',  'mais',  'me',  'même',  'mes',  'moi',  'mon',\n                            'ne',  'nos',  'notre',  'nous',  'on',  'ou',  'par',  'pas',  'pour',  'qu',  'que',  'qui',  'sa',\n                            'se',  'ses',  'son',  'sur',  'ta',  'te',  'tes',  'toi',  'ton',  'tu',  'un',  'une',  'vos',  'votre',\n                            'vous',  'c',  'd',  'j',  'l',  'à',  'm',  'n',  's',  't',  'y',  'été',  'étée',  'étées',\n                            'étés',  'étant',  'suis',  'es',  'est',  'sommes',  'êtes',  'sont',  'serai',  'seras',\n                            'sera',  'serons',  'serez',  'seront',  'serais',  'serait',  'serions',  'seriez',  'seraient',\n                            'étais',  'était',  'étions',  'étiez',  'étaient',  'fus',  'fut',  'fûmes',  'fûtes',\n                            'furent',  'sois',  'soit',  'soyons',  'soyez',  'soient',  'fusse',  'fusses',  'fût',\n                            'fussions',  'fussiez',  'fussent',  'ayant',  'eu',  'eue',  'eues',  'eus',  'ai',  'as',\n      
                      'avons',  'avez',  'ont',  'aurai',  'auras',  'aura',  'aurons',  'aurez',  'auront',  'aurais',\n                            'aurait',  'aurions',  'auriez',  'auraient',  'avais',  'avait',  'avions',  'aviez',  'avaient',\n                            'eut',  'eûmes',  'eûtes',  'eurent',  'aie',  'aies',  'ait',  'ayons',  'ayez',  'aient',  'eusse',\n                            'eusses',  'eût',  'eussions',  'eussiez',  'eussent',  'ceci',  'celà ',  'cet',  'cette',  'ici',\n                            'ils',  'les',  'leurs',  'quel',  'quels',  'quelle',  'quelles',  'sans',  'soi'\n    ])\n  },\n  \"de\" => {\n    :stop_word => Set.new([\n                            'aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere',\n                            'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf',\n                            'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das',\n                            'daß', 'dass', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe',\n                            'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du',\n                            'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen',\n                            'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es',\n                            'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat',\n                            'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres',\n                            'euch', 'im', 'in', 'indem', 'ins', 
'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener',\n                            'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche',\n                            'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach',\n                            'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst',\n                            'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über',\n                            'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst',\n                            'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will',\n                            'wir', 'wird', 'wirst', 'wo', 'wollen', 'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen'\n    ])\n  }\n}\n"
  },
  {
    "path": "lib/stuff-classifier/tokenizer.rb",
    "content": "# -*- encoding : utf-8 -*-\nrequire \"lingua/stemmer\"\nrequire \"rseg\"\n\nclass StuffClassifier::Tokenizer\n  require  \"stuff-classifier/tokenizer/tokenizer_properties\"\n\n  def initialize(opts={})\n    @language = opts.key?(:language) ? opts[:language] : \"en\"\n    @properties = StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES[@language]\n\n    @stemming = opts.key?(:stemming) ? opts[:stemming] : true\n    if @stemming\n      @stemmer = Lingua::Stemmer.new(:language => @language)\n    end\n  end\n\n  def language\n    @language\n  end\n\n  def preprocessing_regexps=(value)\n    @preprocessing_regexps = value\n  end\n\n  def preprocessing_regexps\n    @preprocessing_regexps || @properties[:preprocessing_regexps]\n  end\n\n  def ignore_words=(value)\n    @ignore_words = value\n  end\n\n  def ignore_words\n    @ignore_words || @properties[:stop_word]\n  end\n\n  def stemming?\n    @stemming || false\n  end\n\n  def each_word(string)\n    string = string.strip\n    return if string == ''\n\n    words = []\n\n    # tokenize string\n    string.split(\"\\n\").each do |line|\n\n      # Apply preprocessing regexps\n      if preprocessing_regexps\n        preprocessing_regexps.each { |regexp,replace_by| line.gsub!(regexp, replace_by) }\n      end\n\n      Rseg.segment(line).each do |w|\n        next if w == '' || ignore_words.member?(w.downcase)\n\n        if stemming? and stemable?(w)\n          w = @stemmer.stem(w).downcase\n          next if ignore_words.member?(w)\n        else\n          w = w.downcase\n        end\n\n        words << (block_given? ? (yield w) : w)\n      end\n    end\n\n    return words\n  end\n\n  private\n\n  def stemable?(word)\n    true\n    word =~ /^\\p{Alpha}+$/\n  end\n\nend\n"
  },
  {
    "path": "lib/stuff-classifier/version.rb",
    "content": "module StuffClassifier\n  VERSION = '0.5'\nend\n"
  },
  {
    "path": "lib/stuff-classifier.rb",
    "content": "# -*- encoding : utf-8 -*-\nmodule StuffClassifier\n  autoload :VERSION,    'stuff-classifier/version'\n\n  autoload :Storage, 'stuff-classifier/storage'\n  autoload :InMemoryStorage, 'stuff-classifier/storage'\n  autoload :FileStorage,     'stuff-classifier/storage'\n  autoload :RedisStorage, 'stuff-classifier/storage'\n\n  autoload :Tokenizer,  'stuff-classifier/tokenizer'\n  autoload :TOKENIZER_PROPERTIES, 'stuff-classifier/tokenizer/tokenizer_properties'\n\n  autoload :Base,       'stuff-classifier/base'\n  autoload :Bayes,      'stuff-classifier/bayes'\n  autoload :TfIdf,      'stuff-classifier/tf-idf'\n\nend\n"
  },
  {
    "path": "stuff-classifier.gemspec",
    "content": "# -*- encoding: utf-8 -*-\n$:.push File.expand_path(\"../lib\", __FILE__)\nrequire \"stuff-classifier/version\"\n\nGem::Specification.new do |s|\n  s.name        = \"stuff-classifier\"\n  s.version     = StuffClassifier::VERSION\n  s.authors     = [\"Alexandru Nedelcu\"]\n  s.email       = [\"github@contact.bionicspirit.com\"]\n  s.homepage    = \"https://github.com/alexandru/stuff-classifier/\"\n  s.summary     = %q{Simple text classifier(s) implemetation}\n  s.description = %q{2 methods are provided for now - (1) naive bayes implementation + (2) tf-idf weights}\n\n  s.files         = `git ls-files`.split(\"\\n\")\n  s.test_files    = `git ls-files -- {test,spec,features}/*`.split(\"\\n\")\n  s.executables   = `git ls-files -- bin/*`.split(\"\\n\").map{ |f| File.basename(f) }\n  s.require_paths = [\"lib\"]\n\n  s.required_ruby_version = '>= 1.9.1'\n\n  s.add_runtime_dependency \"ruby-stemmer\"\n  s.add_runtime_dependency \"sequel\"\n  s.add_runtime_dependency \"redis\"\n\n\n  s.add_development_dependency \"bundler\"\n  s.add_development_dependency \"rake\", \">= 0.9.2\"\n  s.add_development_dependency \"minitest\", \"~> 4\"\n  s.add_development_dependency \"turn\", \">= 0.8.3\"\n  s.add_development_dependency \"simplecov\"\n  s.add_development_dependency \"awesome_print\"\n  s.add_development_dependency \"ruby-debug19\"\n  s.add_development_dependency \"rseg\"\n\nend\n\n"
  },
  {
    "path": "test/helper.rb",
    "content": "# -*- encoding : utf-8 -*-\nrequire 'simplecov'\nSimpleCov.start\n\nrequire 'turn'\nrequire 'minitest/autorun'\nrequire 'stuff-classifier'\n\nTurn.config do |c|\n # use one of output formats:\n # :outline  - turn's original case/test outline mode [default]\n # :progress - indicates progress with progress bar\n # :dotted   - test/unit's traditional dot-progress mode\n # :pretty   - new pretty reporter\n # :marshal  - dump output as YAML (normal run mode only)\n # :cue      - interactive testing\n c.format  = :cue\n # turn on invoke/execute tracing, enable full backtrace\n c.trace   = true\n # use humanized test names (works only with :outline format)\n c.natural = true\nend\n\nclass TestBase < MiniTest::Unit::TestCase\n  def self.before(&block)\n    @on_setup = block if block\n    @on_setup\n  end\n\n  def setup\n    on_setup = self.class.before\n    instance_eval(&on_setup) if on_setup\n  end\n\n  def set_classifier(instance)\n    @classifier = instance\n  end\n  def classifier\n    @classifier\n  end\n\n\n  def train(category, value)\n    @classifier.train(category, value)\n  end\n\n  def should_be(category, value)\n    assert_equal category, @classifier.classify(value), value\n  end\nend\n"
  },
  {
    "path": "test/test_001_tokenizer.rb",
    "content": "# -*- coding: utf-8 -*-\nrequire './helper.rb'\n\nclass Test001Tokenizer < TestBase\n  before do\n    @en_tokenizer = StuffClassifier::Tokenizer.new\n    @fr_tokenizer = StuffClassifier::Tokenizer.new(:language => \"fr\")\n  end\n\n  def test_simple_tokens\n     words =  @en_tokenizer.each_word('Hello world! How are you?')\n     should_return = [\"hello\", \"world\"]\n\n     assert_equal should_return, words\n  end\n\n  def test_with_stemming\n    words =  @en_tokenizer.each_word('Lots of dogs, lots of cats! This really is the information highway')\n    should_return =[\"lot\", \"dog\", \"lot\", \"cat\", \"realli\" ,\"inform\", \"highway\" ]\n\n    assert_equal should_return, words\n\n  end\n\n  def test_complicated_tokens\n    words = @en_tokenizer.each_word(\"I don't really get what you want to\n      accomplish. There is a class TestEval2, you can do test_eval2 =\n      TestEval2.new afterwards. And: class A ... end always yields nil, so\n      your output is ok I guess ;-)\")\n\n    should_return = [\n      \"realli\", \"want\", \"accomplish\", \"class\",\n      \"testeval2\",  \"test\", \"eval2\",\"testeval2\", \"new\", \"class\", \"end\",\n      \"yield\", \"nil\", \"output\", \"ok\", \"guess\"]\n\n    assert_equal should_return, words\n  end\n\n  def test_unicode\n\n    words = @fr_tokenizer.each_word(\"il s'appelle le vilain petit canard : en référence à Hans Christian Andersen, se démarquer négativement\")\n\n    should_return = [\n      \"appel\", \"vilain\", \"pet\", \"canard\", \"référent\",\n      \"han\", \"christian\", \"andersen\", \"démarqu\", \"négat\"]\n\n    assert_equal should_return, words\n  end\n\nend\n"
  },
  {
    "path": "test/test_002_base.rb",
    "content": "require 'helper'\n\n\nclass Test002Base < TestBase\n  before do\n    @cls = StuffClassifier::Bayes.new(\"Cats or Dogs\")\n    set_classifier @cls\n    \n    train :dog, \"Dogs are awesome, cats too. I love my dog\"\n    train :cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\"    \n    train :dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\"\n    train :cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\"\n    train :dog, \"So which one should you choose? A dog, definitely.\"\n    train :cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\"\n    train :dog, \"A dog will eat anything, including birds or whatever meat\"\n    train :cat, \"My cat's favorite place to purr is on my keyboard\"\n    train :dog, \"My dog's favorite place to take a leak is the tree in front of our house\"\n  end\n\n  def test_count \n    assert @cls.total_cat_count == 9\n    assert @cls.categories.map {|c| @cls.cat_count(c)}.inject(0){|s,count| s+count} == 9\n    \n\n    # compare word count sum to word by cat count sum \n    assert @cls.word_list.map  {|w| @cls.total_word_count(w[0]) }.inject(0)  {|s,count| s+count}  == 58\n    assert @cls.categories.map {|c| @cls.total_word_count_in_cat(c) }.inject(0){|s,count| s+count}  == 58\n\n    # test word count by categories\n    assert @cls.word_list.map {|w| @cls.word_count(w[0],:dog) }.inject(0)  {|s,count| s+count}  == 29\n    assert @cls.word_list.map {|w| @cls.word_count(w[0],:cat) }.inject(0)  {|s,count| s+count}  == 29\n\n    # for all categories\n    assert @cls.categories.map {|c| @cls.word_list.map {|w| @cls.word_count(w[0],c) }.inject(0) {|s,count| s+count} }.inject(0){|s,count| s+count}  == 58\n\n  end\n\nend\n"
  },
  {
    "path": "test/test_003_naive_bayes.rb",
    "content": "require 'helper'\n\n\nclass Test003NaiveBayesClassification < TestBase\n  before do\n    set_classifier StuffClassifier::Bayes.new(\"Cats or Dogs\")\n    \n    train :dog, \"Dogs are awesome, cats too. I love my dog\"\n    train :cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\"    \n    train :dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\"\n    train :cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\"\n    train :dog, \"So which one should you choose? A dog, definitely.\"\n    train :cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\"\n    train :dog, \"A dog will eat anything, including birds or whatever meat\"\n    train :cat, \"My cat's favorite place to purr is on my keyboard\"\n    train :dog, \"My dog's favorite place to take a leak is the tree in front of our house\"\n  end\n\n  def test_for_cats \n    should_be :cat, \"This test is about cats.\"\n    should_be :cat, \"I hate ...\"\n    should_be :cat, \"The most annoying animal on earth.\"\n    should_be :cat, \"The preferred company of software developers.\"\n    should_be :cat, \"My precious, my favorite!\"\n    should_be :cat, \"Kill that bird!\"\n  end\n\n  def test_for_dogs\n    should_be :dog, \"This test is about dogs.\"\n    should_be :dog, \"Cats or Dogs?\" \n    should_be :dog, \"What pet will I love more?\"    \n    should_be :dog, \"Willy, where the heck are you?\"\n    should_be :dog, \"I like big buts and I cannot lie.\" \n    should_be :dog, \"Why is the front door of our house open?\"\n    should_be :dog, \"Who ate my meat?\"\n  end\n\n  def test_min_prob\n    classifier.min_prob = 0.001\n    should_be :cat, \"This test is about cats.\"\n    should_be :cat, \"I hate ...\"\n    should_be nil, \"The most annoying animal on earth.\"\n    should_be nil, \"The preferred company of software developers.\"\n    
should_be :cat, \"My precious, my favorite!\"\n    should_be :cat, \"Kill that bird!\"\n    should_be :dog, \"This test is about dogs.\"\n    should_be :dog, \"Cats or Dogs?\" \n    should_be :dog, \"What pet will I love more?\"    \n    should_be :dog, \"Willy, where the heck are you?\"\n    should_be nil, \"I like big buts and I cannot lie.\" \n    should_be nil, \"Why is the front door of our house open?\"\n    should_be :dog, \"Who ate my meat?\"\n  end\n\n\nend\n"
  },
  {
    "path": "test/test_004_tf_idf.rb",
    "content": "require 'helper'\n\n\nclass Test004TfIdfClassification < TestBase\n  before do\n    set_classifier StuffClassifier::TfIdf.new(\"Cats or Dogs\")\n    \n    train :dog, \"Dogs are awesome, cats too. I love my dog\"\n    train :cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\"    \n    train :dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\"\n    train :cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\"\n    train :dog, \"So which one should you choose? A dog, definitely.\"\n    train :cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\"\n    train :dog, \"A dog will eat anything, including birds or whatever meat\"\n    train :cat, \"My cat's favorite place to purr is on my keyboard\"\n    train :dog, \"My dog's favorite place to take a leak is the tree in front of our house\"\n  end\n\n  def test_for_cats \n    should_be :cat, \"This test is about cats.\"\n    should_be :cat, \"I hate ...\"\n    should_be :cat, \"The most annoying animal on earth.\"\n    should_be :cat, \"The preferred company of software developers.\"\n    should_be :cat, \"My precious, my favorite!\"\n    should_be :cat, \"Kill that bird!\"\n  end\n\n  def test_for_dogs\n    should_be :dog, \"This test is about dogs.\"\n    should_be :dog, \"Cats or Dogs?\" \n    should_be :dog, \"What pet will I love more?\"    \n    should_be :dog, \"Willy, where the heck are you?\"\n    should_be :dog, \"I like big buts and I cannot lie.\" \n    should_be :dog, \"Why is the front door of our house open?\"\n    should_be :dog, \"Who is eating my meat?\"\n  end\nend\n"
  },
  {
    "path": "test/test_005_in_memory_storage.rb",
    "content": "require 'helper'\n\n\nclass Test005InMemoryStorage < TestBase\n  before do\n    StuffClassifier::Base.storage = StuffClassifier::InMemoryStorage.new\n\n    StuffClassifier::Bayes.open(\"Cats or Dogs\") do |cls|    \n      cls.train(:dog, \"Dogs are awesome, cats too. I love my dog\")\n      cls.train(:cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\")\n    end\n  end\n\n  def test_for_persistance\n    test = self\n    StuffClassifier::Bayes.new(\"Cats or Dogs\").instance_eval do\n      test.assert @storage.instance_of?(StuffClassifier::InMemoryStorage),\n        \"@storage should be an instance of FileStorage\"\n      test.assert @word_list.length > 0, \"Word count should be persisted\"\n      test.assert @category_list.length > 0, \"Category count should be persisted\"\n    end\n  end\n\n  def test_purge_state\n    test = self\n    StuffClassifier::Bayes.new(\"Cats or Dogs\", :purge_state => true).instance_eval do\n      test.assert @word_list.length == 0, \"Word count should be purged\"\n      test.assert @category_list.length == 0, \"Category count should be purged\"\n    end\n  end\nend\n"
  },
  {
    "path": "test/test_006_file_storage.rb",
    "content": "require 'helper'\n\n\nclass Test006FileStorage < TestBase\n  before do\n    @storage_path = \"/tmp/test_classifier.db\"\n    @storage = StuffClassifier::FileStorage.new(@storage_path)\n    StuffClassifier::Base.storage = @storage\n\n    StuffClassifier::Bayes.open(\"Cats or Dogs\") do |cls|    \n      cls.train(:dog, \"Dogs are awesome, cats too. I love my dog.\")\n      cls.train(:dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\")\n      cls.train(:dog, \"So which one should you choose? A dog, definitely.\")\n      cls.train(:dog, \"A dog will eat anything, including birds or whatever meat\")\n      cls.train(:dog, \"My dog's favorite place to take a leak is the tree in front of our house\")\n\n      cls.train(:cat, \"My cat's favorite place to purr is on my keyboard\")\n      cls.train(:cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\")\n      cls.train(:cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\")\n      cls.train(:cat, \"Cats are more preferred by software developers. I never could stand cats. I have a dog\")    \n    end\n\n    # redefining storage instance, forcing it to read from file again\n    StuffClassifier::Base.storage = StuffClassifier::FileStorage.new(@storage_path)\n  end\n\n  def teardown\n    File.unlink @storage_path if File.exists? 
@storage_path\n  end\n\n  def test_result    \n    set_classifier StuffClassifier::Bayes.new(\"Cats or Dogs\")\n    \n    should_be :cat, \"This test is about cats.\"\n    should_be :cat, \"I hate ...\"\n    should_be :cat, \"The most annoying animal on earth.\"\n    should_be :cat, \"The preferred company of software developers.\"\n    should_be :cat, \"My precious, my favorite!\"\n    should_be :cat, \"Kill that bird!\"\n\n    should_be :dog, \"This test is about dogs.\"\n    should_be :dog, \"Cats or Dogs?\" \n    should_be :dog, \"What pet will I love more?\"    \n    should_be :dog, \"Willy, where the heck are you?\"\n    should_be :dog, \"I like big buts and I cannot lie.\" \n    should_be :dog, \"Why is the front door of our house open?\"\n    should_be :dog, \"Who ate my meat?\"\n    \n  end\n\n  def test_for_persistance    \n    assert ! @storage.equal?(StuffClassifier::Base.storage),\"Storage instance should not be the same\"\n\n    test = self\n    StuffClassifier::Bayes.new(\"Cats or Dogs\").instance_eval do\n      test.assert @storage.instance_of?(StuffClassifier::FileStorage),\"@storage should be an instance of FileStorage\"\n      test.assert @word_list.length > 0, \"Word count should be persisted\"\n      test.assert @category_list.length > 0, \"Category count should be persisted\"\n    end\n  end\n\n  def test_file_created\n    assert File.exist?(@storage_path), \"File #@storage_path should exist\"\n\n    content = File.read(@storage_path)\n    assert content.length > 100, \"Serialized content should have more than 100 chars\"\n  end\n\n  def test_purge_state\n    test = self\n    StuffClassifier::Bayes.new(\"Cats or Dogs\", :purge_state => true).instance_eval do\n      test.assert @storage.instance_of?(StuffClassifier::FileStorage),\"@storage should be an instance of FileStorage\"\n      test.assert @word_list.length == 0, \"Word count should be purged\"\n      test.assert @category_list.length == 0, \"Category count should be purged\"\n    end\n  
end\nend\n"
  },
  {
    "path": "test/test_007_redis_storage.rb",
    "content": "require 'helper'\nrequire 'redis'\n\n\nclass Test007RedisStorage < TestBase\n  before do\n    @key = \"test_classifier\"\n    @redis_options = { host: 'localhost', port: 6379 }\n    @redis = Redis.new(@redis_options)\n\n    @storage = StuffClassifier::RedisStorage.new(@key, @redis_options)\n    StuffClassifier::Base.storage = @storage\n\n    StuffClassifier::Bayes.open(\"Cats or Dogs\") do |cls|\n      cls.train(:dog, \"Dogs are awesome, cats too. I love my dog.\")\n      cls.train(:dog, \"My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs\")\n      cls.train(:dog, \"So which one should you choose? A dog, definitely.\")\n      cls.train(:dog, \"A dog will eat anything, including birds or whatever meat\")\n      cls.train(:dog, \"My dog's favorite place to take a leak is the tree in front of our house\")\n\n      cls.train(:cat, \"My cat's favorite place to purr is on my keyboard\")\n      cls.train(:cat, \"The favorite food for cats is bird meat, although mice are good, but birds are a delicacy\")\n      cls.train(:cat, \"Cats are difficult animals, unlike dogs, really annoying, I hate them all\")\n      cls.train(:cat, \"Cats are more preferred by software developers. I never could stand cats. 
I have a dog\")\n    end\n\n    # redefining storage instance, forcing it to read from Redis again\n    StuffClassifier::Base.storage = StuffClassifier::RedisStorage.new(@key, @redis_options)\n  end\n\n  def teardown\n    @redis.del(@key)\n  end\n\n  def test_result\n    set_classifier StuffClassifier::Bayes.new(\"Cats or Dogs\")\n\n    should_be :cat, \"This test is about cats.\"\n    should_be :cat, \"I hate ...\"\n    should_be :cat, \"The most annoying animal on earth.\"\n    should_be :cat, \"The preferred company of software developers.\"\n    should_be :cat, \"My precious, my favorite!\"\n    should_be :cat, \"Kill that bird!\"\n\n    should_be :dog, \"This test is about dogs.\"\n    should_be :dog, \"Cats or Dogs?\"\n    should_be :dog, \"What pet will I love more?\"\n    should_be :dog, \"Willy, where the heck are you?\"\n    should_be :dog, \"I like big buts and I cannot lie.\"\n    should_be :dog, \"Why is the front door of our house open?\"\n    should_be :dog, \"Who ate my meat?\"\n\n  end\n\n  def test_for_persistance\n    assert !@storage.equal?(StuffClassifier::Base.storage),\"Storage instance should not be the same\"\n\n    test = self\n    StuffClassifier::Bayes.new(\"Cats or Dogs\").instance_eval do\n      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),\"@storage should be an instance of RedisStorage\"\n      test.assert @word_list.length > 0, \"Word count should be persisted\"\n      test.assert @category_list.length > 0, \"Category count should be persisted\"\n    end\n  end\n\n  def test_key_created\n    assert @redis.exists(@key), \"Redis key #{@key} should exist\"\n\n    content = @redis.get(@key)\n    assert content.length > 100, \"Serialized content should have more than 100 chars\"\n  end\n\n  def test_purge_state\n    test = self\n    StuffClassifier::Bayes.new(\"Cats or Dogs\", :purge_state => true).instance_eval do\n      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),\"@storage should be an instance 
of RedisStorage\"\n      test.assert @word_list.length == 0, \"Word count should be purged\"\n      test.assert @category_list.length == 0, \"Category count should be purged\"\n    end\n  end\nend\n"
  }
]