Repository: alexandru/stuff-classifier
Branch: master
Commit: eceef3207ef0
Files: 22
Total size: 44.4 KB

Directory structure:
gitextract_9kt2n9by/

├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── lib/
│   ├── stuff-classifier/
│   │   ├── base.rb
│   │   ├── bayes.rb
│   │   ├── storage.rb
│   │   ├── tf-idf.rb
│   │   ├── tokenizer/
│   │   │   └── tokenizer_properties.rb
│   │   ├── tokenizer.rb
│   │   └── version.rb
│   └── stuff-classifier.rb
├── stuff-classifier.gemspec
└── test/
    ├── helper.rb
    ├── test_001_tokenizer.rb
    ├── test_002_base.rb
    ├── test_003_naive_bayes.rb
    ├── test_004_tf_idf.rb
    ├── test_005_in_memory_storage.rb
    ├── test_006_file_storage.rb
    └── test_007_redis_storage.rb

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.rvmrc
coverage/
.DS_Store
*.gem
utils.rb
Gemfile.lock


================================================
FILE: Gemfile
================================================
source "https://rubygems.org"

gemspec


================================================
FILE: LICENSE.txt
================================================
Copyright (c) 2012 Alexandru Nedelcu

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


================================================
FILE: README.md
================================================
# stuff-classifier

## No longer maintained

This repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that a link to it can be placed here.

## Description

A library for classifying text into multiple categories.

Currently provided classifiers:

- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)

I ran a benchmark on 1345 items that I had previously classified by
hand into multiple categories. Here is the rate at which the 2
algorithms correctly detected one of those categories:

- Bayes: 79.26%
- Tf-Idf: 81.34%

I prefer the Naive Bayes approach, because while having lower stats on
this benchmark, it seems to make better decisions than I did in many
cases. For example, an item with title *"Paintball Session, 100 Balls
and Equipment"* was classified as *"Activities"* by me, but the bayes
classifier identified it as *"Sports"*, at which point I had an
intellectual orgasm. Also, the Tf-Idf classifier seems to do better on
clear-cut cases, but doesn't seem to handle uncertainty so well. Of
course, these are just quick tests I made and I have no idea which is
really better.
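
For intuition about what the two classifiers compute per word, here is
a minimal sketch in plain Ruby. It mirrors the formulas in
`lib/stuff-classifier/bayes.rb` and `lib/stuff-classifier/tf-idf.rb`,
but the counts and the helper names (`times_seen`, `bayes_word_score`,
`tfidf_word_score`) are made up for illustration and are not part of
the gem's API:

```ruby
# Hypothetical training counts, made up for this sketch.
COUNTS = {
  "dog"  => { :dog => 4, :cat => 1 },  # times each word was seen per category
  "purr" => { :cat => 2 }
}
WORDS_PER_CATEGORY = { :dog => 10, :cat => 8 }  # total words seen per category
DOCS_PER_CATEGORY  = { :dog => 5,  :cat => 4 }  # training documents per category

# How often a word was seen in a category (0.0 if never).
def times_seen(word, cat)
  (COUNTS[word] || {})[cat].to_f
end

# Naive Bayes: blend the observed probability with an assumed prior,
# so that rare or unseen words do not force the score to zero.
def bayes_word_score(word, cat, weight = 1.0, assumed_prob = 0.1)
  basic  = times_seen(word, cat) / WORDS_PER_CATEGORY[cat]
  totals = (COUNTS[word] || {}).values.inject(0) { |sum, n| sum + n }
  (weight * assumed_prob + totals * basic) / (weight + totals)
end

# Tf-Idf: frequency of the word within the category, scaled down when
# the word shows up in many categories.
def tfidf_word_score(word, cat)
  tf  = times_seen(word, cat) / DOCS_PER_CATEGORY[cat]
  idf = Math.log10((DOCS_PER_CATEGORY.size + 2) / ((COUNTS[word] || {}).size + 1.0))
  tf * idf
end

puts bayes_word_score("dog", :dog)    # about 0.35
puts bayes_word_score("zebra", :dog)  # unseen word: falls back to assumed_prob
puts tfidf_word_score("purr", :cat)
```

The Bayes classifier multiplies these per-word scores together with
the category's share of training documents, while the Tf-Idf
classifier sums them.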

## Install

```bash
gem install stuff-classifier
```

## Usage

You either instantiate one class or the other. Both have the same
signature:

```ruby
require 'stuff-classifier'

# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")

# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")

# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)

# also by default, the parsing phase filters out stop words; to
# disable this or to supply your own list of stop words, set it on
# the tokenizer of a classifier instance:
cls.tokenizer.ignore_words = [ 'the', 'my', 'i', 'dont' ]
```
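
To illustrate what stop-word filtering does to a sentence, here is a
deliberately simplified sketch. It does not use the gem's tokenizer
(which relies on `Rseg` and `Lingua::Stemmer`); the `simple_tokens`
helper and its regular expression are assumptions made for this
example only:

```ruby
require 'set'

STOP_WORDS = Set.new(%w[the my i dont])

# Strip apostrophes and backticks, lowercase, split into alphabetic
# runs, then drop every word found in the stop list.
def simple_tokens(text, stop_words = STOP_WORDS)
  text.delete("'`").downcase.scan(/\p{Alpha}+/).reject { |w| stop_words.member?(w) }
end

p simple_tokens("I don't like the neighbour's dog")
```

Note that "don't" becomes "dont" before the lookup, which is why the
stop list contains the word without the apostrophe.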

Training the classifier:

```ruby
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
```

And finally, classifying stuff:

```ruby
cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat

cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?") 
#=> :dog
cls.classify("What pet will I love more?")    
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.") 
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
```
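
`classify` also accepts a default category as a second argument, and
the constructor accepts `:thresholds` and `:min_prob` options (see
`lib/stuff-classifier/base.rb`). The sketch below reproduces that
decision rule in plain Ruby on a ready-made table of scores; the
function name `pick_category` and the sample scores are hypothetical:

```ruby
# Pick the best-scoring category, or fall back to `default` when no
# score beats min_prob or when a runner-up is too close to the winner.
def pick_category(scores, thresholds = {}, min_prob = 0.0, default = nil)
  best = nil
  max_prob = min_prob

  scores.each do |cat, prob|
    if prob > max_prob
      max_prob = prob
      best = cat
    end
  end
  return default unless best

  # The winner must beat every other score multiplied by its threshold.
  threshold = thresholds[best] || 1.0
  scores.each do |cat, prob|
    next if cat == best
    return default if prob * threshold > max_prob
  end

  best
end

thresholds = { :spam => 1.2 }
p pick_category({ :spam => 0.73, :ham => 0.40 }, thresholds, 0.0, :unknown)
p pick_category({ :spam => 0.80, :ham => 0.70 }, thresholds, 0.0, :unknown)
```

In the first call `:spam` wins clearly; in the second, `:ham` is too
close to `:spam` for a threshold of 1.2, so the default is returned.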

## Persistency

The following layers for saving the training data between sessions are
implemented:

- in memory (by default)
- on disk
- Redis
- (coming soon) in an RDBMS

To persist the data in Redis, you can do this:

```ruby
# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)

# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})
```

To persist the data on disk, you can do this:

```ruby
store = StuffClassifier::FileStorage.new(@storage_path)

# global setting
StuffClassifier::Base.storage = store

# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)

# after training is done, to persist the data ...
cls.save_state

# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
  # when done, save_state is called on END
end

# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
```

The name you give your classifier is important, because the data is
loaded and saved under that name. For instance, the following 3
classifiers are stored in different buckets and are independent of
each other.

```ruby
cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")	
```
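
As a rough picture of how name-keyed buckets can share one store, the
sketch below dumps a hash keyed by classifier name with `Marshal` and
reads it back, which is the same serialization the file storage in
`lib/stuff-classifier/storage.rb` uses. The bucket contents and the
temporary file are made up for illustration:

```ruby
require 'tempfile'

# One hash for the whole store; each classifier name owns one bucket.
storage = {
  "Cats or Dogs" => { :training_count => 9 },
  "Spam or Ham"  => { :training_count => 0 }
}

# Write the serialized store to a temporary file in binary mode.
file = Tempfile.new('stuff-classifier-sketch')
file.binmode
file.write(Marshal.dump(storage))
file.close

# Read it back; buckets stay separate because they are keyed by name.
restored = Marshal.load(File.open(file.path, 'rb') { |f| f.read })
file.unlink

p restored["Cats or Dogs"]
p restored["Spam or Ham"]
```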

## License

MIT Licensed. See LICENSE.txt for details.


================================================
FILE: Rakefile
================================================
require 'bundler/setup'
require 'rake/testtask'
require 'stuff-classifier'

Rake::TestTask.new(:test) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/**/test_*.rb'
  test.verbose = true
end

task :default => :test


================================================
FILE: lib/stuff-classifier/base.rb
================================================
# -*- encoding : utf-8 -*-

class StuffClassifier::Base
  extend StuffClassifier::Storage::ActAsStorable
  attr_reader :name
  attr_reader :word_list
  attr_reader :category_list
  attr_reader :training_count

  attr_accessor :tokenizer
  attr_accessor :language

  attr_accessor :thresholds
  attr_accessor :min_prob


  storable :version,:word_list,:category_list,:training_count,:thresholds,:min_prob

  # opts :
  # language
  # stemming : true | false
  # weight
  # assumed_prob
  # storage
  # purge_state ?

  def initialize(name, opts={})
    @version = StuffClassifier::VERSION

    @name = name

    # These values start empty or are loaded from storage
    @word_list = {}
    @category_list = {}
    @training_count=0

    # storage
    purge_state = opts[:purge_state]
    @storage = opts[:storage] || StuffClassifier::Base.storage
    unless purge_state
      @storage.load_state(self)
    else
      @storage.purge_state(self)
    end

    # These values can be set during initialization; otherwise the values
    # restored by load_state (if any) are kept, falling back to the defaults
    @thresholds = opts[:thresholds] || @thresholds || {}
    @min_prob = opts[:min_prob] || @min_prob || 0.0


    @ignore_words = nil
    @tokenizer = StuffClassifier::Tokenizer.new(opts)

  end

  def incr_word(word, category)
    @word_list[word] ||= {}

    @word_list[word][:categories] ||= {}
    @word_list[word][:categories][category] ||= 0
    @word_list[word][:categories][category] += 1

    @word_list[word][:_total_word] ||= 0
    @word_list[word][:_total_word] += 1


    # word count by category
    @category_list[category] ||= {}
    @category_list[category][:_total_word] ||= 0
    @category_list[category][:_total_word] += 1

  end

  def incr_cat(category)
    @category_list[category] ||= {}
    @category_list[category][:_count] ||= 0
    @category_list[category][:_count] += 1

    @training_count ||= 0
    @training_count += 1

  end

  # return number of times the word appears in a category
  def word_count(word, category)
    return 0.0 unless @word_list[word] && @word_list[word][:categories] && @word_list[word][:categories][category]
    @word_list[word][:categories][category].to_f
  end

  # return the number of times the word appears in all categories
  def total_word_count(word)
    return 0.0 unless @word_list[word] && @word_list[word][:_total_word]
    @word_list[word][:_total_word].to_f
  end

  # return the number of words in a category
  def total_word_count_in_cat(cat)
    return 0.0 unless @category_list[cat] && @category_list[cat][:_total_word]
    @category_list[cat][:_total_word].to_f
  end

  # return the number of training items
  def total_cat_count
    @training_count
  end

  # return the number of training documents for a category
  def cat_count(category)
    @category_list[category][:_count] ? @category_list[category][:_count].to_f : 0.0
  end

  # return the number of categories in which a word appears
  def categories_with_word_count(word)
    return 0 unless @word_list[word] && @word_list[word][:categories]
    @word_list[word][:categories].length
  end

  # return the number of categories
  def total_categories
    categories.length
  end

  # return categories list
  def categories
    @category_list.keys
  end

  # train the classifier
  def train(category, text)
    @tokenizer.each_word(text) {|w| incr_word(w, category) }
    incr_cat(category)
  end

  # classify a text
  def classify(text, default=nil)
    # Find the category with the highest probability
    max_prob = @min_prob
    best = nil

    scores = cat_scores(text)
    scores.each do |score|
      cat, prob = score
      if prob > max_prob
        max_prob = prob
        best = cat
      end
    end

    # Return the default category in case the threshold condition was
    # not met. For example, if the threshold for :spam is 1.2
    #
    #    :spam => 0.73, :ham => 0.40  (OK)
    #    :spam => 0.80, :ham => 0.70  (Fail, :ham is too close)

    return default unless best

    threshold = @thresholds[best] || 1.0

    scores.each do |score|
      cat, prob = score
      next if cat == best
      return default if prob * threshold > max_prob
    end

    return best
  end

  def save_state
    @storage.save_state(self)
  end

  class << self
    attr_writer :storage

    def storage
      @storage = StuffClassifier::InMemoryStorage.new unless defined? @storage
      @storage
    end

    def open(name)
      inst = self.new(name)
      if block_given?
        yield inst
        inst.save_state
      else
        inst
      end
    end
  end
end


================================================
FILE: lib/stuff-classifier/bayes.rb
================================================
# -*- encoding : utf-8 -*-

class StuffClassifier::Bayes < StuffClassifier::Base
  attr_accessor :weight
  attr_accessor :assumed_prob


  # http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  extend StuffClassifier::Storage::ActAsStorable
  storable :weight,:assumed_prob

  def initialize(name, opts={})
    super(name, opts)
    # explicit options win; otherwise keep values restored from storage
    @weight = opts[:weight] || @weight || 1.0
    @assumed_prob = opts[:assumed_prob] || @assumed_prob || 0.1
  end

  def word_prob(word, cat)
    total_words_in_cat = total_word_count_in_cat(cat)
    return 0.0 if total_words_in_cat == 0
    word_count(word, cat).to_f / total_words_in_cat
  end


  def word_weighted_average(word, cat, opts={})
    func = opts[:func]

    # calculate current probability
    basic_prob = func ? func.call(word, cat) : word_prob(word, cat)

    # count the number of times this word has appeared in all
    # categories
    totals = total_word_count(word)

    # the final weighted average
    (@weight * @assumed_prob + totals * basic_prob) / (@weight + totals)
  end

  def doc_prob(text, category)
    @tokenizer.each_word(text).map {|w|
      word_weighted_average(w, category)
    }.inject(1) {|p,c| p * c}
  end

  def text_prob(text, category)
    cat_prob = cat_count(category) / total_cat_count
    doc_prob = doc_prob(text, category)
    cat_prob * doc_prob
  end

  def cat_scores(text)
    probs = {}
    categories.each do |cat|
      probs[cat] = text_prob(text, cat)
    end
    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
  end


  def word_classification_detail(word)

    p "word_prob"
    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
    p result

    p "word_weighted_average"
    result=categories.inject({}) do |h,cat| h[cat]=word_weighted_average(word,cat);h end
    p result

    p "doc_prob"
    result=categories.inject({}) do |h,cat| h[cat]=doc_prob(word,cat);h end
    p result

    p "text_prob"
    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
    p result


  end

end


================================================
FILE: lib/stuff-classifier/storage.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier

  class Storage
    module ActAsStorable
        def storable(*to_store)
          @to_store = to_store
        end
        def to_store
          @to_store || []
        end
    end

    attr_accessor :storage

    def initialize(*opts)
      @storage = {}
    end

    def storage_to_classifier(classifier)
      if @storage.key? classifier.name
        @storage[classifier.name].each do |var,value|
          classifier.instance_variable_set "@#{var}",value
        end
      end
    end

    def classifier_to_storage(classifier)
      to_store = classifier.class.to_store + classifier.class.superclass.to_store
      @storage[classifier.name] =  to_store.inject({}) {|h,var| h[var] = classifier.instance_variable_get("@#{var}");h}
    end

    def clear_storage(classifier)
      @storage.delete(classifier.name)
    end

  end

  class InMemoryStorage < Storage
    def initialize
      super
    end

    def load_state(classifier)
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
    end

    def purge_state(classifier)
      clear_storage(classifier)
    end

  end

  class FileStorage < Storage
    def initialize(path)
      super
      @path = path
    end

    def load_state(classifier)
      if @storage.length == 0 && File.exist?(@path)
        data = File.open(@path, 'rb') { |f| f.read }
        @storage = Marshal.load(data)
      end
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
      _write_to_file
    end

    def purge_state(classifier)
      clear_storage(classifier)
      _write_to_file
    end

    def _write_to_file
      File.open(@path, 'wb') do |fh|
        fh.flock(File::LOCK_EX)
        fh.write(Marshal.dump(@storage))
      end
    end

  end

  class RedisStorage < Storage
    def initialize(key, redis_options=nil)
      super
      require 'redis' # loaded lazily, only when Redis storage is used
      @key = key
      @redis = Redis.new(redis_options || {})
    end

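    def load_state(classifier)
      if @storage.length == 0
        # GET returns nil for a missing key, so there is no need to rely
        # on the return value of EXISTS, which varies between redis gem versions
        data = @redis.get(@key)
        @storage = Marshal.load(data) if data
      end
      storage_to_classifier(classifier)
    end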

    def save_state(classifier)
      classifier_to_storage(classifier)
      _write_to_redis
    end

    def purge_state(classifier)
      clear_storage(classifier)
      _write_to_redis
    end

    private
    def _write_to_redis
      data = Marshal.dump(@storage)
      @redis.set(@key, data)
    end
  end
end


================================================
FILE: lib/stuff-classifier/tf-idf.rb
================================================
# -*- encoding : utf-8 -*-
class StuffClassifier::TfIdf < StuffClassifier::Base
  extend StuffClassifier::Storage::ActAsStorable

  def initialize(name, opts={})
    super(name, opts)
  end


  def word_prob(word, cat)
    word_cat_nr = word_count(word, cat)
    cat_nr = cat_count(cat)

    tf = 1.0 * word_cat_nr / cat_nr

    idf = Math.log10((total_categories + 2) / (categories_with_word_count(word) + 1.0))
    tf * idf
  end

  def text_prob(text, cat)
    @tokenizer.each_word(text).map{|w| word_prob(w, cat)}.inject(0){|s,p| s + p}
  end

  def cat_scores(text)
    probs = {}
    categories.each do |cat|
      p = text_prob(text, cat)
      probs[cat] = p
    end
    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
  end

  def word_classification_detail(word)

    p "tf_idf"
    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
    p result

    p "text_prob"
    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
    p result

  end

end


================================================
FILE: lib/stuff-classifier/tokenizer/tokenizer_properties.rb
================================================
# -*- encoding : utf-8 -*-
require 'set'
StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES = {
  "en" => {
    :preprocessing_regexps => {/['`]/ => '',/[_]/ => ' '},
    :stop_word => Set.new([
                            '的','个','得',
                            'a', 'about', 'above', 'across', 'after', 'afterwards',
                            'again', 'against', 'all', 'almost', 'alone', 'along',
                            'already', 'also', 'although', 'always', 'am', 'among',
                            'amongst', 'amoungst', 'amount', 'an', 'and', 'another',
                            'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere',
                            'are', 'around', 'as', 'at', 'back', 'be',
                            'became', 'because', 'become', 'becomes', 'becoming', 'been',
                            'before', 'beforehand', 'behind', 'being', 'below', 'beside',
                            'besides', 'between', 'beyond', 'bill', 'both', 'bottom',
                            'but', 'by', 'call', 'can', 'cannot', 'cant', 'dont',
                            'co', 'computer', 'con', 'could', 'couldnt', 'cry',
                            'de', 'describe', 'detail', 'do', 'done', 'down',
                            'due', 'during', 'each', 'eg', 'eight', 'either',
                            'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every',
                            'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen',
                            'fifty', 'fill', 'find', 'fire', 'first', 'five',
                            'for', 'former', 'formerly', 'forty', 'found', 'four',
                            'from', 'front', 'full', 'further', 'get', 'give',
                            'go', 'had', 'has', 'hasnt', 'have', 'he',
                            'hence', 'her', 'here', 'hereafter', 'hereby', 'herein',
                            'hereupon', 'hers', 'herself', 'him', 'himself', 'his',
                            'how', 'however', 'hundred', 'i', 'ie', 'if',
                            'in', 'inc', 'indeed', 'interest', 'into', 'is',
                            'it', 'its', 'itself', 'keep', 'last', 'latter',
                            'latterly', 'least', 'less', 'ltd', 'made', 'many',
                            'may', 'me', 'meanwhile', 'might', 'mill', 'mine',
                            'more', 'moreover', 'most', 'mostly', 'move', 'much',
                            'must', 'my', 'myself', 'name', 'namely', 'neither',
                            'never', 'nevertheless', 'next', 'nine', 'no', 'nobody',
                            'none', 'noone', 'nor', 'not', 'nothing', 'now',
                            'nowhere', 'of', 'off', 'often', 'on', 'once',
                            'one', 'only', 'onto', 'or', 'other', 'others',
                            'otherwise', 'our', 'ours', 'ourselves', 'out', 'over',
                            'own', 'part', 'per', 'perhaps', 'please', 'put',
                            'rather', 're', 'same', 'see', 'seem', 'seemed',
                            'seeming', 'seems', 'serious', 'several', 'she', 'should',
                            'show', 'side', 'since', 'sincere', 'six', 'sixty',
                            'so', 'some', 'somehow', 'someone', 'something', 'sometime',
                            'sometimes', 'somewhere', 'still', 'such', 'system', 'take',
                            'ten', 'than', 'that', 'the', 'their', 'them',
                            'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
                            'therefore', 'therein', 'thereupon', 'these', 'they', 'thick',
                            'thin', 'third', 'this', 'those', 'though', 'three',
                            'through', 'throughout', 'thru', 'thus', 'to', 'together',
                            'too', 'top', 'toward', 'towards', 'twelve', 'twenty',
                            'two', 'un', 'under', 'until', 'up', 'upon',
                            'us', 'very', 'via', 'was', 'we', 'well',
                            'were', 'what', 'whatever', 'when', 'whence', 'whenever',
                            'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
                            'wherever', 'whether', 'which', 'while', 'whither', 'who',
                            'whoever', 'whole', 'whom', 'whose', 'why', 'will',
                            'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours',
                            'yourself', 'yourselves'
    ])
  },
  "fr" => {
    :stop_word => Set.new([
                            'au',  'aux',  'avec',  'ce',  'ces',  'dans',  'de',  'des',  'du',  'elle',  'en',  'et',  'eux',
                            'il',  'je',  'la',  'le',  'leur',  'lui',  'ma',  'mais',  'me',  'même',  'mes',  'moi',  'mon',
                            'ne',  'nos',  'notre',  'nous',  'on',  'ou',  'par',  'pas',  'pour',  'qu',  'que',  'qui',  'sa',
                            'se',  'ses',  'son',  'sur',  'ta',  'te',  'tes',  'toi',  'ton',  'tu',  'un',  'une',  'vos',  'votre',
                            'vous',  'c',  'd',  'j',  'l',  'à',  'm',  'n',  's',  't',  'y',  'été',  'étée',  'étées',
                            'étés',  'étant',  'suis',  'es',  'est',  'sommes',  'êtes',  'sont',  'serai',  'seras',
                            'sera',  'serons',  'serez',  'seront',  'serais',  'serait',  'serions',  'seriez',  'seraient',
                            'étais',  'était',  'étions',  'étiez',  'étaient',  'fus',  'fut',  'fûmes',  'fûtes',
                            'furent',  'sois',  'soit',  'soyons',  'soyez',  'soient',  'fusse',  'fusses',  'fût',
                            'fussions',  'fussiez',  'fussent',  'ayant',  'eu',  'eue',  'eues',  'eus',  'ai',  'as',
                            'avons',  'avez',  'ont',  'aurai',  'auras',  'aura',  'aurons',  'aurez',  'auront',  'aurais',
                            'aurait',  'aurions',  'auriez',  'auraient',  'avais',  'avait',  'avions',  'aviez',  'avaient',
                            'eut',  'eûmes',  'eûtes',  'eurent',  'aie',  'aies',  'ait',  'ayons',  'ayez',  'aient',  'eusse',
                            'eusses',  'eût',  'eussions',  'eussiez',  'eussent',  'ceci',  'celà',  'cet',  'cette',  'ici',
                            'ils',  'les',  'leurs',  'quel',  'quels',  'quelle',  'quelles',  'sans',  'soi'
    ])
  },
  "de" => {
    :stop_word => Set.new([
                            'aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere',
                            'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf',
                            'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das',
                            'daß', 'dass', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe',
                            'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du',
                            'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen',
                            'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es',
                            'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat',
                            'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres',
                            'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener',
                            'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche',
                            'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach',
                            'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst',
                            'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über',
                            'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst',
                            'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will',
                            'wir', 'wird', 'wirst', 'wo', 'wollen', 'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen'
    ])
  }
}


================================================
FILE: lib/stuff-classifier/tokenizer.rb
================================================
# -*- encoding : utf-8 -*-
require "lingua/stemmer"
require "rseg"

class StuffClassifier::Tokenizer
  require  "stuff-classifier/tokenizer/tokenizer_properties"

  def initialize(opts={})
    @language = opts.key?(:language) ? opts[:language] : "en"
    @properties = StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES[@language]

    @stemming = opts.key?(:stemming) ? opts[:stemming] : true
    if @stemming
      @stemmer = Lingua::Stemmer.new(:language => @language)
    end
  end

  def language
    @language
  end

  def preprocessing_regexps=(value)
    @preprocessing_regexps = value
  end

  def preprocessing_regexps
    @preprocessing_regexps || @properties[:preprocessing_regexps]
  end

  def ignore_words=(value)
    @ignore_words = value
  end

  def ignore_words
    @ignore_words || @properties[:stop_word]
  end

  def stemming?
    @stemming || false
  end

  def each_word(string)
    string = string.strip
    return [] if string == '' # always return an Array so callers can chain .map

    words = []

    # tokenize string
    string.split("\n").each do |line|

      # Apply preprocessing regexps
      if preprocessing_regexps
        preprocessing_regexps.each { |regexp,replace_by| line.gsub!(regexp, replace_by) }
      end

      Rseg.segment(line).each do |w|
        next if w == '' || ignore_words.member?(w.downcase)

        if stemming? and stemable?(w)
          w = @stemmer.stem(w).downcase
          next if ignore_words.member?(w)
        else
          w = w.downcase
        end

        words << (block_given? ? (yield w) : w)
      end
    end

    return words
  end

  private

  # only purely alphabetic words are passed to the stemmer
  def stemable?(word)
    word =~ /^\p{Alpha}+$/
  end

end


================================================
FILE: lib/stuff-classifier/version.rb
================================================
module StuffClassifier
  VERSION = '0.5'
end


================================================
FILE: lib/stuff-classifier.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier
  autoload :VERSION,    'stuff-classifier/version'

  autoload :Storage, 'stuff-classifier/storage'
  autoload :InMemoryStorage, 'stuff-classifier/storage'
  autoload :FileStorage,     'stuff-classifier/storage'
  autoload :RedisStorage, 'stuff-classifier/storage'

  autoload :Tokenizer,  'stuff-classifier/tokenizer'
  autoload :TOKENIZER_PROPERTIES, 'stuff-classifier/tokenizer/tokenizer_properties'

  autoload :Base,       'stuff-classifier/base'
  autoload :Bayes,      'stuff-classifier/bayes'
  autoload :TfIdf,      'stuff-classifier/tf-idf'

end


================================================
FILE: stuff-classifier.gemspec
================================================
# -*- encoding: utf-8 -*-
$:.push File.expand_path("../lib", __FILE__)
require "stuff-classifier/version"

Gem::Specification.new do |s|
  s.name        = "stuff-classifier"
  s.version     = StuffClassifier::VERSION
  s.authors     = ["Alexandru Nedelcu"]
  s.email       = ["github@contact.bionicspirit.com"]
  s.homepage    = "https://github.com/alexandru/stuff-classifier/"
  s.summary     = %q{Simple text classifier(s) implementation}
  s.description = %q{2 methods are provided for now - (1) naive bayes implementation + (2) tf-idf weights}

  s.files         = `git ls-files`.split("\n")
  s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
  s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
  s.require_paths = ["lib"]

  s.required_ruby_version = '>= 1.9.1'

  s.add_runtime_dependency "ruby-stemmer"
  s.add_runtime_dependency "rseg" # required at load time by lib/stuff-classifier/tokenizer.rb
  s.add_runtime_dependency "sequel"
  s.add_runtime_dependency "redis"


  s.add_development_dependency "bundler"
  s.add_development_dependency "rake", ">= 0.9.2"
  s.add_development_dependency "minitest", "~> 4"
  s.add_development_dependency "turn", ">= 0.8.3"
  s.add_development_dependency "simplecov"
  s.add_development_dependency "awesome_print"
  s.add_development_dependency "ruby-debug19"

end


================================================
FILE: test/helper.rb
================================================
# -*- encoding : utf-8 -*-
require 'simplecov'
SimpleCov.start

require 'turn'
require 'minitest/autorun'
require 'stuff-classifier'

Turn.config do |c|
 # use one of output formats:
 # :outline  - turn's original case/test outline mode [default]
 # :progress - indicates progress with progress bar
 # :dotted   - test/unit's traditional dot-progress mode
 # :pretty   - new pretty reporter
 # :marshal  - dump output as YAML (normal run mode only)
 # :cue      - interactive testing
 c.format  = :cue
 # turn on invoke/execute tracing, enable full backtrace
 c.trace   = true
 # use humanized test names (works only with :outline format)
 c.natural = true
end

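class TestBase < MiniTest::Unit::TestCase
  # Class-level hook: with a block, registers it as this test class's
  # setup code; without one, returns the registered block (or nil).
  def self.before(&block)
    @on_setup = block if block
    @on_setup
  end

  # Runs the block registered via `before` in the context of the test
  # instance, so its instance variables and helpers are available.
  def setup
    on_setup = self.class.before
    instance_eval(&on_setup) if on_setup
  end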

  def set_classifier(instance)
    @classifier = instance
  end
  def classifier
    @classifier
  end


  def train(category, value)
    @classifier.train(category, value)
  end

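  # Asserts that the current classifier assigns `value` to `category`;
  # the classified text doubles as the failure message.
  def should_be(category, value)
    assert_equal category, @classifier.classify(value), value
  end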
end


================================================
FILE: test/test_001_tokenizer.rb
================================================
# -*- coding: utf-8 -*-
require 'helper'

class Test001Tokenizer < TestBase
  before do
    @en_tokenizer = StuffClassifier::Tokenizer.new
    @fr_tokenizer = StuffClassifier::Tokenizer.new(:language => "fr")
  end

  def test_simple_tokens
    words = @en_tokenizer.each_word('Hello world! How are you?')
    should_return = ["hello", "world"]

    assert_equal should_return, words
  end

  def test_with_stemming
    words = @en_tokenizer.each_word('Lots of dogs, lots of cats! This really is the information highway')
    should_return = ["lot", "dog", "lot", "cat", "realli", "inform", "highway"]

    assert_equal should_return, words
  end

  def test_complicated_tokens
    words = @en_tokenizer.each_word("I don't really get what you want to
      accomplish. There is a class TestEval2, you can do test_eval2 =
      TestEval2.new afterwards. And: class A ... end always yields nil, so
      your output is ok I guess ;-)")

    should_return = [
      "realli", "want", "accomplish", "class",
      "testeval2", "test", "eval2", "testeval2", "new", "class", "end",
      "yield", "nil", "output", "ok", "guess"]

    assert_equal should_return, words
  end

  def test_unicode

    words = @fr_tokenizer.each_word("il s'appelle le vilain petit canard : en référence à Hans Christian Andersen, se démarquer négativement")

    should_return = [
      "appel", "vilain", "pet", "canard", "référent",
      "han", "christian", "andersen", "démarqu", "négat"]

    assert_equal should_return, words
  end

end


================================================
FILE: test/test_002_base.rb
================================================
require 'helper'


class Test002Base < TestBase
  before do
    @cls = StuffClassifier::Bayes.new("Cats or Dogs")
    set_classifier @cls
    
    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"    
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_count
    assert_equal 9, @cls.total_cat_count
    assert_equal 9, @cls.categories.map { |c| @cls.cat_count(c) }.inject(0, :+)

    # the per-word totals and the per-category totals must add up to the same sum
    assert_equal 58, @cls.word_list.map { |w| @cls.total_word_count(w[0]) }.inject(0, :+)
    assert_equal 58, @cls.categories.map { |c| @cls.total_word_count_in_cat(c) }.inject(0, :+)

    # word counts within each category
    assert_equal 29, @cls.word_list.map { |w| @cls.word_count(w[0], :dog) }.inject(0, :+)
    assert_equal 29, @cls.word_list.map { |w| @cls.word_count(w[0], :cat) }.inject(0, :+)

    # word counts summed over all categories
    assert_equal 58, @cls.categories.map { |c| @cls.word_list.map { |w| @cls.word_count(w[0], c) }.inject(0, :+) }.inject(0, :+)
  end

end


================================================
FILE: test/test_003_naive_bayes.rb
================================================
require 'helper'


class Test003NaiveBayesClassification < TestBase
  before do
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
    
    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"    
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_for_cats 
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
  end

  def test_for_dogs
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie." 
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end

  def test_min_prob
    classifier.min_prob = 0.001
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be nil, "The most annoying animal on earth."
    should_be nil, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be nil, "I like big buts and I cannot lie." 
    should_be nil, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end


end


================================================
FILE: test/test_004_tf_idf.rb
================================================
require 'helper'


class Test004TfIdfClassification < TestBase
  before do
    set_classifier StuffClassifier::TfIdf.new("Cats or Dogs")
    
    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"    
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_for_cats 
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
  end

  def test_for_dogs
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie." 
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who is eating my meat?"
  end
end


================================================
FILE: test/test_005_in_memory_storage.rb
================================================
require 'helper'


class Test005InMemoryStorage < TestBase
  before do
    StuffClassifier::Base.storage = StuffClassifier::InMemoryStorage.new

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|    
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
    end
  end

  def test_for_persistence
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::InMemoryStorage),
        "@storage should be an instance of InMemoryStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end


================================================
FILE: test/test_006_file_storage.rb
================================================
require 'helper'


class Test006FileStorage < TestBase
  before do
    @storage_path = "/tmp/test_classifier.db"
    @storage = StuffClassifier::FileStorage.new(@storage_path)
    StuffClassifier::Base.storage = @storage

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|    
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
      cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
      cls.train(:dog, "So which one should you choose? A dog, definitely.")
      cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
      cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

      cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
      cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
      cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
    end

    # redefining storage instance, forcing it to read from file again
    StuffClassifier::Base.storage = StuffClassifier::FileStorage.new(@storage_path)
  end

  def teardown
    File.unlink @storage_path if File.exist? @storage_path
  end

  def test_result    
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
    
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"

    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie." 
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
    
  end

  def test_for_persistence
    assert ! @storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"

    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_file_created
    assert File.exist?(@storage_path), "File #@storage_path should exist"

    content = File.read(@storage_path)
    assert content.length > 100, "Serialized content should have more than 100 chars"
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end


================================================
FILE: test/test_007_redis_storage.rb
================================================
require 'helper'
require 'redis'


class Test007RedisStorage < TestBase
  before do
    @key = "test_classifier"
    @redis_options = { host: 'localhost', port: 6379 }
    @redis = Redis.new(@redis_options)

    @storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
    StuffClassifier::Base.storage = @storage

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
      cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
      cls.train(:dog, "So which one should you choose? A dog, definitely.")
      cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
      cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

      cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
      cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
      cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
    end

    # redefining storage instance, forcing it to read from Redis again
    StuffClassifier::Base.storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
  end

  def teardown
    @redis.del(@key)
  end

  def test_result
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")

    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"

    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie."
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"

  end

  def test_for_persistence
    assert !@storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"

    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_key_created
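    # `exists?` returns a boolean; on recent redis-rb versions `exists`
    # returns an Integer count, and 0 is truthy in Ruby.
    assert @redis.exists?(@key), "Redis key #{@key} should exist"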

    content = @redis.get(@key)
    assert content.length > 100, "Serialized content should have more than 100 chars"
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end