Repository: alexandru/stuff-classifier
Branch: master
Commit: eceef3207ef0
Files: 22
Total size: 44.4 KB

Directory structure:
gitextract_9kt2n9by/

├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── lib/
│   ├── stuff-classifier/
│   │   ├── base.rb
│   │   ├── bayes.rb
│   │   ├── storage.rb
│   │   ├── tf-idf.rb
│   │   ├── tokenizer/
│   │   │   └── tokenizer_properties.rb
│   │   ├── tokenizer.rb
│   │   └── version.rb
│   └── stuff-classifier.rb
├── stuff-classifier.gemspec
└── test/
    ├── helper.rb
    ├── test_001_tokenizer.rb
    ├── test_002_base.rb
    ├── test_003_naive_bayes.rb
    ├── test_004_tf_idf.rb
    ├── test_005_in_memory_storage.rb
    ├── test_006_file_storage.rb
    └── test_007_redis_storage.rb

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.rvmrc
coverage/
.DS_Store
*.gem
utils.rb
Gemfile.lock


================================================
FILE: Gemfile
================================================
source "https://rubygems.org"

gemspec


================================================
FILE: LICENSE.txt
================================================
Copyright (c) 2012 Alexandru Nedelcu

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


================================================
FILE: README.md
================================================
# stuff-classifier

## No longer maintained

This repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that I can place a link here.

## Description

A library for classifying text into multiple categories.

Currently provided classifiers:

- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
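
As a rough illustration of the naive bayes variant, the sketch below reproduces the weighted-average smoothing from `word_weighted_average` in `lib/stuff-classifier/bayes.rb`, using the library defaults `weight = 1.0` and `assumed_prob = 0.1`. The `weighted_average` helper and the sample counts are hypothetical and exist only for this example.

```ruby
# Standalone sketch; does not require the gem.
# basic_prob: share of the category's words that are this word
# totals:     number of times the word was seen across all categories
def weighted_average(basic_prob, totals, weight = 1.0, assumed_prob = 0.1)
  (weight * assumed_prob + totals * basic_prob) / (weight + totals)
end

# A word never seen in training falls back to the assumed probability.
unseen = weighted_average(0.0, 0)

# A word seen 3 times overall, making up 20% of a category's words,
# is pulled only slightly towards the assumed probability.
seen = weighted_average(0.2, 3)

puts unseen
puts seen
```

The per-word values are then multiplied together and scaled by the share of training items in the category, which is what `text_prob` does in the library.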

I ran a benchmark on 1345 items that I had previously classified by
hand into multiple categories. Here's the rate at which the 2
algorithms correctly detected one of those categories:

- Bayes: 79.26%
- Tf-Idf: 81.34%

I prefer the Naive Bayes approach, because while having lower stats on
this benchmark, it seems to make better decisions than I did in many
cases. For example, an item with title *"Paintball Session, 100 Balls
and Equipment"* was classified as *"Activities"* by me, but the bayes
classifier identified it as *"Sports"*, at which point I had an
intellectual orgasm. Also, the Tf-Idf classifier seems to do better on
clear-cut cases, but doesn't seem to handle uncertainty so well. Of
course, these are just quick tests I made and I have no idea which is
really better.
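
To make the Tf-Idf side of this comparison more concrete, here is a minimal sketch of the per-word score as computed by `word_prob` in `lib/stuff-classifier/tf-idf.rb`. The `tf_idf` helper and its parameter names are hypothetical; only the formula is taken from the library.

```ruby
# Standalone sketch; does not require the gem.
# word_in_cat:    times the word appeared in the category's training texts
# docs_in_cat:    number of training texts for the category
# total_cats:     number of known categories
# cats_with_word: number of categories in which the word appeared
def tf_idf(word_in_cat, docs_in_cat, total_cats, cats_with_word)
  tf  = word_in_cat.to_f / docs_in_cat
  idf = Math.log10((total_cats + 2) / (cats_with_word + 1.0))
  tf * idf
end

# With 2 categories, a word seen in only one of them gets a higher
# weight than a word seen in both.
exclusive = tf_idf(3, 5, 2, 1)
shared    = tf_idf(3, 5, 2, 2)

puts exclusive
puts shared
```

The score of a text for a category is the sum of these per-word scores, so words that appear in every category contribute little.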

## Install

```bash
gem install stuff-classifier
```

## Usage

You either instantiate one class or the other. Both have the same
signature:

```ruby
require 'stuff-classifier'

# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")

# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")

# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)

# also by default, the parsing phase filters out stop words; to
# disable this or to supply your own list of stop words, set it on
# the tokenizer of a classifier instance:
cls.tokenizer.ignore_words = [ 'the', 'my', 'i', 'dont' ]
 ```
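
For intuition about what the stop-word filter does, here is a simplified sketch. It is an assumption-laden stand-in: the real tokenizer segments text with `Rseg` and stems with `Lingua::Stemmer`, whereas this version splits with a plain regexp and does no stemming. `simple_tokens` and `STOP_WORDS` are hypothetical names.

```ruby
require 'set'

# Standalone sketch; does not require the gem.
# Assumption: words are split with a plain regexp and are not stemmed.
STOP_WORDS = Set.new(%w[the my i dont is to on])

def simple_tokens(text, ignore_words = STOP_WORDS)
  text.downcase
      .gsub(/['`]/, '')        # same apostrophe stripping as the "en" properties
      .scan(/\p{Alpha}+/)      # keep alphabetic runs only
      .reject { |w| ignore_words.member?(w) }
end

p simple_tokens("My cat's favorite place to purr is on my keyboard")
# => ["cats", "favorite", "place", "purr", "keyboard"]
```

Passing an empty list as `ignore_words` keeps every word, which matches the idea of disabling the filter.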

Training the classifier:

```ruby
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
```

And finally, classifying stuff:

```ruby
cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat

cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?") 
#=> :dog
cls.classify("What pet will I love more?")    
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.") 
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
```
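
`classify` also accepts a second argument, a default category that is returned when no category wins clearly, and each classifier exposes `thresholds` and `min_prob` accessors. The sketch below mirrors that decision logic from `lib/stuff-classifier/base.rb` on plain score pairs; `pick_category` is a hypothetical helper and the scores are the ones used in the comment inside `classify`.

```ruby
# Standalone sketch; does not require the gem.
# scores: array of [category, score] pairs, as returned by cat_scores
def pick_category(scores, thresholds: {}, min_prob: 0.0, default: nil)
  best, max_prob = nil, min_prob
  scores.each do |cat, prob|
    best, max_prob = cat, prob if prob > max_prob
  end
  return default unless best

  # the winner must beat every other score by its threshold factor
  threshold = thresholds[best] || 1.0
  scores.each do |cat, prob|
    next if cat == best
    return default if prob * threshold > max_prob
  end
  best
end

clear = pick_category([[:spam, 0.73], [:ham, 0.40]], thresholds: { spam: 1.2 }, default: :unknown)
close = pick_category([[:spam, 0.80], [:ham, 0.70]], thresholds: { spam: 1.2 }, default: :unknown)

p clear  # :spam, since 0.40 * 1.2 = 0.48 is below 0.73
p close  # :unknown, since 0.70 * 1.2 = 0.84 exceeds 0.80
```

With the gem itself, the equivalent would presumably be setting `cls.thresholds = { :spam => 1.2 }` and calling `cls.classify(text, :unknown)`.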

## Persistence

The following layers for saving the training data between sessions are
implemented:

- in memory (by default)
- on disk
- Redis
- (coming soon) in an RDBMS

To persist the data in Redis, you can do this:

```ruby
# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)

# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})
```

To persist the data on disk, you can do this:

```ruby
store = StuffClassifier::FileStorage.new(@storage_path)

# global setting
StuffClassifier::Base.storage = store

# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)

# after training is done, to persist the data ...
cls.save_state

# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
  # save_state is called automatically when the block returns
end

# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
```
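
Under the hood, the file storage serializes one hash with `Marshal` and writes it to the given path. The following sketch shows that round trip in isolation; `write_storage` and `read_storage` are hypothetical helpers, and the sample state is made up.

```ruby
require 'tempfile'

# Standalone sketch; does not require the gem.
# Hypothetical helpers that mirror what FileStorage does with Marshal.
def write_storage(path, storage)
  File.open(path, 'wb') { |fh| fh.write(Marshal.dump(storage)) }
end

def read_storage(path)
  Marshal.load(File.open(path, 'rb') { |fh| fh.read })
end

# The hash layout (classifier name => saved variables) follows storage.rb;
# the sample values are made up.
state = { "Cats or Dogs" => { :training_count => 9, :min_prob => 0.0 } }

Tempfile.create('stuff-classifier-sketch') do |tmp|
  tmp.close
  write_storage(tmp.path, state)
  puts read_storage(tmp.path) == state
end
```

Since `Marshal.load` can instantiate arbitrary objects, a storage file should only be loaded if it comes from a source you trust.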

The name you give your classifier is important, because the data is
loaded and saved based on it. For instance, the following 3 classifiers
are stored in different buckets, independent of each other.

```ruby
cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")	
```
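
The bucket idea can be pictured as a single hash keyed by classifier name, which is how `lib/stuff-classifier/storage.rb` organizes its data. In this sketch `save_bucket` and `purge_bucket` are hypothetical helpers standing in for the storage methods.

```ruby
# Standalone sketch; does not require the gem.
# Hypothetical helpers: one hash, keyed by classifier name.
def save_bucket(storage, name, state)
  storage[name] = state
end

def purge_bucket(storage, name)
  storage.delete(name)
end

storage = {}
save_bucket(storage, "Cats or Dogs", { :training_count => 9 })
save_bucket(storage, "Spam or Ham",  { :training_count => 2 })

# purging one classifier leaves the other bucket untouched
purge_bucket(storage, "Spam or Ham")

p storage.keys
# => ["Cats or Dogs"]
```

Two classifiers created with the same name and the same storage would therefore share one bucket.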

## License

MIT Licensed. See LICENSE.txt for details.




================================================
FILE: Rakefile
================================================
require 'bundler/setup'
require 'rake/testtask'
require 'stuff-classifier'

Rake::TestTask.new(:test) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/**/test_*.rb'
  test.verbose = true
end

task :default => :test



================================================
FILE: lib/stuff-classifier/base.rb
================================================
# -*- encoding : utf-8 -*-

class StuffClassifier::Base
  extend StuffClassifier::Storage::ActAsStorable
  attr_reader :name
  attr_reader :word_list
  attr_reader :category_list
  attr_reader :training_count

  attr_accessor :tokenizer
  attr_accessor :language

  attr_accessor :thresholds
  attr_accessor :min_prob


  storable :version,:word_list,:category_list,:training_count,:thresholds,:min_prob

  # opts :
  # language
  # stemming : true | false
  # weight
  # assumed_prob
  # storage
  # purge_state ?

  def initialize(name, opts={})
    @version = StuffClassifier::VERSION

    @name = name

    # These values start out empty and may be replaced by load_state
    @word_list = {}
    @category_list = {}
    @training_count=0

    # storage
    purge_state = opts[:purge_state]
    @storage = opts[:storage] || StuffClassifier::Base.storage
    unless purge_state
      @storage.load_state(self)
    else
      @storage.purge_state(self)
    end

    # These values can be passed in on initialization; otherwise the
    # values restored by load_state (if any) are kept
    @thresholds = opts[:thresholds] || @thresholds || {}
    @min_prob = opts[:min_prob] || @min_prob || 0.0


    @ignore_words = nil
    @tokenizer = StuffClassifier::Tokenizer.new(opts)

  end

  def incr_word(word, category)
    @word_list[word] ||= {}

    @word_list[word][:categories] ||= {}
    @word_list[word][:categories][category] ||= 0
    @word_list[word][:categories][category] += 1

    @word_list[word][:_total_word] ||= 0
    @word_list[word][:_total_word] += 1


    # word count by category
    @category_list[category] ||= {}
    @category_list[category][:_total_word] ||= 0
    @category_list[category][:_total_word] += 1

  end

  def incr_cat(category)
    @category_list[category] ||= {}
    @category_list[category][:_count] ||= 0
    @category_list[category][:_count] += 1

    @training_count ||= 0
    @training_count += 1

  end

  # return number of times the word appears in a category
  def word_count(word, category)
    return 0.0 unless @word_list[word] && @word_list[word][:categories] && @word_list[word][:categories][category]
    @word_list[word][:categories][category].to_f
  end

  # return the number of times the word appears in all categories
  def total_word_count(word)
    return 0.0 unless @word_list[word] && @word_list[word][:_total_word]
    @word_list[word][:_total_word].to_f
  end

  # return the number of words in a category
  def total_word_count_in_cat(cat)
    return 0.0 unless @category_list[cat] && @category_list[cat][:_total_word]
    @category_list[cat][:_total_word].to_f
  end

  # return the number of training items
  def total_cat_count
    @training_count
  end

  # return the number of training documents for a category
  def cat_count(category)
    return 0.0 unless @category_list[category] && @category_list[category][:_count]
    @category_list[category][:_count].to_f
  end

  # return the number of categories in which a word appears
  def categories_with_word_count(word)
    return 0 unless @word_list[word] && @word_list[word][:categories]
    @word_list[word][:categories].length
  end

  # return the number of categories
  def total_categories
    categories.length
  end

  # return categories list
  def categories
    @category_list.keys
  end

  # train the classifier
  def train(category, text)
    @tokenizer.each_word(text) {|w| incr_word(w, category) }
    incr_cat(category)
  end

  # classify a text
  def classify(text, default=nil)
    # Find the category with the highest probability
    max_prob = @min_prob
    best = nil

    scores = cat_scores(text)
    scores.each do |score|
      cat, prob = score
      if prob > max_prob
        max_prob = prob
        best = cat
      end
    end

    # Return the default category in case the threshold condition was
    # not met. For example, if the threshold for :spam is 1.2
    #
    #    :spam => 0.73, :ham => 0.40  (OK)
    #    :spam => 0.80, :ham => 0.70  (Fail, :ham is too close)

    return default unless best

    threshold = @thresholds[best] || 1.0

    scores.each do |score|
      cat, prob = score
      next if cat == best
      return default if prob * threshold > max_prob
    end

    return best
  end

  def save_state
    @storage.save_state(self)
  end

  class << self
    attr_writer :storage

    def storage
      @storage = StuffClassifier::InMemoryStorage.new unless defined? @storage
      @storage
    end

    def open(name)
      inst = self.new(name)
      if block_given?
        yield inst
        inst.save_state
      else
        inst
      end
    end
  end
end


================================================
FILE: lib/stuff-classifier/bayes.rb
================================================
# -*- encoding : utf-8 -*-

class StuffClassifier::Bayes < StuffClassifier::Base
  attr_accessor :weight
  attr_accessor :assumed_prob


  # http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  extend StuffClassifier::Storage::ActAsStorable
  storable :weight,:assumed_prob

  def initialize(name, opts={})
    super(name, opts)
    # explicit options win; otherwise keep values restored by load_state
    @weight = opts[:weight] || @weight || 1.0
    @assumed_prob = opts[:assumed_prob] || @assumed_prob || 0.1
  end

  def word_prob(word, cat)
    total_words_in_cat = total_word_count_in_cat(cat)
    return 0.0 if total_words_in_cat == 0
    word_count(word, cat).to_f / total_words_in_cat
  end


  def word_weighted_average(word, cat, opts={})
    func = opts[:func]

    # calculate current probability
    basic_prob = func ? func.call(word, cat) : word_prob(word, cat)

    # count the number of times this word has appeared in all
    # categories
    totals = total_word_count(word)

    # the final weighted average
    (@weight * @assumed_prob + totals * basic_prob) / (@weight + totals)
  end

  def doc_prob(text, category)
    @tokenizer.each_word(text).map {|w|
      word_weighted_average(w, category)
    }.inject(1) {|p,c| p * c}
  end

  def text_prob(text, category)
    cat_prob = cat_count(category) / total_cat_count
    # call the doc_prob method directly instead of shadowing it with a local
    cat_prob * doc_prob(text, category)
  end

  def cat_scores(text)
    probs = {}
    categories.each do |cat|
      probs[cat] = text_prob(text, cat)
    end
    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
  end


  def word_classification_detail(word)

    p "word_prob"
    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
    p result

    p "word_weighted_average"
    result=categories.inject({}) do |h,cat| h[cat]=word_weighted_average(word,cat);h end
    p result

    p "doc_prob"
    result=categories.inject({}) do |h,cat| h[cat]=doc_prob(word,cat);h end
    p result

    p "text_prob"
    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
    p result


  end

end


================================================
FILE: lib/stuff-classifier/storage.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier

  class Storage
    module ActAsStorable
      def storable(*to_store)
        @to_store = to_store
      end
      def to_store
        @to_store || []
      end
    end

    attr_accessor :storage

    def initialize(*opts)
      @storage = {}
    end

    def storage_to_classifier(classifier)
      if @storage.key? classifier.name
        @storage[classifier.name].each do |var,value|
          classifier.instance_variable_set "@#{var}",value
        end
      end
    end

    def classifier_to_storage(classifier)
      to_store = classifier.class.to_store + classifier.class.superclass.to_store
      @storage[classifier.name] =  to_store.inject({}) {|h,var| h[var] = classifier.instance_variable_get("@#{var}");h}
    end

    def clear_storage(classifier)
      @storage.delete(classifier.name)
    end

  end

  class InMemoryStorage < Storage
    def initialize
      super
    end

    def load_state(classifier)
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
    end

    def purge_state(classifier)
      clear_storage(classifier)
    end

  end

  class FileStorage < Storage
    def initialize(path)
      super
      @path = path
    end

    def load_state(classifier)
      if @storage.length == 0 && File.exist?(@path)
        data = File.open(@path, 'rb') { |f| f.read }
        @storage = Marshal.load(data)
      end
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
      _write_to_file
    end

    def purge_state(classifier)
      clear_storage(classifier)
      _write_to_file
    end

    def _write_to_file
      File.open(@path, 'wb') do |fh|
        fh.flock(File::LOCK_EX)
        fh.write(Marshal.dump(@storage))
      end
    end

  end

  class RedisStorage < Storage
    def initialize(key, redis_options=nil)
      super
      @key = key
      @redis = Redis.new(redis_options || {})
    end

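    def load_state(classifier)
      if @storage.length == 0
        # GET returns nil for a missing key, so no separate EXISTS call is
        # needed (its return type differs between redis gem versions)
        data = @redis.get(@key)
        @storage = Marshal.load(data) if data
      end
      storage_to_classifier(classifier)
    end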

    def save_state(classifier)
      classifier_to_storage(classifier)
      _write_to_redis
    end

    def purge_state(classifier)
      clear_storage(classifier)
      _write_to_redis
    end

    private
    def _write_to_redis
      data = Marshal.dump(@storage)
      @redis.set(@key, data)
    end
  end
end


================================================
FILE: lib/stuff-classifier/tf-idf.rb
================================================
# -*- encoding : utf-8 -*-
class StuffClassifier::TfIdf < StuffClassifier::Base
  extend StuffClassifier::Storage::ActAsStorable

  def initialize(name, opts={})
    super(name, opts)
  end


  def word_prob(word, cat)
    word_cat_nr = word_count(word, cat)
    cat_nr = cat_count(cat)

    tf = 1.0 * word_cat_nr / cat_nr

    idf = Math.log10((total_categories + 2) / (categories_with_word_count(word) + 1.0))
    tf * idf
  end

  def text_prob(text, cat)
    @tokenizer.each_word(text).map{|w| word_prob(w, cat)}.inject(0){|s,p| s + p}
  end

  def cat_scores(text)
    probs = {}
    categories.each do |cat|
      p = text_prob(text, cat)
      probs[cat] = p
    end
    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
  end

  def word_classification_detail(word)

    p "tf_idf"
    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
    p result

    p "text_prob"
    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
    p result

  end

end


================================================
FILE: lib/stuff-classifier/tokenizer/tokenizer_properties.rb
================================================
# -*- encoding : utf-8 -*-
require 'set'
StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES = {
  "en" => {
    :preprocessing_regexps => {/['`]/ => '',/[_]/ => ' '},
    :stop_word => Set.new([
                            '的','个','得',
                            'a', 'about', 'above', 'across', 'after', 'afterwards',
                            'again', 'against', 'all', 'almost', 'alone', 'along',
                            'already', 'also', 'although', 'always', 'am', 'among',
                            'amongst', 'amoungst', 'amount', 'an', 'and', 'another',
                            'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere',
                            'are', 'around', 'as', 'at', 'back', 'be',
                            'became', 'because', 'become', 'becomes', 'becoming', 'been',
                            'before', 'beforehand', 'behind', 'being', 'below', 'beside',
                            'besides', 'between', 'beyond', 'bill', 'both', 'bottom',
                            'but', 'by', 'call', 'can', 'cannot', 'cant', 'dont',
                            'co', 'computer', 'con', 'could', 'couldnt', 'cry',
                            'de', 'describe', 'detail', 'do', 'done', 'down',
                            'due', 'during', 'each', 'eg', 'eight', 'either',
                            'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every',
                            'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen',
                            'fifty', 'fill', 'find', 'fire', 'first', 'five',
                            'for', 'former', 'formerly', 'forty', 'found', 'four',
                            'from', 'front', 'full', 'further', 'get', 'give',
                            'go', 'had', 'has', 'hasnt', 'have', 'he',
                            'hence', 'her', 'here', 'hereafter', 'hereby', 'herein',
                            'hereupon', 'hers', 'herself', 'him', 'himself', 'his',
                            'how', 'however', 'hundred', 'i', 'ie', 'if',
                            'in', 'inc', 'indeed', 'interest', 'into', 'is',
                            'it', 'its', 'itself', 'keep', 'last', 'latter',
                            'latterly', 'least', 'less', 'ltd', 'made', 'many',
                            'may', 'me', 'meanwhile', 'might', 'mill', 'mine',
                            'more', 'moreover', 'most', 'mostly', 'move', 'much',
                            'must', 'my', 'myself', 'name', 'namely', 'neither',
                            'never', 'nevertheless', 'next', 'nine', 'no', 'nobody',
                            'none', 'noone', 'nor', 'not', 'nothing', 'now',
                            'nowhere', 'of', 'off', 'often', 'on', 'once',
                            'one', 'only', 'onto', 'or', 'other', 'others',
                            'otherwise', 'our', 'ours', 'ourselves', 'out', 'over',
                            'own', 'part', 'per', 'perhaps', 'please', 'put',
                            'rather', 're', 'same', 'see', 'seem', 'seemed',
                            'seeming', 'seems', 'serious', 'several', 'she', 'should',
                            'show', 'side', 'since', 'sincere', 'six', 'sixty',
                            'so', 'some', 'somehow', 'someone', 'something', 'sometime',
                            'sometimes', 'somewhere', 'still', 'such', 'system', 'take',
                            'ten', 'than', 'that', 'the', 'their', 'them',
                            'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
                            'therefore', 'therein', 'thereupon', 'these', 'they', 'thick',
                            'thin', 'third', 'this', 'those', 'though', 'three',
                            'through', 'throughout', 'thru', 'thus', 'to', 'together',
                            'too', 'top', 'toward', 'towards', 'twelve', 'twenty',
                            'two', 'un', 'under', 'until', 'up', 'upon',
                            'us', 'very', 'via', 'was', 'we', 'well',
                            'were', 'what', 'whatever', 'when', 'whence', 'whenever',
                            'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
                            'wherever', 'whether', 'which', 'while', 'whither', 'who',
                            'whoever', 'whole', 'whom', 'whose', 'why', 'will',
                            'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours',
                            'yourself', 'yourselves'
    ])
  },
  "fr" => {
    :stop_word => Set.new([
                            'au',  'aux',  'avec',  'ce',  'ces',  'dans',  'de',  'des',  'du',  'elle',  'en',  'et',  'eux',
                            'il',  'je',  'la',  'le',  'leur',  'lui',  'ma',  'mais',  'me',  'même',  'mes',  'moi',  'mon',
                            'ne',  'nos',  'notre',  'nous',  'on',  'ou',  'par',  'pas',  'pour',  'qu',  'que',  'qui',  'sa',
                            'se',  'ses',  'son',  'sur',  'ta',  'te',  'tes',  'toi',  'ton',  'tu',  'un',  'une',  'vos',  'votre',
                            'vous',  'c',  'd',  'j',  'l',  'à',  'm',  'n',  's',  't',  'y',  'été',  'étée',  'étées',
                            'étés',  'étant',  'suis',  'es',  'est',  'sommes',  'êtes',  'sont',  'serai',  'seras',
                            'sera',  'serons',  'serez',  'seront',  'serais',  'serait',  'serions',  'seriez',  'seraient',
                            'étais',  'était',  'étions',  'étiez',  'étaient',  'fus',  'fut',  'fûmes',  'fûtes',
                            'furent',  'sois',  'soit',  'soyons',  'soyez',  'soient',  'fusse',  'fusses',  'fût',
                            'fussions',  'fussiez',  'fussent',  'ayant',  'eu',  'eue',  'eues',  'eus',  'ai',  'as',
                            'avons',  'avez',  'ont',  'aurai',  'auras',  'aura',  'aurons',  'aurez',  'auront',  'aurais',
                            'aurait',  'aurions',  'auriez',  'auraient',  'avais',  'avait',  'avions',  'aviez',  'avaient',
                            'eut',  'eûmes',  'eûtes',  'eurent',  'aie',  'aies',  'ait',  'ayons',  'ayez',  'aient',  'eusse',
                            'eusses',  'eût',  'eussions',  'eussiez',  'eussent',  'ceci',  'celà',  'cet',  'cette',  'ici',
                            'ils',  'les',  'leurs',  'quel',  'quels',  'quelle',  'quelles',  'sans',  'soi'
    ])
  },
  "de" => {
    :stop_word => Set.new([
                            'aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere',
                            'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf',
                            'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das',
                            'daß', 'dass', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe',
                            'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du',
                            'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen',
                            'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es',
                            'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat',
                            'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres',
                            'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener',
                            'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche',
                            'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach',
                            'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst',
                            'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über',
                            'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst',
                            'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will',
                            'wir', 'wird', 'wirst', 'wo', 'wollen', 'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen'
    ])
  }
}


================================================
FILE: lib/stuff-classifier/tokenizer.rb
================================================
# -*- encoding : utf-8 -*-
require "lingua/stemmer"
require "rseg"

class StuffClassifier::Tokenizer
  require  "stuff-classifier/tokenizer/tokenizer_properties"

  def initialize(opts={})
    @language = opts.key?(:language) ? opts[:language] : "en"
    @properties = StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES[@language]

    @stemming = opts.key?(:stemming) ? opts[:stemming] : true
    if @stemming
      @stemmer = Lingua::Stemmer.new(:language => @language)
    end
  end

  def language
    @language
  end

  def preprocessing_regexps=(value)
    @preprocessing_regexps = value
  end

  def preprocessing_regexps
    @preprocessing_regexps || @properties[:preprocessing_regexps]
  end

  def ignore_words=(value)
    @ignore_words = value
  end

  def ignore_words
    @ignore_words || @properties[:stop_word]
  end

  def stemming?
    @stemming || false
  end

  def each_word(string)
    string = string.strip
    return [] if string == ''

    words = []

    # tokenize string
    string.split("\n").each do |line|

      # Apply preprocessing regexps
      if preprocessing_regexps
        preprocessing_regexps.each { |regexp,replace_by| line.gsub!(regexp, replace_by) }
      end

      Rseg.segment(line).each do |w|
        next if w == '' || ignore_words.member?(w.downcase)

        if stemming? and stemable?(w)
          w = @stemmer.stem(w).downcase
          next if ignore_words.member?(w)
        else
          w = w.downcase
        end

        words << (block_given? ? (yield w) : w)
      end
    end

    return words
  end

  private

  def stemable?(word)
    # only purely alphabetic tokens are passed to the stemmer
    word =~ /^\p{Alpha}+$/
  end

end


================================================
FILE: lib/stuff-classifier/version.rb
================================================
module StuffClassifier
  VERSION = '0.5'
end


================================================
FILE: lib/stuff-classifier.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier
  autoload :VERSION,    'stuff-classifier/version'

  autoload :Storage, 'stuff-classifier/storage'
  autoload :InMemoryStorage, 'stuff-classifier/storage'
  autoload :FileStorage,     'stuff-classifier/storage'
  autoload :RedisStorage, 'stuff-classifier/storage'

  autoload :Tokenizer,  'stuff-classifier/tokenizer'
  autoload :TOKENIZER_PROPERTIES, 'stuff-classifier/tokenizer/tokenizer_properties'

  autoload :Base,       'stuff-classifier/base'
  autoload :Bayes,      'stuff-classifier/bayes'
  autoload :TfIdf,      'stuff-classifier/tf-idf'

end


================================================
FILE: stuff-classifier.gemspec
================================================
# -*- encoding: utf-8 -*-
$:.push File.expand_path("../lib", __FILE__)
require "stuff-classifier/version"

Gem::Specification.new do |s|
  s.name        = "stuff-classifier"
  s.version     = StuffClassifier::VERSION
  s.authors     = ["Alexandru Nedelcu"]
  s.email       = ["github@contact.bionicspirit.com"]
  s.homepage    = "https://github.com/alexandru/stuff-classifier/"
  s.summary     = %q{Simple text classifier(s) implementation}
  s.description = %q{2 methods are provided for now - (1) naive bayes implementation + (2) tf-idf weights}

  s.files         = `git ls-files`.split("\n")
  s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
  s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
  s.require_paths = ["lib"]

  s.required_ruby_version = '>= 1.9.1'

  s.add_runtime_dependency "ruby-stemmer"
  s.add_runtime_dependency "sequel"
  s.add_runtime_dependency "redis"


  s.add_development_dependency "bundler"
  s.add_development_dependency "rake", ">= 0.9.2"
  s.add_development_dependency "minitest", "~> 4"
  s.add_development_dependency "turn", ">= 0.8.3"
  s.add_development_dependency "simplecov"
  s.add_development_dependency "awesome_print"
  s.add_development_dependency "ruby-debug19"
  # lib/stuff-classifier/tokenizer.rb requires "rseg" at load time, so it is a runtime dependency
  s.add_runtime_dependency "rseg"

end



================================================
FILE: test/helper.rb
================================================
# -*- encoding : utf-8 -*-
require 'simplecov'
SimpleCov.start

require 'turn'
require 'minitest/autorun'
require 'stuff-classifier'

Turn.config do |c|
 # use one of output formats:
 # :outline  - turn's original case/test outline mode [default]
 # :progress - indicates progress with progress bar
 # :dotted   - test/unit's traditional dot-progress mode
 # :pretty   - new pretty reporter
 # :marshal  - dump output as YAML (normal run mode only)
 # :cue      - interactive testing
 c.format  = :cue
 # turn on invoke/execute tracing, enable full backtrace
 c.trace   = true
 # use humanized test names (works only with :outline format)
 c.natural = true
end

class TestBase < MiniTest::Unit::TestCase
  def self.before(&block)
    @on_setup = block if block
    @on_setup
  end

  def setup
    on_setup = self.class.before
    instance_eval(&on_setup) if on_setup
  end

  def set_classifier(instance)
    @classifier = instance
  end
  def classifier
    @classifier
  end


  def train(category, value)
    @classifier.train(category, value)
  end

  def should_be(category, value)
    assert_equal category, @classifier.classify(value), value
  end
end


================================================
FILE: test/test_001_tokenizer.rb
================================================
# -*- coding: utf-8 -*-
require 'helper'

class Test001Tokenizer < TestBase
  before do
    @en_tokenizer = StuffClassifier::Tokenizer.new
    @fr_tokenizer = StuffClassifier::Tokenizer.new(:language => "fr")
  end

  def test_simple_tokens
    words = @en_tokenizer.each_word('Hello world! How are you?')
    should_return = ["hello", "world"]

    assert_equal should_return, words
  end

  def test_with_stemming
    words = @en_tokenizer.each_word('Lots of dogs, lots of cats! This really is the information highway')
    should_return = ["lot", "dog", "lot", "cat", "realli", "inform", "highway"]

    assert_equal should_return, words
  end

  def test_complicated_tokens
    words = @en_tokenizer.each_word("I don't really get what you want to
      accomplish. There is a class TestEval2, you can do test_eval2 =
      TestEval2.new afterwards. And: class A ... end always yields nil, so
      your output is ok I guess ;-)")

    should_return = [
      "realli", "want", "accomplish", "class",
      "testeval2",  "test", "eval2","testeval2", "new", "class", "end",
      "yield", "nil", "output", "ok", "guess"]

    assert_equal should_return, words
  end

  def test_unicode

    words = @fr_tokenizer.each_word("il s'appelle le vilain petit canard : en référence à Hans Christian Andersen, se démarquer négativement")

    should_return = [
      "appel", "vilain", "pet", "canard", "référent",
      "han", "christian", "andersen", "démarqu", "négat"]

    assert_equal should_return, words
  end

end


================================================
FILE: test/test_002_base.rb
================================================
require 'helper'


class Test002Base < TestBase
  before do
    @cls = StuffClassifier::Bayes.new("Cats or Dogs")
    set_classifier @cls
    
    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"    
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_count 
    assert @cls.total_cat_count == 9
    assert @cls.categories.map {|c| @cls.cat_count(c)}.inject(0){|s,count| s+count} == 9
    

    # compare word count sum to word by cat count sum 
    assert @cls.word_list.map  {|w| @cls.total_word_count(w[0]) }.inject(0)  {|s,count| s+count}  == 58
    assert @cls.categories.map {|c| @cls.total_word_count_in_cat(c) }.inject(0){|s,count| s+count}  == 58

    # test word count by categories
    assert @cls.word_list.map {|w| @cls.word_count(w[0],:dog) }.inject(0)  {|s,count| s+count}  == 29
    assert @cls.word_list.map {|w| @cls.word_count(w[0],:cat) }.inject(0)  {|s,count| s+count}  == 29

    # for all categories
    assert @cls.categories.map {|c| @cls.word_list.map {|w| @cls.word_count(w[0],c) }.inject(0) {|s,count| s+count} }.inject(0){|s,count| s+count}  == 58

  end

end


================================================
FILE: test/test_003_naive_bayes.rb
================================================
require 'helper'


class Test003NaiveBayesClassification < TestBase
  before do
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
    
    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"    
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_for_cats 
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
  end

  def test_for_dogs
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie." 
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end

  def test_min_prob
    classifier.min_prob = 0.001
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be nil, "The most annoying animal on earth."
    should_be nil, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be nil, "I like big buts and I cannot lie." 
    should_be nil, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end


end


================================================
FILE: test/test_004_tf_idf.rb
================================================
require 'helper'


class Test004TfIdfClassification < TestBase
  before do
    set_classifier StuffClassifier::TfIdf.new("Cats or Dogs")
    
    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"    
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_for_cats 
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
  end

  def test_for_dogs
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie." 
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who is eating my meat?"
  end
end


================================================
FILE: test/test_005_in_memory_storage.rb
================================================
require 'helper'


class Test005InMemoryStorage < TestBase
  before do
    StuffClassifier::Base.storage = StuffClassifier::InMemoryStorage.new

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|    
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
    end
  end

  def test_for_persistance
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::InMemoryStorage),
        "@storage should be an instance of InMemoryStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end


================================================
FILE: test/test_006_file_storage.rb
================================================
require 'helper'


class Test006FileStorage < TestBase
  before do
    @storage_path = "/tmp/test_classifier.db"
    @storage = StuffClassifier::FileStorage.new(@storage_path)
    StuffClassifier::Base.storage = @storage

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|    
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
      cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
      cls.train(:dog, "So which one should you choose? A dog, definitely.")
      cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
      cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

      cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
      cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
      cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")    
    end

    # redefining storage instance, forcing it to read from file again
    StuffClassifier::Base.storage = StuffClassifier::FileStorage.new(@storage_path)
  end

  def teardown
    # File.exists? is deprecated and was removed in Ruby 3.2; File.exist? works everywhere
    File.unlink @storage_path if File.exist? @storage_path
  end

  def test_result    
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
    
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"

    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?" 
    should_be :dog, "What pet will I love more?"    
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie." 
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
    
  end

  def test_for_persistance    
    assert ! @storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"

    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_file_created
    assert File.exist?(@storage_path), "File #@storage_path should exist"

    content = File.read(@storage_path)
    assert content.length > 100, "Serialized content should have more than 100 chars"
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end


================================================
FILE: test/test_007_redis_storage.rb
================================================
require 'helper'
require 'redis'


class Test007RedisStorage < TestBase
  before do
    @key = "test_classifier"
    @redis_options = { host: 'localhost', port: 6379 }
    @redis = Redis.new(@redis_options)

    @storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
    StuffClassifier::Base.storage = @storage

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
      cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
      cls.train(:dog, "So which one should you choose? A dog, definitely.")
      cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
      cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")

      cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
      cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
      cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
    end

    # redefining storage instance, forcing it to read from Redis again
    StuffClassifier::Base.storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
  end

  def teardown
    @redis.del(@key)
  end

  def test_result
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")

    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"

    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie."
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"

  end

  def test_for_persistance
    assert !@storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"

    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_key_created
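    # `exists?` (redis-rb 4.2 or later) returns a boolean. Newer redis-rb versions
    # return an Integer from `exists`, and 0 is truthy in Ruby, so that form cannot fail.
    assert @redis.exists?(@key), "Redis key #{@key} should exist"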

    content = @redis.get(@key)
    assert content.length > 100, "Serialized content should have more than 100 chars"
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end

SYMBOL INDEX (105 symbols across 15 files)

FILE: lib/stuff-classifier.rb
  type StuffClassifier (line 2) | module StuffClassifier

FILE: lib/stuff-classifier/base.rb
  class StuffClassifier::Base (line 3) | class StuffClassifier::Base
    method initialize (line 27) | def initialize(name, opts={})
    method incr_word (line 56) | def incr_word(word, category)
    method incr_cat (line 74) | def incr_cat(category)
    method word_count (line 85) | def word_count(word, category)
    method total_word_count (line 91) | def total_word_count(word)
    method total_word_count_in_cat (line 97) | def total_word_count_in_cat(cat)
    method total_cat_count (line 103) | def total_cat_count
    method cat_count (line 108) | def cat_count(category)
    method categories_with_word_count (line 113) | def categories_with_word_count(word)
    method total_categories (line 119) | def total_categories
    method categories (line 124) | def categories
    method train (line 129) | def train(category, text)
    method classify (line 135) | def classify(text, default=nil)
    method save_state (line 168) | def save_state
    method storage (line 175) | def storage
    method open (line 180) | def open(name)

FILE: lib/stuff-classifier/bayes.rb
  class StuffClassifier::Bayes (line 3) | class StuffClassifier::Bayes < StuffClassifier::Base
    method initialize (line 12) | def initialize(name, opts={})
    method word_prob (line 18) | def word_prob(word, cat)
    method word_weighted_average (line 25) | def word_weighted_average(word, cat, opts={})
    method doc_prob (line 39) | def doc_prob(text, category)
    method text_prob (line 45) | def text_prob(text, category)
    method cat_scores (line 51) | def cat_scores(text)
    method word_classification_detail (line 60) | def word_classification_detail(word)

FILE: lib/stuff-classifier/storage.rb
  type StuffClassifier (line 2) | module StuffClassifier
    class Storage (line 4) | class Storage
      type ActAsStorable (line 5) | module ActAsStorable
        function storable (line 6) | def storable(*to_store)
        function to_store (line 9) | def to_store
      method initialize (line 16) | def initialize(*opts)
      method storage_to_classifier (line 20) | def storage_to_classifier(classifier)
      method classifier_to_storage (line 28) | def classifier_to_storage(classifier)
      method clear_storage (line 33) | def clear_storage(classifier)
    class InMemoryStorage (line 39) | class InMemoryStorage < Storage
      method initialize (line 40) | def initialize
      method load_state (line 44) | def load_state(classifier)
      method save_state (line 48) | def save_state(classifier)
      method purge_state (line 52) | def purge_state(classifier)
    class FileStorage (line 58) | class FileStorage < Storage
      method initialize (line 59) | def initialize(path)
      method load_state (line 64) | def load_state(classifier)
      method save_state (line 72) | def save_state(classifier)
      method purge_state (line 77) | def purge_state(classifier)
      method _write_to_file (line 82) | def _write_to_file
    class RedisStorage (line 91) | class RedisStorage < Storage
      method initialize (line 92) | def initialize(key, redis_options=nil)
      method load_state (line 98) | def load_state(classifier)
      method save_state (line 106) | def save_state(classifier)
      method purge_state (line 111) | def purge_state(classifier)
      method _write_to_redis (line 117) | def _write_to_redis

FILE: lib/stuff-classifier/tf-idf.rb
  class StuffClassifier::TfIdf (line 2) | class StuffClassifier::TfIdf < StuffClassifier::Base
    method initialize (line 5) | def initialize(name, opts={})
    method word_prob (line 10) | def word_prob(word, cat)
    method text_prob (line 20) | def text_prob(text, cat)
    method cat_scores (line 24) | def cat_scores(text)
    method word_classification_detail (line 33) | def word_classification_detail(word)

FILE: lib/stuff-classifier/tokenizer.rb
  class StuffClassifier::Tokenizer (line 5) | class StuffClassifier::Tokenizer
    method initialize (line 8) | def initialize(opts={})
    method language (line 18) | def language
    method preprocessing_regexps= (line 22) | def preprocessing_regexps=(value)
    method preprocessing_regexps (line 26) | def preprocessing_regexps
    method ignore_words= (line 30) | def ignore_words=(value)
    method ignore_words (line 34) | def ignore_words
    method stemming? (line 38) | def stemming?
    method each_word (line 42) | def each_word(string)
    method stemable? (line 75) | def stemable?(word)

FILE: lib/stuff-classifier/version.rb
  type StuffClassifier (line 1) | module StuffClassifier

FILE: test/helper.rb
  class TestBase (line 24) | class TestBase < MiniTest::Unit::TestCase
    method before (line 25) | def self.before(&block)
    method setup (line 30) | def setup
    method set_classifier (line 35) | def set_classifier(instance)
    method classifier (line 38) | def classifier
    method train (line 43) | def train(category, value)
    method should_be (line 47) | def should_be(category, value)

FILE: test/test_001_tokenizer.rb
  class Test001Tokenizer (line 4) | class Test001Tokenizer < TestBase
    method test_simple_tokens (line 10) | def test_simple_tokens
    method test_with_stemming (line 17) | def test_with_stemming
    method test_complicated_tokens (line 25) | def test_complicated_tokens
    method test_unicode (line 39) | def test_unicode

FILE: test/test_002_base.rb
  class Test002Base (line 4) | class Test002Base < TestBase
    method test_count (line 20) | def test_count

FILE: test/test_003_naive_bayes.rb
  class Test003NaiveBayesClassification (line 4) | class Test003NaiveBayesClassification < TestBase
    method test_for_cats (line 19) | def test_for_cats
    method test_for_dogs (line 28) | def test_for_dogs
    method test_min_prob (line 38) | def test_min_prob

FILE: test/test_004_tf_idf.rb
  class Test004TfIdfClassification (line 4) | class Test004TfIdfClassification < TestBase
    method test_for_cats (line 19) | def test_for_cats
    method test_for_dogs (line 28) | def test_for_dogs

FILE: test/test_005_in_memory_storage.rb
  class Test005InMemoryStorage (line 4) | class Test005InMemoryStorage < TestBase
    method test_for_persistance (line 14) | def test_for_persistance
    method test_purge_state (line 24) | def test_purge_state

FILE: test/test_006_file_storage.rb
  class Test006FileStorage (line 4) | class Test006FileStorage < TestBase
    method teardown (line 27) | def teardown
    method test_result (line 31) | def test_result
    method test_for_persistance (line 51) | def test_for_persistance
    method test_file_created (line 62) | def test_file_created
    method test_purge_state (line 69) | def test_purge_state

FILE: test/test_007_redis_storage.rb
  class Test007RedisStorage (line 5) | class Test007RedisStorage < TestBase
    method teardown (line 31) | def teardown
    method test_result (line 35) | def test_result
    method test_for_persistance (line 55) | def test_for_persistance
    method test_key_created (line 66) | def test_key_created
    method test_purge_state (line 73) | def test_purge_state
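
The Bayes entries in the index above (word_prob, word_weighted_average, text_prob, cat_scores) and the Cats-or-Dogs tests suggest the usual naive Bayes flow: count words per category while training, smooth each word's probability toward an assumed prior, then multiply the word probabilities by the category prior and pick the best-scoring category. The following is a standalone sketch of that flow, not the gem's code. TinyBayes, its regexp tokenizer (no stemming or stop words), and the weight and assumed_prob defaults are all hypothetical, and the real implementation in lib/stuff-classifier/bayes.rb may differ.

```ruby
# Standalone illustration only; TinyBayes is a hypothetical class, not part of
# stuff-classifier. It uses a naive regexp tokenizer instead of the gem's Tokenizer.
class TinyBayes
  def initialize(weight: 1.0, assumed_prob: 0.1)
    @weight       = weight        # how strongly the assumed prior pulls rare words
    @assumed_prob = assumed_prob  # probability assumed for a word never seen before
    @word_counts  = Hash.new { |h, k| h[k] = Hash.new(0) } # word => {category => count}
    @cat_docs     = Hash.new(0)   # category => number of training documents
    @cat_words    = Hash.new(0)   # category => total number of words seen
  end

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  def train(category, text)
    @cat_docs[category] += 1
    tokenize(text).each do |word|
      @word_counts[word][category] += 1
      @cat_words[category] += 1
    end
  end

  # Raw frequency of a word within one category.
  def word_prob(word, category)
    return 0.0 if @cat_words[category].zero?
    @word_counts.fetch(word, {}).fetch(category, 0).to_f / @cat_words[category]
  end

  # Weighted average of the raw frequency and the assumed prior, so that a word
  # seen only a few times does not dominate the score.
  def weighted_prob(word, category)
    total = @word_counts.fetch(word, {}).values.sum
    (@weight * @assumed_prob + total * word_prob(word, category)) / (@weight + total)
  end

  # Category prior multiplied by the smoothed probability of every token.
  def score(text, category)
    prior = @cat_docs[category].to_f / @cat_docs.values.sum
    tokenize(text).inject(prior) { |acc, word| acc * weighted_prob(word, category) }
  end

  def classify(text)
    @cat_docs.keys.max_by { |category| score(text, category) }
  end
end

classifier = TinyBayes.new
classifier.train(:dog, "dogs bark loudly")
classifier.train(:cat, "cats purr softly")
puts classifier.classify("purr")
puts classifier.classify("bark bark")
```

With one training document per category, the two calls at the end print cat and dog respectively. A text made only of unseen words scores the same for every category, which is the situation the gem's min_prob option and the default argument to classify appear to address.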
