Repository: alexandru/stuff-classifier
Branch: master
Commit: eceef3207ef0
Files: 22
Total size: 44.4 KB
Directory structure:
gitextract_9kt2n9by/
├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── lib/
│ ├── stuff-classifier/
│ │ ├── base.rb
│ │ ├── bayes.rb
│ │ ├── storage.rb
│ │ ├── tf-idf.rb
│ │ ├── tokenizer/
│ │ │ └── tokenizer_properties.rb
│ │ ├── tokenizer.rb
│ │ └── version.rb
│ └── stuff-classifier.rb
├── stuff-classifier.gemspec
└── test/
├── helper.rb
├── test_001_tokenizer.rb
├── test_002_base.rb
├── test_003_naive_bayes.rb
├── test_004_tf_idf.rb
├── test_005_in_memory_storage.rb
├── test_006_file_storage.rb
└── test_007_redis_storage.rb
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.rvmrc
coverage/
.DS_Store
*.gem
utils.rb
Gemfile.lock
================================================
FILE: Gemfile
================================================
source "https://rubygems.org"
gemspec
================================================
FILE: LICENSE.txt
================================================
Copyright (c) 2012 Alexandru Nedelcu
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
================================================
FILE: README.md
================================================
# stuff-classifier
## No longer maintained
This repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that I can place a link here.
## Description
A library for classifying text into multiple categories.
Currently provided classifiers:
- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
I ran a benchmark on 1345 items that I had previously classified by
hand, each with multiple categories. Here's the rate at which the 2
algorithms correctly detected one of those categories:
- Bayes: 79.26%
- Tf-Idf: 81.34%
I prefer the Naive Bayes approach, because while having lower stats on
this benchmark, it seems to make better decisions than I did in many
cases. For example, an item with title *"Paintball Session, 100 Balls
and Equipment"* was classified as *"Activities"* by me, but the bayes
classifier identified it as *"Sports"*, at which point I had an
intellectual orgasm. Also, the Tf-Idf classifier seems to do better on
clear-cut cases, but doesn't seem to handle uncertainty so well. Of
course, these are just quick tests I made and I have no idea which is
really better.
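For intuition, here is a rough, gem-free sketch of the per-word scores behind the two classifiers. The formulas follow `word_weighted_average` in `bayes.rb` and `word_prob` in `tf-idf.rb`; the method names `bayes_word_score` and `tf_idf_word_score` are hypothetical and all counts are made up for illustration.

```ruby
# Illustrative sketch only: standalone versions of the per-word scores.
# Method names are hypothetical and the counts are invented.

# Naive Bayes: blend the observed probability with an assumed prior, so
# that rarely seen words do not swing the result too much.
def bayes_word_score(count_in_cat, words_in_cat, total_count, weight: 1.0, assumed_prob: 0.1)
  basic_prob = words_in_cat.zero? ? 0.0 : count_in_cat.to_f / words_in_cat
  (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)
end

# Tf-Idf: occurrences per training document of the category, scaled down
# when the word shows up in many categories.
def tf_idf_word_score(count_in_cat, docs_in_cat, total_categories, categories_with_word)
  tf = count_in_cat.to_f / docs_in_cat
  idf = Math.log10((total_categories + 2) / (categories_with_word + 1.0))
  tf * idf
end

unseen = bayes_word_score(0, 29, 0) # never seen: falls back to assumed_prob
seen   = bayes_word_score(3, 29, 4) # seen 4 times overall, 3 times in this category

rare   = tf_idf_word_score(3, 5, 2, 1) # word appears in 1 of 2 categories
common = tf_idf_word_score(3, 5, 2, 2) # word appears in both categories
```

With these numbers an unseen word scores exactly the assumed probability under Bayes, and under Tf-Idf the word found in only one category scores higher than the same word found in both.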
## Install
```bash
gem install stuff-classifier
```
## Usage
You either instantiate one class or the other. Both have the same
signature:
```ruby
require 'stuff-classifier'
# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")
# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")
# these classifiers use word stemming by default, but if it has weird
# behavior, then you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)
# also by default, the parsing phase filters out stop words; to
# disable this or to come up with your own list of stop words, set
# the list on the tokenizer of a classifier instance:
cls.tokenizer.ignore_words = [ 'the', 'my', 'i', 'dont' ]
```
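To make the stop-word filtering concrete, here is a simplified sketch that runs without the gem. It is not the library's tokenizer: it splits on non-letter characters instead of using Rseg and does no stemming. The name `simple_tokens` and the short stop-word list are assumptions for illustration.

```ruby
require 'set'

# Hypothetical, simplified stand-in for the tokenizer: no stemming, and a
# plain regexp split instead of Rseg segmentation.
STOP_WORDS = Set.new(%w[the my i dont is on to a])

def simple_tokens(text, ignore_words = STOP_WORDS)
  cleaned = text.gsub(/['`]/, '').downcase  # "don't" becomes "dont"
  words = cleaned.split(/[^[:alpha:]]+/)    # split on anything that is not a letter
  words.reject { |w| w.empty? || ignore_words.member?(w) }
end

tokens   = simple_tokens("I don't like the keyboard on my desk")
unfilter = simple_tokens("the cat", []) # an empty list disables filtering
```

Here `tokens` keeps only `like`, `keyboard` and `desk`, while passing an empty list leaves every word in place.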
Training the classifier:
```ruby
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
```
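Training only updates counters (see `incr_word` and `incr_cat` in `base.rb`). A rough standalone sketch of that bookkeeping follows; the `TinyCounts` class is hypothetical and it tokenizes by whitespace only, which the real tokenizer does not.

```ruby
# Hypothetical sketch of the bookkeeping done by training. It is not the
# library's class and it tokenizes by whitespace only.
class TinyCounts
  attr_reader :word_counts, :doc_counts

  def initialize
    # word => { category => occurrences }
    @word_counts = Hash.new { |h, word| h[word] = Hash.new(0) }
    # category => number of trained documents
    @doc_counts = Hash.new(0)
  end

  def train(category, text)
    text.downcase.split.each { |word| @word_counts[word][category] += 1 }
    @doc_counts[category] += 1
  end
end

counts = TinyCounts.new
counts.train(:dog, "dog loves dog")
counts.train(:cat, "cat ignores dog")
```

After these two calls, `counts.word_counts["dog"]` holds 2 occurrences for `:dog` and 1 for `:cat`, and each category has one trained document.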
And finally, classifying stuff:
```ruby
cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat
cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?")
#=> :dog
cls.classify("What pet will I love more?")
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.")
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
```
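`classify` also takes an optional default and honours two settings defined in `base.rb`: `min_prob` and per-category `thresholds`. The sketch below mirrors that decision rule on precomputed scores so it runs without the gem; `pick_category` is a hypothetical name and the scores are invented.

```ruby
# Standalone sketch of the decision rule; scores are [category, score] pairs.
def pick_category(scores, min_prob: 0.0, thresholds: {}, default: nil)
  best = nil
  max_prob = min_prob
  scores.each do |cat, prob|
    if prob > max_prob
      max_prob = prob
      best = cat
    end
  end
  return default unless best

  # The winner must beat every other category by its threshold factor.
  threshold = thresholds[best] || 1.0
  scores.each do |cat, prob|
    next if cat == best
    return default if prob * threshold > max_prob
  end
  best
end

clear = pick_category([[:spam, 0.73], [:ham, 0.40]], thresholds: { spam: 1.2 })
close = pick_category([[:spam, 0.80], [:ham, 0.70]], thresholds: { spam: 1.2 }, default: :unknown)
faint = pick_category([[:cat, 0.0005], [:dog, 0.0001]], min_prob: 0.001)
```

The first call returns `:spam`, the second falls back to `:unknown` because `:ham` is too close, and the third returns `nil` because no score exceeds `min_prob`.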
## Persistency
The following layers for saving the training data between sessions are
implemented:
- in memory (by default)
- on disk
- Redis
- (coming soon) in an RDBMS
To persist the data in Redis, you can do this:
```ruby
# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)
# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})
```
To persist the data on disk, you can do this:
```ruby
store = StuffClassifier::FileStorage.new(@storage_path)
# global setting
StuffClassifier::Base.storage = store
# or alternative local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)
# after training is done, to persist the data ...
cls.save_state
# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
# train here; save_state is called automatically when the block returns
end
# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
```
The name you give your classifier is important, because the data is
loaded and saved based on it. For instance, the following 3 classifiers
will be stored in different buckets, independent of each other.
```ruby
cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")
```
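Conceptually, a storage keeps one hash keyed by classifier name and serializes it with `Marshal` (see `storage.rb`). A minimal sketch of that round trip, using invented state hashes and a temp file instead of the real classes:

```ruby
require 'tempfile'

# Sketch only: the state hashes are invented stand-ins for training data.
buckets = {
  "Cats or Dogs"  => { training_count: 9 },
  "True or False" => { training_count: 0 }
}

file = Tempfile.new('classifier-state')
file.binmode
file.write(Marshal.dump(buckets)) # same serialization format FileStorage relies on
file.close

restored = Marshal.load(File.binread(file.path))
file.unlink
```

Each name maps to its own bucket, so the restored hash returns the state for `"Cats or Dogs"` without touching the others. Since `Marshal.load` can instantiate arbitrary objects, only load state files you wrote yourself.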
## License
MIT Licensed. See LICENSE.txt for details.
================================================
FILE: Rakefile
================================================
require 'bundler/setup'
require 'rake/testtask'
require 'stuff-classifier'
Rake::TestTask.new(:test) do |test|
test.libs << 'lib' << 'test'
test.pattern = 'test/**/test_*.rb'
test.verbose = true
end
task :default => :test
================================================
FILE: lib/stuff-classifier/base.rb
================================================
# -*- encoding : utf-8 -*-
class StuffClassifier::Base
extend StuffClassifier::Storage::ActAsStorable
attr_reader :name
attr_reader :word_list
attr_reader :category_list
attr_reader :training_count
attr_accessor :tokenizer
attr_accessor :language
attr_accessor :thresholds
attr_accessor :min_prob
storable :version,:word_list,:category_list,:training_count,:thresholds,:min_prob
# opts :
# language
# stemming : true | false
# weight
# assumed_prob
# thresholds
# min_prob
# storage
# purge_state : true | false
def initialize(name, opts={})
@version = StuffClassifier::VERSION
@name = name
# These values start out empty or are loaded from storage
@word_list = {}
@category_list = {}
@training_count=0
# storage
purge_state = opts[:purge_state]
@storage = opts[:storage] || StuffClassifier::Base.storage
unless purge_state
@storage.load_state(self)
else
@storage.purge_state(self)
end
# These values can be set during initialization or overridden after load_state
@thresholds = opts[:thresholds] || {}
@min_prob = opts[:min_prob] || 0.0
@ignore_words = nil
@tokenizer = StuffClassifier::Tokenizer.new(opts)
end
def incr_word(word, category)
@word_list[word] ||= {}
@word_list[word][:categories] ||= {}
@word_list[word][:categories][category] ||= 0
@word_list[word][:categories][category] += 1
@word_list[word][:_total_word] ||= 0
@word_list[word][:_total_word] += 1
# word count by category
@category_list[category] ||= {}
@category_list[category][:_total_word] ||= 0
@category_list[category][:_total_word] += 1
end
def incr_cat(category)
@category_list[category] ||= {}
@category_list[category][:_count] ||= 0
@category_list[category][:_count] += 1
@training_count ||= 0
@training_count += 1
end
# return number of times the word appears in a category
def word_count(word, category)
return 0.0 unless @word_list[word] && @word_list[word][:categories] && @word_list[word][:categories][category]
@word_list[word][:categories][category].to_f
end
# return the number of times the word appears in all categories
def total_word_count(word)
return 0.0 unless @word_list[word] && @word_list[word][:_total_word]
@word_list[word][:_total_word].to_f
end
# return the number of words in a category
def total_word_count_in_cat(cat)
return 0.0 unless @category_list[cat] && @category_list[cat][:_total_word]
@category_list[cat][:_total_word].to_f
end
# return the number of training items
def total_cat_count
@training_count
end
# return the number of training documents for a category
def cat_count(category)
# guard against categories that were never trained
return 0.0 unless @category_list[category] && @category_list[category][:_count]
@category_list[category][:_count].to_f
end
# return the number of categories in which a word appears
def categories_with_word_count(word)
return 0 unless @word_list[word] && @word_list[word][:categories]
@word_list[word][:categories].length
end
# return the number of categories
def total_categories
categories.length
end
# return categories list
def categories
@category_list.keys
end
# train the classifier
def train(category, text)
@tokenizer.each_word(text) {|w| incr_word(w, category) }
incr_cat(category)
end
# classify a text
def classify(text, default=nil)
# Find the category with the highest probability
max_prob = @min_prob
best = nil
scores = cat_scores(text)
scores.each do |score|
cat, prob = score
if prob > max_prob
max_prob = prob
best = cat
end
end
# Return the default category in case the threshold condition was
# not met. For example, if the threshold for :spam is 1.2
#
# :spam => 0.73, :ham => 0.40 (OK)
# :spam => 0.80, :ham => 0.70 (Fail, :ham is too close)
return default unless best
threshold = @thresholds[best] || 1.0
scores.each do |score|
cat, prob = score
next if cat == best
return default if prob * threshold > max_prob
end
return best
end
def save_state
@storage.save_state(self)
end
class << self
attr_writer :storage
def storage
@storage = StuffClassifier::InMemoryStorage.new unless defined? @storage
@storage
end
def open(name)
inst = self.new(name)
if block_given?
yield inst
inst.save_state
else
inst
end
end
end
end
================================================
FILE: lib/stuff-classifier/bayes.rb
================================================
# -*- encoding : utf-8 -*-
class StuffClassifier::Bayes < StuffClassifier::Base
attr_accessor :weight
attr_accessor :assumed_prob
# http://en.wikipedia.org/wiki/Naive_Bayes_classifier
extend StuffClassifier::Storage::ActAsStorable
storable :weight,:assumed_prob
def initialize(name, opts={})
super(name, opts)
@weight = opts[:weight] || 1.0
@assumed_prob = opts[:assumed_prob] || 0.1
end
def word_prob(word, cat)
total_words_in_cat = total_word_count_in_cat(cat)
return 0.0 if total_words_in_cat == 0
word_count(word, cat).to_f / total_words_in_cat
end
def word_weighted_average(word, cat, opts={})
func = opts[:func]
# calculate current probability
basic_prob = func ? func.call(word, cat) : word_prob(word, cat)
# count the number of times this word has appeared in all
# categories
totals = total_word_count(word)
# the final weighted average
(@weight * @assumed_prob + totals * basic_prob) / (@weight + totals)
end
def doc_prob(text, category)
@tokenizer.each_word(text).map {|w|
word_weighted_average(w, category)
}.inject(1) {|p,c| p * c}
end
def text_prob(text, category)
cat_prob = cat_count(category) / total_cat_count
doc_prob = doc_prob(text, category)
cat_prob * doc_prob
end
def cat_scores(text)
probs = {}
categories.each do |cat|
probs[cat] = text_prob(text, cat)
end
probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
end
def word_classification_detail(word)
p "word_prob"
result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
p result
p "word_weighted_average"
result=categories.inject({}) do |h,cat| h[cat]=word_weighted_average(word,cat);h end
p result
p "doc_prob"
result=categories.inject({}) do |h,cat| h[cat]=doc_prob(word,cat);h end
p result
p "text_prob"
result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
p result
end
end
================================================
FILE: lib/stuff-classifier/storage.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier
class Storage
module ActAsStorable
def storable(*to_store)
@to_store = to_store
end
def to_store
@to_store || []
end
end
attr_accessor :storage
def initialize(*opts)
@storage = {}
end
def storage_to_classifier(classifier)
if @storage.key? classifier.name
@storage[classifier.name].each do |var,value|
classifier.instance_variable_set "@#{var}",value
end
end
end
def classifier_to_storage(classifier)
to_store = classifier.class.to_store + classifier.class.superclass.to_store
@storage[classifier.name] = to_store.inject({}) {|h,var| h[var] = classifier.instance_variable_get("@#{var}");h}
end
def clear_storage(classifier)
@storage.delete(classifier.name)
end
end
class InMemoryStorage < Storage
def initialize
super
end
def load_state(classifier)
storage_to_classifier(classifier)
end
def save_state(classifier)
classifier_to_storage(classifier)
end
def purge_state(classifier)
clear_storage(classifier)
end
end
class FileStorage < Storage
def initialize(path)
super
@path = path
end
def load_state(classifier)
if @storage.length == 0 && File.exist?(@path)
data = File.open(@path, 'rb') { |f| f.read }
@storage = Marshal.load(data)
end
storage_to_classifier(classifier)
end
def save_state(classifier)
classifier_to_storage(classifier)
_write_to_file
end
def purge_state(classifier)
clear_storage(classifier)
_write_to_file
end
def _write_to_file
File.open(@path, 'wb') do |fh|
fh.flock(File::LOCK_EX)
fh.write(Marshal.dump(@storage))
end
end
end
class RedisStorage < Storage
def initialize(key, redis_options=nil)
super
@key = key
@redis = Redis.new(redis_options || {})
end
def load_state(classifier)
if @storage.length == 0
# get returns nil for a missing key, so no separate exists check is needed
data = @redis.get(@key)
@storage = Marshal.load(data) if data
end
storage_to_classifier(classifier)
end
def save_state(classifier)
classifier_to_storage(classifier)
_write_to_redis
end
def purge_state(classifier)
clear_storage(classifier)
_write_to_redis
end
private
def _write_to_redis
data = Marshal.dump(@storage)
@redis.set(@key, data)
end
end
end
================================================
FILE: lib/stuff-classifier/tf-idf.rb
================================================
# -*- encoding : utf-8 -*-
class StuffClassifier::TfIdf < StuffClassifier::Base
extend StuffClassifier::Storage::ActAsStorable
def initialize(name, opts={})
super(name, opts)
end
def word_prob(word, cat)
word_cat_nr = word_count(word, cat)
cat_nr = cat_count(cat)
tf = 1.0 * word_cat_nr / cat_nr
idf = Math.log10((total_categories + 2) / (categories_with_word_count(word) + 1.0))
tf * idf
end
def text_prob(text, cat)
@tokenizer.each_word(text).map{|w| word_prob(w, cat)}.inject(0){|s,p| s + p}
end
def cat_scores(text)
probs = {}
categories.each do |cat|
p = text_prob(text, cat)
probs[cat] = p
end
probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
end
def word_classification_detail(word)
p "tf_idf"
result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
p result
p "text_prob"
result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
p result
end
end
================================================
FILE: lib/stuff-classifier/tokenizer/tokenizer_properties.rb
================================================
# -*- encoding : utf-8 -*-
require 'set'
StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES = {
"en" => {
:preprocessing_regexps => {/['`]/ => '',/[_]/ => ' '},
:stop_word => Set.new([
'的','个','得',
'a', 'about', 'above', 'across', 'after', 'afterwards',
'again', 'against', 'all', 'almost', 'alone', 'along',
'already', 'also', 'although', 'always', 'am', 'among',
'amongst', 'amoungst', 'amount', 'an', 'and', 'another',
'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere',
'are', 'around', 'as', 'at', 'back', 'be',
'became', 'because', 'become', 'becomes', 'becoming', 'been',
'before', 'beforehand', 'behind', 'being', 'below', 'beside',
'besides', 'between', 'beyond', 'bill', 'both', 'bottom',
'but', 'by', 'call', 'can', 'cannot', 'cant', 'dont',
'co', 'computer', 'con', 'could', 'couldnt', 'cry',
'de', 'describe', 'detail', 'do', 'done', 'down',
'due', 'during', 'each', 'eg', 'eight', 'either',
'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every',
'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen',
'fify', 'fill', 'find', 'fire', 'first', 'five',
'for', 'former', 'formerly', 'forty', 'found', 'four',
'from', 'front', 'full', 'further', 'get', 'give',
'go', 'had', 'has', 'hasnt', 'have', 'he',
'hence', 'her', 'here', 'hereafter', 'hereby', 'herein',
'hereupon', 'hers', 'herself', 'him', 'himself', 'his',
'how', 'however', 'hundred', 'i', 'ie', 'if',
'in', 'inc', 'indeed', 'interest', 'into', 'is',
'it', 'its', 'itself', 'keep', 'last', 'latter',
'latterly', 'least', 'less', 'ltd', 'made', 'many',
'may', 'me', 'meanwhile', 'might', 'mill', 'mine',
'more', 'moreover', 'most', 'mostly', 'move', 'much',
'must', 'my', 'myself', 'name', 'namely', 'neither',
'never', 'nevertheless', 'next', 'nine', 'no', 'nobody',
'none', 'noone', 'nor', 'not', 'nothing', 'now',
'nowhere', 'of', 'off', 'often', 'on', 'once',
'one', 'only', 'onto', 'or', 'other', 'others',
'otherwise', 'our', 'ours', 'ourselves', 'out', 'over',
'own', 'part', 'per', 'perhaps', 'please', 'put',
'rather', 're', 'same', 'see', 'seem', 'seemed',
'seeming', 'seems', 'serious', 'several', 'she', 'should',
'show', 'side', 'since', 'sincere', 'six', 'sixty',
'so', 'some', 'somehow', 'someone', 'something', 'sometime',
'sometimes', 'somewhere', 'still', 'such', 'system', 'take',
'ten', 'than', 'that', 'the', 'their', 'them',
'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby',
'therefore', 'therein', 'thereupon', 'these', 'they', 'thick',
'thin', 'third', 'this', 'those', 'though', 'three',
'through', 'throughout', 'thru', 'thus', 'to', 'together',
'too', 'top', 'toward', 'towards', 'twelve', 'twenty',
'two', 'un', 'under', 'until', 'up', 'upon',
'us', 'very', 'via', 'was', 'we', 'well',
'were', 'what', 'whatever', 'when', 'whence', 'whenever',
'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
'wherever', 'whether', 'which', 'while', 'whither', 'who',
'whoever', 'whole', 'whom', 'whose', 'why', 'will',
'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours',
'yourself', 'yourselves'
])
},
"fr" => {
:stop_word => Set.new([
'au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux',
'il', 'je', 'la', 'le', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon',
'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa',
'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre',
'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées',
'étés', 'étant', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras',
'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient',
'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes',
'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût',
'fussions', 'fussiez', 'fussent', 'ayant', 'eu', 'eue', 'eues', 'eus', 'ai', 'as',
'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais',
'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient',
'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse',
'eusses', 'eût', 'eussions', 'eussiez', 'eussent', 'ceci', 'celà', 'cet', 'cette', 'ici',
'ils', 'les', 'leurs', 'quel', 'quels', 'quelle', 'quelles', 'sans', 'soi'
])
},
"de" => {
:stop_word => Set.new([
'aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere',
'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf',
'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das',
'daß', 'dass', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe',
'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du',
'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen',
'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es',
'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat',
'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres',
'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener',
'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche',
'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach',
'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst',
'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über',
'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst',
'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will',
'wir', 'wird', 'wirst', 'wo', 'wollen', 'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen'
])
}
}
================================================
FILE: lib/stuff-classifier/tokenizer.rb
================================================
# -*- encoding : utf-8 -*-
require "lingua/stemmer"
require "rseg"
class StuffClassifier::Tokenizer
require "stuff-classifier/tokenizer/tokenizer_properties"
def initialize(opts={})
@language = opts.key?(:language) ? opts[:language] : "en"
@properties = StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES[@language]
@stemming = opts.key?(:stemming) ? opts[:stemming] : true
if @stemming
@stemmer = Lingua::Stemmer.new(:language => @language)
end
end
def language
@language
end
def preprocessing_regexps=(value)
@preprocessing_regexps = value
end
def preprocessing_regexps
@preprocessing_regexps || @properties[:preprocessing_regexps]
end
def ignore_words=(value)
@ignore_words = value
end
def ignore_words
@ignore_words || @properties[:stop_word]
end
def stemming?
@stemming || false
end
def each_word(string)
string = string.strip
# return an empty list so callers can still chain map/inject
return [] if string == ''
words = []
# tokenize string
string.split("\n").each do |line|
# Apply preprocessing regexps
if preprocessing_regexps
preprocessing_regexps.each { |regexp,replace_by| line.gsub!(regexp, replace_by) }
end
Rseg.segment(line).each do |w|
next if w == '' || ignore_words.member?(w.downcase)
if stemming? and stemable?(w)
w = @stemmer.stem(w).downcase
next if ignore_words.member?(w)
else
w = w.downcase
end
words << (block_given? ? (yield w) : w)
end
end
return words
end
private
def stemable?(word)
# only purely alphabetic words are passed to the stemmer
word =~ /^\p{Alpha}+$/
end
end
================================================
FILE: lib/stuff-classifier/version.rb
================================================
module StuffClassifier
VERSION = '0.5'
end
================================================
FILE: lib/stuff-classifier.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier
autoload :VERSION, 'stuff-classifier/version'
autoload :Storage, 'stuff-classifier/storage'
autoload :InMemoryStorage, 'stuff-classifier/storage'
autoload :FileStorage, 'stuff-classifier/storage'
autoload :RedisStorage, 'stuff-classifier/storage'
autoload :Tokenizer, 'stuff-classifier/tokenizer'
autoload :TOKENIZER_PROPERTIES, 'stuff-classifier/tokenizer/tokenizer_properties'
autoload :Base, 'stuff-classifier/base'
autoload :Bayes, 'stuff-classifier/bayes'
autoload :TfIdf, 'stuff-classifier/tf-idf'
end
================================================
FILE: stuff-classifier.gemspec
================================================
# -*- encoding: utf-8 -*-
$:.push File.expand_path("../lib", __FILE__)
require "stuff-classifier/version"
Gem::Specification.new do |s|
s.name = "stuff-classifier"
s.version = StuffClassifier::VERSION
s.authors = ["Alexandru Nedelcu"]
s.email = ["github@contact.bionicspirit.com"]
s.homepage = "https://github.com/alexandru/stuff-classifier/"
s.summary = %q{Simple text classifier(s) implementation}
s.description = %q{2 methods are provided for now - (1) naive bayes implementation + (2) tf-idf weights}
s.files = `git ls-files`.split("\n")
s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
s.require_paths = ["lib"]
s.required_ruby_version = '>= 1.9.1'
s.add_runtime_dependency "ruby-stemmer"
s.add_runtime_dependency "sequel"
s.add_runtime_dependency "redis"
s.add_development_dependency "bundler"
s.add_development_dependency "rake", ">= 0.9.2"
s.add_development_dependency "minitest", "~> 4"
s.add_development_dependency "turn", ">= 0.8.3"
s.add_development_dependency "simplecov"
s.add_development_dependency "awesome_print"
s.add_development_dependency "ruby-debug19"
s.add_runtime_dependency "rseg"
end
================================================
FILE: test/helper.rb
================================================
# -*- encoding : utf-8 -*-
require 'simplecov'
SimpleCov.start
require 'turn'
require 'minitest/autorun'
require 'stuff-classifier'
Turn.config do |c|
# use one of output formats:
# :outline - turn's original case/test outline mode [default]
# :progress - indicates progress with progress bar
# :dotted - test/unit's traditional dot-progress mode
# :pretty - new pretty reporter
# :marshal - dump output as YAML (normal run mode only)
# :cue - interactive testing
c.format = :cue
# turn on invoke/execute tracing, enable full backtrace
c.trace = true
# use humanized test names (works only with :outline format)
c.natural = true
end
class TestBase < MiniTest::Unit::TestCase
def self.before(&block)
@on_setup = block if block
@on_setup
end
def setup
on_setup = self.class.before
instance_eval(&on_setup) if on_setup
end
def set_classifier(instance)
@classifier = instance
end
def classifier
@classifier
end
def train(category, value)
@classifier.train(category, value)
end
def should_be(category, value)
assert_equal category, @classifier.classify(value), value
end
end
================================================
FILE: test/test_001_tokenizer.rb
================================================
# -*- coding: utf-8 -*-
require 'helper'
class Test001Tokenizer < TestBase
before do
@en_tokenizer = StuffClassifier::Tokenizer.new
@fr_tokenizer = StuffClassifier::Tokenizer.new(:language => "fr")
end
def test_simple_tokens
words = @en_tokenizer.each_word('Hello world! How are you?')
should_return = ["hello", "world"]
assert_equal should_return, words
end
def test_with_stemming
words = @en_tokenizer.each_word('Lots of dogs, lots of cats! This really is the information highway')
should_return =["lot", "dog", "lot", "cat", "realli" ,"inform", "highway" ]
assert_equal should_return, words
end
def test_complicated_tokens
words = @en_tokenizer.each_word("I don't really get what you want to
accomplish. There is a class TestEval2, you can do test_eval2 =
TestEval2.new afterwards. And: class A ... end always yields nil, so
your output is ok I guess ;-)")
should_return = [
"realli", "want", "accomplish", "class",
"testeval2", "test", "eval2","testeval2", "new", "class", "end",
"yield", "nil", "output", "ok", "guess"]
assert_equal should_return, words
end
def test_unicode
words = @fr_tokenizer.each_word("il s'appelle le vilain petit canard : en référence à Hans Christian Andersen, se démarquer négativement")
should_return = [
"appel", "vilain", "pet", "canard", "référent",
"han", "christian", "andersen", "démarqu", "négat"]
assert_equal should_return, words
end
end
================================================
FILE: test/test_002_base.rb
================================================
require 'helper'
class Test002Base < TestBase
before do
@cls = StuffClassifier::Bayes.new("Cats or Dogs")
set_classifier @cls
train :dog, "Dogs are awesome, cats too. I love my dog"
train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"
train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
train :dog, "So which one should you choose? A dog, definitely."
train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
train :dog, "A dog will eat anything, including birds or whatever meat"
train :cat, "My cat's favorite place to purr is on my keyboard"
train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
end
def test_count
assert @cls.total_cat_count == 9
assert @cls.categories.map {|c| @cls.cat_count(c)}.inject(0){|s,count| s+count} == 9
# compare word count sum to word by cat count sum
assert @cls.word_list.map {|w| @cls.total_word_count(w[0]) }.inject(0) {|s,count| s+count} == 58
assert @cls.categories.map {|c| @cls.total_word_count_in_cat(c) }.inject(0){|s,count| s+count} == 58
# test word count by categories
assert @cls.word_list.map {|w| @cls.word_count(w[0],:dog) }.inject(0) {|s,count| s+count} == 29
assert @cls.word_list.map {|w| @cls.word_count(w[0],:cat) }.inject(0) {|s,count| s+count} == 29
# for all categories
assert @cls.categories.map {|c| @cls.word_list.map {|w| @cls.word_count(w[0],c) }.inject(0) {|s,count| s+count} }.inject(0){|s,count| s+count} == 58
end
end
================================================
FILE: test/test_003_naive_bayes.rb
================================================
require 'helper'
class Test003NaiveBayesClassification < TestBase
before do
set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
train :dog, "Dogs are awesome, cats too. I love my dog"
train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"
train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
train :dog, "So which one should you choose? A dog, definitely."
train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
train :dog, "A dog will eat anything, including birds or whatever meat"
train :cat, "My cat's favorite place to purr is on my keyboard"
train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
end
def test_for_cats
should_be :cat, "This test is about cats."
should_be :cat, "I hate ..."
should_be :cat, "The most annoying animal on earth."
should_be :cat, "The preferred company of software developers."
should_be :cat, "My precious, my favorite!"
should_be :cat, "Kill that bird!"
end
def test_for_dogs
should_be :dog, "This test is about dogs."
should_be :dog, "Cats or Dogs?"
should_be :dog, "What pet will I love more?"
should_be :dog, "Willy, where the heck are you?"
should_be :dog, "I like big buts and I cannot lie."
should_be :dog, "Why is the front door of our house open?"
should_be :dog, "Who ate my meat?"
end
def test_min_prob
classifier.min_prob = 0.001
should_be :cat, "This test is about cats."
should_be :cat, "I hate ..."
should_be nil, "The most annoying animal on earth."
should_be nil, "The preferred company of software developers."
should_be :cat, "My precious, my favorite!"
should_be :cat, "Kill that bird!"
should_be :dog, "This test is about dogs."
should_be :dog, "Cats or Dogs?"
should_be :dog, "What pet will I love more?"
should_be :dog, "Willy, where the heck are you?"
should_be nil, "I like big buts and I cannot lie."
should_be nil, "Why is the front door of our house open?"
should_be :dog, "Who ate my meat?"
end
end
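# Illustration only (not part of the original test file): test_min_prob above
# expects nil whenever no category scores above min_prob. The sketch below
# assumes the classifier simply compares the best score against that floor.
# pick_category and the score values are hypothetical, and the real logic in
# lib/stuff-classifier/base.rb may differ.

```ruby
# Hypothetical helper, not part of stuff-classifier: returns the category
# with the highest score, or `default` when no score exceeds min_prob.
def pick_category(scores, min_prob = 0.0, default = nil)
  # max_by on a Hash yields a [category, score] pair, or nil when empty
  best, best_score = scores.max_by { |_category, score| score }
  return default if best.nil? || best_score <= min_prob
  best
end

scores = { :cat => 0.0004, :dog => 0.0002 } # made-up scores for illustration

pick_category(scores)         # => :cat (no floor, highest score wins)
pick_category(scores, 0.001)  # => nil  (both scores fall below the floor)
```

# With a floor of 0.001, inputs whose best score is too small fall through to
# the default, which matches the `should_be nil` expectations in the test.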
================================================
FILE: test/test_004_tf_idf.rb
================================================
require 'helper'
class Test004TfIdfClassification < TestBase
before do
set_classifier StuffClassifier::TfIdf.new("Cats or Dogs")
train :dog, "Dogs are awesome, cats too. I love my dog"
train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog"
train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
train :dog, "So which one should you choose? A dog, definitely."
train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
train :dog, "A dog will eat anything, including birds or whatever meat"
train :cat, "My cat's favorite place to purr is on my keyboard"
train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
end
def test_for_cats
should_be :cat, "This test is about cats."
should_be :cat, "I hate ..."
should_be :cat, "The most annoying animal on earth."
should_be :cat, "The preferred company of software developers."
should_be :cat, "My precious, my favorite!"
should_be :cat, "Kill that bird!"
end
def test_for_dogs
should_be :dog, "This test is about dogs."
should_be :dog, "Cats or Dogs?"
should_be :dog, "What pet will I love more?"
should_be :dog, "Willy, where the heck are you?"
should_be :dog, "I like big buts and I cannot lie."
should_be :dog, "Why is the front door of our house open?"
should_be :dog, "Who is eating my meat?"
end
end
================================================
FILE: test/test_005_in_memory_storage.rb
================================================
require 'helper'
class Test005InMemoryStorage < TestBase
before do
StuffClassifier::Base.storage = StuffClassifier::InMemoryStorage.new
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
end
end
def test_for_persistance
test = self
StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
test.assert @storage.instance_of?(StuffClassifier::InMemoryStorage),
"@storage should be an instance of InMemoryStorage"
test.assert @word_list.length > 0, "Word count should be persisted"
test.assert @category_list.length > 0, "Category count should be persisted"
end
end
def test_purge_state
test = self
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
test.assert @word_list.length == 0, "Word count should be purged"
test.assert @category_list.length == 0, "Category count should be purged"
end
end
end
================================================
FILE: test/test_006_file_storage.rb
================================================
require 'helper'
class Test006FileStorage < TestBase
before do
@storage_path = "/tmp/test_classifier.db"
@storage = StuffClassifier::FileStorage.new(@storage_path)
StuffClassifier::Base.storage = @storage
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
end
# redefining storage instance, forcing it to read from file again
StuffClassifier::Base.storage = StuffClassifier::FileStorage.new(@storage_path)
end
def teardown
File.unlink @storage_path if File.exist? @storage_path
end
def test_result
set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
should_be :cat, "This test is about cats."
should_be :cat, "I hate ..."
should_be :cat, "The most annoying animal on earth."
should_be :cat, "The preferred company of software developers."
should_be :cat, "My precious, my favorite!"
should_be :cat, "Kill that bird!"
should_be :dog, "This test is about dogs."
should_be :dog, "Cats or Dogs?"
should_be :dog, "What pet will I love more?"
should_be :dog, "Willy, where the heck are you?"
should_be :dog, "I like big buts and I cannot lie."
should_be :dog, "Why is the front door of our house open?"
should_be :dog, "Who ate my meat?"
end
def test_for_persistance
assert ! @storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"
test = self
StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
test.assert @word_list.length > 0, "Word count should be persisted"
test.assert @category_list.length > 0, "Category count should be persisted"
end
end
def test_file_created
assert File.exist?(@storage_path), "File #@storage_path should exist"
content = File.read(@storage_path)
assert content.length > 100, "Serialized content should have more than 100 chars"
end
def test_purge_state
test = self
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
test.assert @word_list.length == 0, "Word count should be purged"
test.assert @category_list.length == 0, "Category count should be purged"
end
end
end
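# Illustration only (not part of the original test file): the before block
# above saves state through one storage instance and then creates a second
# instance so that the state has to be read back from disk. The sketch below
# shows that round trip with a hypothetical TinyFileStore that serializes
# with Marshal. This is an assumption for illustration, not the gem's
# FileStorage implementation.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a file-backed store.
class TinyFileStore
  def initialize(path)
    @path = path
  end

  # Serialize the whole state hash to the file.
  def save(state)
    File.open(@path, 'wb') { |f| f.write(Marshal.dump(state)) }
  end

  # Read the state back, or return an empty hash if nothing was saved yet.
  def load
    return {} unless File.exist?(@path)
    File.open(@path, 'rb') { |f| Marshal.load(f.read) }
  end
end

path = File.join(Dir.tmpdir, "tiny_store_example_#{Process.pid}.db")

TinyFileStore.new(path).save({ "Cats or Dogs" => { :word_list => { "dog" => 3 } } })

# A fresh instance shares no memory with the first one, so it must hit the file.
reloaded = TinyFileStore.new(path).load

File.unlink(path) # clean up, as the teardown above does
```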
================================================
FILE: test/test_007_redis_storage.rb
================================================
require 'helper'
require 'redis'
class Test007RedisStorage < TestBase
before do
@key = "test_classifier"
@redis_options = { host: 'localhost', port: 6379 }
@redis = Redis.new(@redis_options)
@storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
StuffClassifier::Base.storage = @storage
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:dog, "So which one should you choose? A dog, definitely.")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
end
# redefining storage instance, forcing it to read from Redis again
StuffClassifier::Base.storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
end
def teardown
@redis.del(@key)
end
def test_result
set_classifier StuffClassifier::Bayes.new("Cats or Dogs")
should_be :cat, "This test is about cats."
should_be :cat, "I hate ..."
should_be :cat, "The most annoying animal on earth."
should_be :cat, "The preferred company of software developers."
should_be :cat, "My precious, my favorite!"
should_be :cat, "Kill that bird!"
should_be :dog, "This test is about dogs."
should_be :dog, "Cats or Dogs?"
should_be :dog, "What pet will I love more?"
should_be :dog, "Willy, where the heck are you?"
should_be :dog, "I like big buts and I cannot lie."
should_be :dog, "Why is the front door of our house open?"
should_be :dog, "Who ate my meat?"
end
def test_for_persistance
assert !@storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"
test = self
StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
test.assert @word_list.length > 0, "Word count should be persisted"
test.assert @category_list.length > 0, "Category count should be persisted"
end
end
def test_key_created
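# exists? (redis-rb >= 4.2) returns a boolean; newer redis-rb versions return an Integer from exists, which is always truthy in Ruby
assert @redis.exists?(@key), "Redis key #{@key} should exist"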
content = @redis.get(@key)
assert content.length > 100, "Serialized content should have more than 100 chars"
end
def test_purge_state
test = self
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
test.assert @word_list.length == 0, "Word count should be purged"
test.assert @category_list.length == 0, "Category count should be purged"
end
end
end
SYMBOL INDEX (105 symbols across 15 files)
FILE: lib/stuff-classifier.rb
type StuffClassifier (line 2) | module StuffClassifier
FILE: lib/stuff-classifier/base.rb
class StuffClassifier::Base (line 3) | class StuffClassifier::Base
method initialize (line 27) | def initialize(name, opts={})
method incr_word (line 56) | def incr_word(word, category)
method incr_cat (line 74) | def incr_cat(category)
method word_count (line 85) | def word_count(word, category)
method total_word_count (line 91) | def total_word_count(word)
method total_word_count_in_cat (line 97) | def total_word_count_in_cat(cat)
method total_cat_count (line 103) | def total_cat_count
method cat_count (line 108) | def cat_count(category)
method categories_with_word_count (line 113) | def categories_with_word_count(word)
method total_categories (line 119) | def total_categories
method categories (line 124) | def categories
method train (line 129) | def train(category, text)
method classify (line 135) | def classify(text, default=nil)
method save_state (line 168) | def save_state
method storage (line 175) | def storage
method open (line 180) | def open(name)
FILE: lib/stuff-classifier/bayes.rb
class StuffClassifier::Bayes (line 3) | class StuffClassifier::Bayes < StuffClassifier::Base
method initialize (line 12) | def initialize(name, opts={})
method word_prob (line 18) | def word_prob(word, cat)
method word_weighted_average (line 25) | def word_weighted_average(word, cat, opts={})
method doc_prob (line 39) | def doc_prob(text, category)
method text_prob (line 45) | def text_prob(text, category)
method cat_scores (line 51) | def cat_scores(text)
method word_classification_detail (line 60) | def word_classification_detail(word)
FILE: lib/stuff-classifier/storage.rb
type StuffClassifier (line 2) | module StuffClassifier
class Storage (line 4) | class Storage
type ActAsStorable (line 5) | module ActAsStorable
function storable (line 6) | def storable(*to_store)
function to_store (line 9) | def to_store
method initialize (line 16) | def initialize(*opts)
method storage_to_classifier (line 20) | def storage_to_classifier(classifier)
method classifier_to_storage (line 28) | def classifier_to_storage(classifier)
method clear_storage (line 33) | def clear_storage(classifier)
class InMemoryStorage (line 39) | class InMemoryStorage < Storage
method initialize (line 40) | def initialize
method load_state (line 44) | def load_state(classifier)
method save_state (line 48) | def save_state(classifier)
method purge_state (line 52) | def purge_state(classifier)
class FileStorage (line 58) | class FileStorage < Storage
method initialize (line 59) | def initialize(path)
method load_state (line 64) | def load_state(classifier)
method save_state (line 72) | def save_state(classifier)
method purge_state (line 77) | def purge_state(classifier)
method _write_to_file (line 82) | def _write_to_file
class RedisStorage (line 91) | class RedisStorage < Storage
method initialize (line 92) | def initialize(key, redis_options=nil)
method load_state (line 98) | def load_state(classifier)
method save_state (line 106) | def save_state(classifier)
method purge_state (line 111) | def purge_state(classifier)
method _write_to_redis (line 117) | def _write_to_redis
FILE: lib/stuff-classifier/tf-idf.rb
class StuffClassifier::TfIdf (line 2) | class StuffClassifier::TfIdf < StuffClassifier::Base
method initialize (line 5) | def initialize(name, opts={})
method word_prob (line 10) | def word_prob(word, cat)
method text_prob (line 20) | def text_prob(text, cat)
method cat_scores (line 24) | def cat_scores(text)
method word_classification_detail (line 33) | def word_classification_detail(word)
FILE: lib/stuff-classifier/tokenizer.rb
class StuffClassifier::Tokenizer (line 5) | class StuffClassifier::Tokenizer
method initialize (line 8) | def initialize(opts={})
method language (line 18) | def language
method preprocessing_regexps= (line 22) | def preprocessing_regexps=(value)
method preprocessing_regexps (line 26) | def preprocessing_regexps
method ignore_words= (line 30) | def ignore_words=(value)
method ignore_words (line 34) | def ignore_words
method stemming? (line 38) | def stemming?
method each_word (line 42) | def each_word(string)
method stemable? (line 75) | def stemable?(word)
FILE: lib/stuff-classifier/version.rb
type StuffClassifier (line 1) | module StuffClassifier
FILE: test/helper.rb
class TestBase (line 24) | class TestBase < MiniTest::Unit::TestCase
method before (line 25) | def self.before(&block)
method setup (line 30) | def setup
method set_classifier (line 35) | def set_classifier(instance)
method classifier (line 38) | def classifier
method train (line 43) | def train(category, value)
method should_be (line 47) | def should_be(category, value)
FILE: test/test_001_tokenizer.rb
class Test001Tokenizer (line 4) | class Test001Tokenizer < TestBase
method test_simple_tokens (line 10) | def test_simple_tokens
method test_with_stemming (line 17) | def test_with_stemming
method test_complicated_tokens (line 25) | def test_complicated_tokens
method test_unicode (line 39) | def test_unicode
FILE: test/test_002_base.rb
class Test002Base (line 4) | class Test002Base < TestBase
method test_count (line 20) | def test_count
FILE: test/test_003_naive_bayes.rb
class Test003NaiveBayesClassification (line 4) | class Test003NaiveBayesClassification < TestBase
method test_for_cats (line 19) | def test_for_cats
method test_for_dogs (line 28) | def test_for_dogs
method test_min_prob (line 38) | def test_min_prob
FILE: test/test_004_tf_idf.rb
class Test004TfIdfClassification (line 4) | class Test004TfIdfClassification < TestBase
method test_for_cats (line 19) | def test_for_cats
method test_for_dogs (line 28) | def test_for_dogs
FILE: test/test_005_in_memory_storage.rb
class Test005InMemoryStorage (line 4) | class Test005InMemoryStorage < TestBase
method test_for_persistance (line 14) | def test_for_persistance
method test_purge_state (line 24) | def test_purge_state
FILE: test/test_006_file_storage.rb
class Test006FileStorage (line 4) | class Test006FileStorage < TestBase
method teardown (line 27) | def teardown
method test_result (line 31) | def test_result
method test_for_persistance (line 51) | def test_for_persistance
method test_file_created (line 62) | def test_file_created
method test_purge_state (line 69) | def test_purge_state
FILE: test/test_007_redis_storage.rb
class Test007RedisStorage (line 5) | class Test007RedisStorage < TestBase
method teardown (line 31) | def teardown
method test_result (line 35) | def test_result
method test_for_persistance (line 55) | def test_for_persistance
method test_key_created (line 66) | def test_key_created
method test_purge_state (line 73) | def test_purge_state