Repository: alexandru/stuff-classifier
Branch: master
Commit: eceef3207ef0
Files: 22
Total size: 44.4 KB

Directory structure:
gitextract_9kt2n9by/
├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── lib/
│   ├── stuff-classifier/
│   │   ├── base.rb
│   │   ├── bayes.rb
│   │   ├── storage.rb
│   │   ├── tf-idf.rb
│   │   ├── tokenizer/
│   │   │   └── tokenizer_properties.rb
│   │   ├── tokenizer.rb
│   │   └── version.rb
│   └── stuff-classifier.rb
├── stuff-classifier.gemspec
└── test/
    ├── helper.rb
    ├── test_001_tokenizer.rb
    ├── test_002_base.rb
    ├── test_003_naive_bayes.rb
    ├── test_004_tf_idf.rb
    ├── test_005_in_memory_storage.rb
    ├── test_006_file_storage.rb
    └── test_007_redis_storage.rb

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.rvmrc
coverage/
.DS_Store
*.gem
utils.rb
Gemfile.lock

================================================
FILE: Gemfile
================================================
source "http://rubygems.org"
gemspec

================================================
FILE: LICENSE.txt
================================================
Copyright (c) 2012 Alexandru Nedelcu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# stuff-classifier

## No longer maintained

This repository has not been maintained for some time. If you're interested in maintaining a fork, contact the author so that I can place a link here.

## Description

A library for classifying text into multiple categories.

Currently provided classifiers:

- a [naive bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- a classifier based on [tf-idf weights](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)

I ran a benchmark of 1345 items that I had previously classified manually with multiple categories. Here's the rate at which the 2 algorithms correctly detected one of those categories:

- Bayes: 79.26%
- Tf-Idf: 81.34%

I prefer the Naive Bayes approach because, despite its lower score on this benchmark, it seems to make better decisions than I did in many cases. For example, an item with title *"Paintball Session, 100 Balls and Equipment"* was classified as *"Activities"* by me, but the bayes classifier identified it as *"Sports"*, at which point I had an intellectual orgasm. Also, the Tf-Idf classifier seems to do better on clear-cut cases, but doesn't seem to handle uncertainty so well. Of course, these are just quick tests I made and I have no idea which is really better.

## Install

```bash
gem install stuff-classifier
```

## Usage

You either instantiate one class or the other.
Both have the same signature:

```ruby
require 'stuff-classifier'

# for the naive bayes implementation
cls = StuffClassifier::Bayes.new("Cats or Dogs")

# for the Tf-Idf based implementation
cls = StuffClassifier::TfIdf.new("Cats or Dogs")

# these classifiers use word stemming by default, but if it behaves
# oddly, you can disable it on init:
cls = StuffClassifier::TfIdf.new("Cats or Dogs", :stemming => false)

# also by default, the parsing phase filters out stop words; to
# disable this or to supply your own list of stop words, set the list
# on the classifier's tokenizer (ignore_words= is defined there):
cls.tokenizer.ignore_words = [ 'the', 'my', 'i', 'dont' ]
```

Training the classifier:

```ruby
cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
cls.train(:dog, "So which one should you choose?
A dog, definitely.")
cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
```

And finally, classifying stuff:

```ruby
cls.classify("This test is about cats.")
#=> :cat
cls.classify("I hate ...")
#=> :cat
cls.classify("The most annoying animal on earth.")
#=> :cat
cls.classify("The preferred company of software developers.")
#=> :cat
cls.classify("My precious, my favorite!")
#=> :cat
cls.classify("Get off my keyboard!")
#=> :cat
cls.classify("Kill that bird!")
#=> :cat

cls.classify("This test is about dogs.")
#=> :dog
cls.classify("Cats or Dogs?")
#=> :dog
cls.classify("What pet will I love more?")
#=> :dog
cls.classify("Willy, where the heck are you?")
#=> :dog
cls.classify("I like big buts and I cannot lie.")
#=> :dog
cls.classify("Why is the front door of our house open?")
#=> :dog
cls.classify("Who is eating my meat?")
#=> :dog
```

## Persistency

The following layers for saving the training data between sessions are implemented:

- in memory (by default)
- on disk
- Redis
- (coming soon) in an RDBMS

To persist the data in Redis, you can do this:

```ruby
# defaults to redis running on localhost on default port
store = StuffClassifier::RedisStorage.new(@key)

# pass in connection args
store = StuffClassifier::RedisStorage.new(@key, {host:'my.redis.server.com', port: 4829})
```

To persist the data on disk, you can do this:

```ruby
store = StuffClassifier::FileStorage.new(@storage_path)

# global setting
StuffClassifier::Base.storage = store

# or, alternatively, a local setting on instantiation, by means of an
# optional param ...
cls = StuffClassifier::Bayes.new("Cats or Dogs", :storage => store)

# after training is done, to persist the data ...
cls.save_state

# or you could just do this:
StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
  # when the block ends, save_state is called
end

# to start fresh, deleting the saved training data for this classifier
StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true)
```

The name you give your classifier is important, because the data is loaded and saved under that name. For instance, the following 3 classifiers will be stored in different buckets, being independent of each other.

```ruby
cls1 = StuffClassifier::Bayes.new("Cats or Dogs")
cls2 = StuffClassifier::Bayes.new("True or False")
cls3 = StuffClassifier::Bayes.new("Spam or Ham")
```

## License

MIT Licensed. See LICENSE.txt for details.

================================================
FILE: Rakefile
================================================
require 'bundler/setup'
require 'rake/testtask'
require 'stuff-classifier'

Rake::TestTask.new(:test) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/**/test_*.rb'
  test.verbose = true
end

task :default => :test

================================================
FILE: lib/stuff-classifier/base.rb
================================================
# -*- encoding : utf-8 -*-

class StuffClassifier::Base
  extend StuffClassifier::Storage::ActAsStorable
  attr_reader :name
  attr_reader :word_list
  attr_reader :category_list
  attr_reader :training_count

  attr_accessor :tokenizer
  attr_accessor :language

  attr_accessor :thresholds
  attr_accessor :min_prob

  storable :version,:word_list,:category_list,:training_count,:thresholds,:min_prob

  # opts :
  # language
  # stemming : true | false
  # weight
  # assumed_prob
  # storage
  # purge_state ?
  def initialize(name, opts={})
    @version = StuffClassifier::VERSION

    @name = name

    # These values are nil or are loaded from storage
    @word_list = {}
    @category_list = {}
    @training_count=0

    # storage
    purge_state = opts[:purge_state]
    @storage = opts[:storage] || StuffClassifier::Base.storage
    unless purge_state
      @storage.load_state(self)
    else
      @storage.purge_state(self)
    end

    # These values can be set during initialization or overridden after load_state
    @thresholds = opts[:thresholds] || {}
    @min_prob = opts[:min_prob] || 0.0

    @ignore_words = nil
    @tokenizer = StuffClassifier::Tokenizer.new(opts)
  end

  def incr_word(word, category)
    @word_list[word] ||= {}

    @word_list[word][:categories] ||= {}
    @word_list[word][:categories][category] ||= 0
    @word_list[word][:categories][category] += 1

    @word_list[word][:_total_word] ||= 0
    @word_list[word][:_total_word] += 1

    # word count by category
    @category_list[category] ||= {}
    @category_list[category][:_total_word] ||= 0
    @category_list[category][:_total_word] += 1
  end

  def incr_cat(category)
    @category_list[category] ||= {}
    @category_list[category][:_count] ||= 0
    @category_list[category][:_count] += 1

    @training_count ||= 0
    @training_count += 1
  end

  # return the number of times the word appears in a category
  def word_count(word, category)
    return 0.0 unless @word_list[word] && @word_list[word][:categories] && @word_list[word][:categories][category]
    @word_list[word][:categories][category].to_f
  end

  # return the number of times the word appears in all categories
  def total_word_count(word)
    return 0.0 unless @word_list[word] && @word_list[word][:_total_word]
    @word_list[word][:_total_word].to_f
  end

  # return the number of words in a category
  def total_word_count_in_cat(cat)
    return 0.0 unless @category_list[cat] && @category_list[cat][:_total_word]
    @category_list[cat][:_total_word].to_f
  end

  # return the number of training items
  def total_cat_count
    @training_count
  end

  # return the number of training documents for a category
  def cat_count(category)
    @category_list[category][:_count] ? @category_list[category][:_count].to_f : 0.0
  end

  # return the number of categories in which a word appears
  def categories_with_word_count(word)
    return 0 unless @word_list[word] && @word_list[word][:categories]
    @word_list[word][:categories].length
  end

  # return the number of categories
  def total_categories
    categories.length
  end

  # return the list of categories
  def categories
    @category_list.keys
  end

  # train the classifier
  def train(category, text)
    @tokenizer.each_word(text) {|w| incr_word(w, category) }
    incr_cat(category)
  end

  # classify a text
  def classify(text, default=nil)
    # Find the category with the highest probability
    max_prob = @min_prob
    best = nil

    scores = cat_scores(text)
    scores.each do |score|
      cat, prob = score
      if prob > max_prob
        max_prob = prob
        best = cat
      end
    end

    # Return the default category in case the threshold condition was
    # not met. For example, if the threshold for :spam is 1.2
    #
    # :spam => 0.73, :ham => 0.40 (OK)
    # :spam => 0.80, :ham => 0.70 (Fail, :ham is too close)

    return default unless best

    threshold = @thresholds[best] || 1.0

    scores.each do |score|
      cat, prob = score
      next if cat == best
      return default if prob * threshold > max_prob
    end

    return best
  end

  def save_state
    @storage.save_state(self)
  end

  class << self
    attr_writer :storage

    def storage
      @storage = StuffClassifier::InMemoryStorage.new unless defined? @storage
      @storage
    end

    def open(name)
      inst = self.new(name)
      if block_given?
        yield inst
        inst.save_state
      else
        inst
      end
    end
  end
end

================================================
FILE: lib/stuff-classifier/bayes.rb
================================================
# -*- encoding : utf-8 -*-

class StuffClassifier::Bayes < StuffClassifier::Base
  attr_accessor :weight
  attr_accessor :assumed_prob

  # http://en.wikipedia.org/wiki/Naive_Bayes_classifier

  extend StuffClassifier::Storage::ActAsStorable
  storable :weight,:assumed_prob

  def initialize(name, opts={})
    super(name, opts)
    @weight = opts[:weight] || 1.0
    @assumed_prob = opts[:assumed_prob] || 0.1
  end

  def word_prob(word, cat)
    total_words_in_cat = total_word_count_in_cat(cat)
    return 0.0 if total_words_in_cat == 0
    word_count(word, cat).to_f / total_words_in_cat
  end

  def word_weighted_average(word, cat, opts={})
    func = opts[:func]

    # calculate current probability
    basic_prob = func ? func.call(word, cat) : word_prob(word, cat)

    # count the number of times this word has appeared in all
    # categories
    totals = total_word_count(word)

    # the final weighted average
    (@weight * @assumed_prob + totals * basic_prob) / (@weight + totals)
  end

  def doc_prob(text, category)
    @tokenizer.each_word(text).map {|w|
      word_weighted_average(w, category)
    }.inject(1) {|p,c| p * c}
  end

  def text_prob(text, category)
    cat_prob = cat_count(category) / total_cat_count
    doc_prob = doc_prob(text, category)
    cat_prob * doc_prob
  end

  def cat_scores(text)
    probs = {}
    categories.each do |cat|
      probs[cat] = text_prob(text, cat)
    end
    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
  end

  def word_classification_detail(word)
    p "word_prob"
    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
    p result

    p "word_weighted_average"
    result=categories.inject({}) do |h,cat| h[cat]=word_weighted_average(word,cat);h end
    p result

    p "doc_prob"
    result=categories.inject({}) do |h,cat| h[cat]=doc_prob(word,cat);h end
    p result

    p "text_prob"
    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
    p result
  end

end
================================================
FILE: lib/stuff-classifier/storage.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier

  class Storage
    module ActAsStorable
      def storable(*to_store)
        @to_store = to_store
      end
      def to_store
        @to_store || []
      end
    end

    attr_accessor :storage

    def initialize(*opts)
      @storage = {}
    end

    def storage_to_classifier(classifier)
      if @storage.key? classifier.name
        @storage[classifier.name].each do |var,value|
          classifier.instance_variable_set "@#{var}",value
        end
      end
    end

    def classifier_to_storage(classifier)
      to_store = classifier.class.to_store + classifier.class.superclass.to_store
      @storage[classifier.name] = to_store.inject({}) {|h,var| h[var] = classifier.instance_variable_get("@#{var}");h}
    end

    def clear_storage(classifier)
      @storage.delete(classifier.name)
    end

  end

  class InMemoryStorage < Storage
    def initialize
      super
    end

    def load_state(classifier)
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
    end

    def purge_state(classifier)
      clear_storage(classifier)
    end

  end

  class FileStorage < Storage
    def initialize(path)
      super
      @path = path
    end

    def load_state(classifier)
      # File.exist? replaces the deprecated File.exists? alias
      if @storage.length == 0 && File.exist?(@path)
        data = File.open(@path, 'rb') { |f| f.read }
        @storage = Marshal.load(data)
      end
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
      _write_to_file
    end

    def purge_state(classifier)
      clear_storage(classifier)
      _write_to_file
    end

    def _write_to_file
      File.open(@path, 'wb') do |fh|
        fh.flock(File::LOCK_EX)
        fh.write(Marshal.dump(@storage))
      end
    end

  end

  class RedisStorage < Storage
    def initialize(key, redis_options=nil)
      super
      @key = key
      @redis = Redis.new(redis_options || {})
    end

    def load_state(classifier)
      if @storage.length == 0 && @redis.exists(@key)
        data = @redis.get(@key)
        @storage = Marshal.load(data)
      end
      storage_to_classifier(classifier)
    end

    def save_state(classifier)
      classifier_to_storage(classifier)
      _write_to_redis
    end

    def purge_state(classifier)
      clear_storage(classifier)
      _write_to_redis
    end

    private
    def _write_to_redis
      data = Marshal.dump(@storage)
      @redis.set(@key, data)
    end
  end
end

================================================
FILE: lib/stuff-classifier/tf-idf.rb
================================================
# -*- encoding : utf-8 -*-
class StuffClassifier::TfIdf < StuffClassifier::Base
  extend StuffClassifier::Storage::ActAsStorable

  def initialize(name, opts={})
    super(name, opts)
  end

  def word_prob(word, cat)
    word_cat_nr = word_count(word, cat)
    cat_nr = cat_count(cat)

    tf = 1.0 * word_cat_nr / cat_nr

    idf = Math.log10((total_categories + 2) / (categories_with_word_count(word) + 1.0))
    tf * idf
  end

  def text_prob(text, cat)
    @tokenizer.each_word(text).map{|w| word_prob(w, cat)}.inject(0){|s,p| s + p}
  end

  def cat_scores(text)
    probs = {}
    categories.each do |cat|
      p = text_prob(text, cat)
      probs[cat] = p
    end
    probs.map{|k,v| [k,v]}.sort{|a,b| b[1] <=> a[1]}
  end

  # prints with Kernel#p, as the Bayes classifier does; awesome_print's
  # `ap` is only a development dependency and is not required here
  def word_classification_detail(word)
    p "tf_idf"
    result=self.categories.inject({}) do |h,cat| h[cat]=self.word_prob(word,cat);h end
    p result

    p "text_prob"
    result=categories.inject({}) do |h,cat| h[cat]=text_prob(word,cat);h end
    p result
  end
end

================================================
FILE: lib/stuff-classifier/tokenizer/tokenizer_properties.rb
================================================
# -*- encoding : utf-8 -*-
require 'set'
StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES = {
  "en" => {
    :preprocessing_regexps => {/['`]/ => '',/[_]/ => ' '},
    :stop_word => Set.new([
      '的','个','得',
      'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against',
      'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always',
      'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another',
      'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around',
      'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes',
      'becoming', 'been', 'before', 'beforehand', 'behind',
'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'dont', 'co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 
'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves' ]) }, "fr" => { :stop_word => Set.new([ 'au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'je', 'la', 'le', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent', 'ceci', 'celà ', 'cet', 'cette', 'ici', 'ils', 'les', 'leurs', 'quel', 'quels', 'quelle', 'quelles', 'sans', 'soi' ]) }, "de" => { :stop_word => Set.new([ 'aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 
'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'daß', 'dass', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unse', 'unsem', 'unsen', 'unser', 'unses', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 'wirst', 'wo', 'wollen', 'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen' ]) } } ================================================ FILE: lib/stuff-classifier/tokenizer.rb 
================================================
# -*- encoding : utf-8 -*-

require "lingua/stemmer"
require "rseg"

class StuffClassifier::Tokenizer
  require "stuff-classifier/tokenizer/tokenizer_properties"

  def initialize(opts={})
    @language = opts.key?(:language) ? opts[:language] : "en"
    @properties = StuffClassifier::Tokenizer::TOKENIZER_PROPERTIES[@language]

    @stemming = opts.key?(:stemming) ? opts[:stemming] : true
    if @stemming
      @stemmer = Lingua::Stemmer.new(:language => @language)
    end
  end

  def language
    @language
  end

  def preprocessing_regexps=(value)
    @preprocessing_regexps = value
  end

  def preprocessing_regexps
    @preprocessing_regexps || @properties[:preprocessing_regexps]
  end

  def ignore_words=(value)
    @ignore_words = value
  end

  def ignore_words
    @ignore_words || @properties[:stop_word]
  end

  def stemming?
    @stemming || false
  end

  def each_word(string)
    string = string.strip
    return if string == ''

    words = []

    # tokenize string
    string.split("\n").each do |line|

      # Apply preprocessing regexps
      if preprocessing_regexps
        preprocessing_regexps.each { |regexp,replace_by| line.gsub!(regexp, replace_by) }
      end

      Rseg.segment(line).each do |w|
        next if w == '' || ignore_words.member?(w.downcase)

        if stemming? and stemable?(w)
          w = @stemmer.stem(w).downcase
          next if ignore_words.member?(w)
        else
          w = w.downcase
        end

        words << (block_given? ?
                  (yield w) : w)
      end
    end

    return words
  end

  private

  # only purely alphabetic tokens are passed to the stemmer
  def stemable?(word)
    word =~ /^\p{Alpha}+$/
  end

end

================================================
FILE: lib/stuff-classifier/version.rb
================================================
module StuffClassifier
  VERSION = '0.5'
end

================================================
FILE: lib/stuff-classifier.rb
================================================
# -*- encoding : utf-8 -*-
module StuffClassifier
  autoload :VERSION, 'stuff-classifier/version'

  autoload :Storage, 'stuff-classifier/storage'
  autoload :InMemoryStorage, 'stuff-classifier/storage'
  autoload :FileStorage, 'stuff-classifier/storage'
  autoload :RedisStorage, 'stuff-classifier/storage'

  autoload :Tokenizer, 'stuff-classifier/tokenizer'
  autoload :TOKENIZER_PROPERTIES, 'stuff-classifier/tokenizer/tokenizer_properties'

  autoload :Base, 'stuff-classifier/base'
  autoload :Bayes, 'stuff-classifier/bayes'
  autoload :TfIdf, 'stuff-classifier/tf-idf'

end

================================================
FILE: stuff-classifier.gemspec
================================================
# -*- encoding: utf-8 -*-
$:.push File.expand_path("../lib", __FILE__)
require "stuff-classifier/version"

Gem::Specification.new do |s|
  s.name = "stuff-classifier"
  s.version = StuffClassifier::VERSION
  s.authors = ["Alexandru Nedelcu"]
  s.email = ["github@contact.bionicspirit.com"]
  s.homepage = "https://github.com/alexandru/stuff-classifier/"
  s.summary = %q{Simple text classifier(s) implementation}
  s.description = %q{2 methods are provided for now - (1) naive bayes implementation + (2) tf-idf weights}

  s.files = `git ls-files`.split("\n")
  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
  s.require_paths = ["lib"]

  s.required_ruby_version = '>= 1.9.1'

  s.add_runtime_dependency "ruby-stemmer"
  s.add_runtime_dependency "sequel"
  s.add_runtime_dependency "redis"

  s.add_development_dependency "bundler"
  s.add_development_dependency "rake", ">= 0.9.2"
  s.add_development_dependency "minitest", "~> 4"
  s.add_development_dependency "turn", ">= 0.8.3"
  s.add_development_dependency "simplecov"
  s.add_development_dependency "awesome_print"
  s.add_development_dependency "ruby-debug19"
  s.add_development_dependency "rseg"
end

================================================
FILE: test/helper.rb
================================================
# -*- encoding : utf-8 -*-
require 'simplecov'
SimpleCov.start

require 'turn'
require 'minitest/autorun'
require 'stuff-classifier'

Turn.config do |c|
  # use one of output formats:
  # :outline - turn's original case/test outline mode [default]
  # :progress - indicates progress with progress bar
  # :dotted - test/unit's traditional dot-progress mode
  # :pretty - new pretty reporter
  # :marshal - dump output as YAML (normal run mode only)
  # :cue - interactive testing
  c.format = :cue
  # turn on invoke/execute tracing, enable full backtrace
  c.trace = true
  # use humanized test names (works only with :outline format)
  c.natural = true
end

class TestBase < MiniTest::Unit::TestCase
  def self.before(&block)
    @on_setup = block if block
    @on_setup
  end

  def setup
    on_setup = self.class.before
    instance_eval(&on_setup) if on_setup
  end

  def set_classifier(instance)
    @classifier = instance
  end

  def classifier
    @classifier
  end

  def train(category, value)
    @classifier.train(category, value)
  end

  def should_be(category, value)
    assert_equal category, @classifier.classify(value), value
  end
end

================================================
FILE: test/test_001_tokenizer.rb
================================================
# -*- coding: utf-8 -*-
# required by name, like the other test files, so that it resolves
# through the 'test' load path set in the Rakefile
require 'helper'

class Test001Tokenizer < TestBase
  before do
    @en_tokenizer = StuffClassifier::Tokenizer.new
    @fr_tokenizer = StuffClassifier::Tokenizer.new(:language => "fr")
  end

  def test_simple_tokens
    words = @en_tokenizer.each_word('Hello world!
How are you?') should_return = ["hello", "world"] assert_equal should_return, words end def test_with_stemming words = @en_tokenizer.each_word('Lots of dogs, lots of cats! This really is the information highway') should_return =["lot", "dog", "lot", "cat", "realli" ,"inform", "highway" ] assert_equal should_return, words end def test_complicated_tokens words = @en_tokenizer.each_word("I don't really get what you want to accomplish. There is a class TestEval2, you can do test_eval2 = TestEval2.new afterwards. And: class A ... end always yields nil, so your output is ok I guess ;-)") should_return = [ "realli", "want", "accomplish", "class", "testeval2", "test", "eval2","testeval2", "new", "class", "end", "yield", "nil", "output", "ok", "guess"] assert_equal should_return, words end def test_unicode words = @fr_tokenizer.each_word("il s'appelle le vilain petit canard : en référence à Hans Christian Andersen, se démarquer négativement") should_return = [ "appel", "vilain", "pet", "canard", "référent", "han", "christian", "andersen", "démarqu", "négat"] assert_equal should_return, words end end ================================================ FILE: test/test_002_base.rb ================================================ require 'helper' class Test002Base < TestBase before do @cls = StuffClassifier::Bayes.new("Cats or Dogs") set_classifier @cls train :dog, "Dogs are awesome, cats too. I love my dog" train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog" train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs" train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all" train :dog, "So which one should you choose? A dog, definitely." 
train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy" train :dog, "A dog will eat anything, including birds or whatever meat" train :cat, "My cat's favorite place to purr is on my keyboard" train :dog, "My dog's favorite place to take a leak is the tree in front of our house" end def test_count assert @cls.total_cat_count == 9 assert @cls.categories.map {|c| @cls.cat_count(c)}.inject(0){|s,count| s+count} == 9 # compare word count sum to word by cat count sum assert @cls.word_list.map {|w| @cls.total_word_count(w[0]) }.inject(0) {|s,count| s+count} == 58 assert @cls.categories.map {|c| @cls.total_word_count_in_cat(c) }.inject(0){|s,count| s+count} == 58 # test word count by categories assert @cls.word_list.map {|w| @cls.word_count(w[0],:dog) }.inject(0) {|s,count| s+count} == 29 assert @cls.word_list.map {|w| @cls.word_count(w[0],:cat) }.inject(0) {|s,count| s+count} == 29 # for all categories assert @cls.categories.map {|c| @cls.word_list.map {|w| @cls.word_count(w[0],c) }.inject(0) {|s,count| s+count} }.inject(0){|s,count| s+count} == 58 end end ================================================ FILE: test/test_003_naive_bayes.rb ================================================ require 'helper' class Test003NaiveBayesClassification < TestBase before do set_classifier StuffClassifier::Bayes.new("Cats or Dogs") train :dog, "Dogs are awesome, cats too. I love my dog" train :cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog" train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs" train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all" train :dog, "So which one should you choose? A dog, definitely." 
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_for_cats
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
  end

  def test_for_dogs
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie."
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end

  def test_min_prob
    classifier.min_prob = 0.001

    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be nil, "The most annoying animal on earth."
    should_be nil, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be nil, "I like big buts and I cannot lie."
    should_be nil, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end
end

================================================
FILE: test/test_004_tf_idf.rb
================================================
require 'helper'

class Test004TfIdfClassification < TestBase
  before do
    set_classifier StuffClassifier::TfIdf.new("Cats or Dogs")

    train :dog, "Dogs are awesome, cats too. I love my dog"
    train :cat, "Cats are more preferred by software developers. I never could stand cats.
I have a dog"
    train :dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs"
    train :cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all"
    train :dog, "So which one should you choose? A dog, definitely."
    train :cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy"
    train :dog, "A dog will eat anything, including birds or whatever meat"
    train :cat, "My cat's favorite place to purr is on my keyboard"
    train :dog, "My dog's favorite place to take a leak is the tree in front of our house"
  end

  def test_for_cats
    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
  end

  def test_for_dogs
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie."
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who is eating my meat?"
  end
end

================================================
FILE: test/test_005_in_memory_storage.rb
================================================
require 'helper'

class Test005InMemoryStorage < TestBase
  before do
    StuffClassifier::Base.storage = StuffClassifier::InMemoryStorage.new

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats.
I have a dog")
    end
  end

  def test_for_persistance
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::InMemoryStorage), "@storage should be an instance of InMemoryStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end

================================================
FILE: test/test_006_file_storage.rb
================================================
require 'helper'

class Test006FileStorage < TestBase
  before do
    @storage_path = "/tmp/test_classifier.db"
    @storage = StuffClassifier::FileStorage.new(@storage_path)
    StuffClassifier::Base.storage = @storage

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
      cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
      cls.train(:dog, "So which one should you choose? A dog, definitely.")
      cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
      cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
      cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
      cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
      cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats.
I have a dog")
    end

    # redefining storage instance, forcing it to read from file again
    StuffClassifier::Base.storage = StuffClassifier::FileStorage.new(@storage_path)
  end

  def teardown
    # File.exists? was removed in Ruby 3.2; File.exist? works on all versions
    File.unlink @storage_path if File.exist? @storage_path
  end

  def test_result
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")

    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie."
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end

  def test_for_persistance
    assert ! @storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"

    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_file_created
    assert File.exist?(@storage_path), "File #@storage_path should exist"

    content = File.read(@storage_path)
    assert content.length > 100, "Serialized content should have more than 100 chars"
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::FileStorage),"@storage should be an instance of FileStorage"
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end
================================================
FILE: test/test_007_redis_storage.rb
================================================
require 'helper'
require 'redis'

class Test007RedisStorage < TestBase
  before do
    @key = "test_classifier"
    @redis_options = { host: 'localhost', port: 6379 }
    @redis = Redis.new(@redis_options)

    @storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
    StuffClassifier::Base.storage = @storage

    StuffClassifier::Bayes.open("Cats or Dogs") do |cls|
      cls.train(:dog, "Dogs are awesome, cats too. I love my dog.")
      cls.train(:dog, "My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs")
      cls.train(:dog, "So which one should you choose? A dog, definitely.")
      cls.train(:dog, "A dog will eat anything, including birds or whatever meat")
      cls.train(:dog, "My dog's favorite place to take a leak is the tree in front of our house")
      cls.train(:cat, "My cat's favorite place to purr is on my keyboard")
      cls.train(:cat, "The favorite food for cats is bird meat, although mice are good, but birds are a delicacy")
      cls.train(:cat, "Cats are difficult animals, unlike dogs, really annoying, I hate them all")
      cls.train(:cat, "Cats are more preferred by software developers. I never could stand cats. I have a dog")
    end

    # redefining storage instance, forcing it to read from Redis again
    StuffClassifier::Base.storage = StuffClassifier::RedisStorage.new(@key, @redis_options)
  end

  def teardown
    @redis.del(@key)
  end

  def test_result
    set_classifier StuffClassifier::Bayes.new("Cats or Dogs")

    should_be :cat, "This test is about cats."
    should_be :cat, "I hate ..."
    should_be :cat, "The most annoying animal on earth."
    should_be :cat, "The preferred company of software developers."
    should_be :cat, "My precious, my favorite!"
    should_be :cat, "Kill that bird!"
    should_be :dog, "This test is about dogs."
    should_be :dog, "Cats or Dogs?"
    should_be :dog, "What pet will I love more?"
    should_be :dog, "Willy, where the heck are you?"
    should_be :dog, "I like big buts and I cannot lie."
    should_be :dog, "Why is the front door of our house open?"
    should_be :dog, "Who ate my meat?"
  end

  def test_for_persistance
    assert !@storage.equal?(StuffClassifier::Base.storage),"Storage instance should not be the same"

    test = self
    StuffClassifier::Bayes.new("Cats or Dogs").instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
      test.assert @word_list.length > 0, "Word count should be persisted"
      test.assert @category_list.length > 0, "Category count should be persisted"
    end
  end

  def test_key_created
    assert @redis.exists(@key), "Redis key #{@key} should exist"

    content = @redis.get(@key)
    assert content.length > 100, "Serialized content should have more than 100 chars"
  end

  def test_purge_state
    test = self
    StuffClassifier::Bayes.new("Cats or Dogs", :purge_state => true).instance_eval do
      test.assert @storage.instance_of?(StuffClassifier::RedisStorage),"@storage should be an instance of RedisStorage"
      test.assert @word_list.length == 0, "Word count should be purged"
      test.assert @category_list.length == 0, "Category count should be purged"
    end
  end
end