Repository: zyymax/text-similarity
Branch: master
Commit: e2c01e426cc4
Files: 17
Total size: 38.4 KB

Directory structure:
gitextract_wu61udf5/
├── .gitignore
├── README.md
├── data/
│   └── stopwords.txt
├── src/
│   ├── DictBuilder.py
│   ├── DictUtils.py
│   ├── DocUtils.py
│   ├── Utils.py
│   ├── __init__.py
│   ├── features.py
│   ├── isSimilar.py
│   ├── launch.py
│   ├── launch_incre.py
│   ├── preprocess.py
│   ├── simhash_imp.py
│   ├── tokens.py
│   └── webcontent_filter.sh
└── test/
    └── test_token.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Rope
.ropeproject

# Django stuff:
*.log
*.pot

# Sphinx documentation
docs/_build/

================================================
FILE: README.md
================================================
text-similarity
===============

By max.zhang@2013-11-06

Description: this project is a text similarity detection tool implemented in Python.

# Dependencies
* python
* python-jieba
* bash

# Directory layout
data folder
 -stopwords.txt (stopword list)
data/temp folder (holds intermediate result files and folders; every line of a file represents one document)
 -*.content raw text parsed from web pages (noisy)
 -*.ori raw text after preprocessing, ready for detection (denoised)
 -*.token Chinese word segmentation results
 -word.dict feature dictionary generated from the segmentation results
 -*.feat feature vector files
 -*.fprint Simhash fingerprint files
src/ folder: source code

# Usage
## Checking whether two documents are duplicates (integrated)
### Build the feature dictionary (preprocess.py)
brief: segment the raw text and add the result to the feature dictionary
INPUT: raw text + stopword list + feature dictionary
OUTPUT: saves the segmentation result to a .token file and updates the feature dictionary file
usage: src/preprocess.py <*.ori>
e.g. src/preprocess.py data/temp/doc1.ori data/stopwords.txt data/word.dict
{Note: run this once for each of the two documents to be compared, i.e. the segmentation results of both documents must be added to the feature dictionary}

### Check document duplication (isSimilar.py)
brief: decide whether two documents are duplicates
INPUT: document 1 + document 2 + stopword list + feature dictionary + mode + threshold
OUTPUT: prints whether the two documents are duplicates, together with their similarity
usage: src/isSimilar.py <-c/-s>
    -c/-s selects VSM + cosine distance or Simhash + Hamming distance for duplicate detection
e.g. src/isSimilar.py data/temp/doc1.ori data/temp/doc2.ori data/stopwords.txt data/word.dict -c 0.8

## Detailed pipeline (step by step)
### Denoising (webcontent_filter.sh)
brief: initial denoising of the raw text (removes special symbols, English letters, digits, ...), collapses consecutive spaces and deletes blank lines
INPUT: text to denoise (.content)
OUTPUT: denoised text (.ori)
usage: src/webcontent_filter.sh <*.content> <*.ori>
e.g. src/webcontent_filter.sh data/temp/all.content data/temp/all.ori

### Preprocessing
#### Chinese word segmentation (tokens.py)
brief: segment the denoised raw text with the Jieba tokenizer
INPUT: denoised text (.ori)
OUTPUT: Chinese word segmentation result (.token)
usage: ./tokens.py -s/-m <*.ori/inputfolder> <*.token/outputfolder> c/s[mode]
    -s[single]/-m[multiple] segment a single text file (*.ori) or a directory of text files
        -s <*.ori> <*.token>
        -m
        {Note: in -m mode the input file names should preferably end with .ori}
    c/s[mode] Jieba tokenizer mode
        c mode: jieba.cut(...)
        s mode: jieba.cut_for_search()
e.g. src/tokens.py -s data/temp/all.ori data/temp/all.token c data/stopwords.txt

#### Build the feature dictionary (DictBuilder.py)
brief: build a feature dictionary, sorted by descending term frequency, from a segmentation result file or directory
INPUT: Chinese word segmentation result (.token)
OUTPUT: the generated feature dictionary, in the format: ID + feature word + term frequency
usage: src/DictBuilder.py
e.g. src/DictBuilder.py data/temp/all.token data/temp/word.dict

#### Build feature vectors (features.py)
brief: build the feature vector file from the segmentation result and the feature dictionary
INPUT: the segmented text from step one + the feature dictionary generated in step two
OUTPUT: one feature vector per line, one line per document: id1:nonzero-tf id2:nonzero-tf ...
usage: src/features.py -s/-m
    -s[single]/-m[multiple] build feature vectors for a single segmentation file (*.token) or for a directory of segmentation files
e.g. src/features.py -s data/temp/word.dict data/temp/all.token data/temp/all.feat

#### Build Simhash fingerprints (simhash_imp.py)
brief: build the fingerprint file from the segmentation result and the feature dictionary
INPUT: feature dictionary + feature vector file
OUTPUT: fingerprint file
usage: src/simhash_imp.py <*.feat> <*.fprint>
e.g. src/simhash_imp.py data/temp/word.dict data/temp/all.feat data/temp/all.fprint

## Unit tests
cd test
python test_token.py

================================================
FILE: data/stopwords.txt
================================================
,
?
、。《》！，：；？人民末##末啊阿哎哎呀哎哟唉俺俺们按按照吧吧哒把罢了被本本着比比方比如鄙人彼彼此边别别的别说并并且不比不成不单不但不独不管不光不过不仅不拘不论不怕不然不如不特不惟不问不只朝朝着趁趁着乘冲除除此之外除非除了此此间此外从从而打待但但是当当着到得的的话等等等地第叮咚对对于多多少而而况而且而是而外而言而已尔后反过来反过来说反之非但非徒否则嘎嘎登该赶个各各个各位各种各自给根据跟故故此固然关于管归果然果真过哈哈哈呵和何何处何况何时嘿哼哼唷呼哧乎哗还是还有换句话说换言之或或是或者极了及及其及至即即便即或即令即若即使几几时己既既然既是继而加之假如假若假使鉴于将较较之叫接着结果借紧接着进而尽尽管经经过就就是就是说据具体地说具体说来开始开外靠咳可可见可是可以况且啦来来着离例如哩连连同两者了临另另外另一方面论嘛吗慢说漫说冒么每每当们莫若某某个某些拿哪哪边哪儿哪个哪里哪年哪怕哪天哪些哪样那那边那儿那个那会儿那里那么那么些那么样那时那些那样乃乃至呢能你你们您宁宁可宁肯宁愿哦呕啪达旁人呸凭凭借其其次其二其他其它其一其余其中起起见岂但恰恰相反前后前者且然而然后然则让人家任任何任凭如如此如果如何如其如若如上所述若若非若是啥上下尚且设若设使甚而甚么甚至省得时候什么什么样使得是是的首先谁谁知顺顺着似的虽虽然虽说虽则随随着所所以他他们他人它它们她她们倘倘或倘然倘若倘使腾替通过同同时哇万一往望为为何为了为什么为着喂嗡嗡我我们呜呜呼乌乎无论无宁毋宁嘻吓相对而言像向向着嘘呀焉沿沿着要要不要不然要不是要么要是也也罢也好一一般一旦一方面一来一切一样一则依依照矣以以便以及以免以至以至于以致抑或因因此因而因为哟用由由此可见由于有有的有关有些又于于是于是乎与与此同时与否与其越是云云哉再说再者在在下咱咱们则怎怎么怎么办怎么样怎样咋照照着者这这边这儿这个这会儿这就是说这里这么这么点儿这么些这么样这时这些这样正如吱之之类之所以之一只是只限只要只有至至于诸位着着呢自自从自个儿自各儿自己自家自身综上所述总的来看总的来说总的说来总而言之总之纵纵令纵然纵使遵照作为兮呃呗咚咦喏啐喔唷嗬嗯嗳 ~ ! . : ( ) * A 白社会主义 -- .. >> [ ] < > / \ | - _ + = & ^ % # @ ` ; $ （） —— — ￥ · ... 〉〈 … 　 0 1 2 3 4 5 6 7 8 9 二三四五六七八九零＞＜＠＃＄％︿＆＊＋～｜［］｛｝啊哈啊呀啊哟挨次挨个挨家挨户挨门挨户挨门逐户挨着按理按期按时按说暗地里暗中暗自昂然八成白白半梆保管保险饱背地里背靠背倍感倍加本人本身甭比起比如说比照毕竟必必定必将必须便别人并非并肩并没并没有并排并无勃然不不必不常不大不但...而且不得不得不不得了不得已不迭不定不对不妨不管怎样不会不仅...而且不仅仅不仅仅是不经意不可开交不可抗拒不力不了不料不满不免不能不不起不巧不然的话不日不少不胜不时不是不同不能不要不外不外乎不下不限不消不已不亦乐乎不由得不再不择手段不怎么不曾不知不觉不止不止一次不至于才才能策略地差不多差一点常常常常言道常言说常言说得好长此下去长话短说长期以来长线敞开儿彻夜陈年趁便趁机趁热趁势趁早成年成年累月成心乘机乘胜乘势乘隙乘虚诚然迟早充分充其极充其量抽冷子臭初出出来出去除此除此而外除此以外除开除去除却除外处处川流不息传传说传闻串行纯纯粹此后此中次第匆匆从不从此从此以后从古到今从古至今从今以后从宽从来从轻从速从头从未从无到有从小从新从严从优从早到晚从中从重凑巧粗存心达旦打从打开天窗说亮话大大不了大大大抵大都大多大凡大概大家大举大略大面儿上大事大体大体上大约大张旗鼓大致呆呆地带殆待到单单纯单单但愿弹指之间当场当儿当即当口儿当然当庭当头当下当真当中倒不如倒不如说倒是到处到底到了儿到目前为止到头到头来得起得天独厚的确等到叮当顶多定动不动动辄陡然都独独自断然顿时多次多多多多少少多多益善多亏多年来多年前而后而论而又尔等二话不说二话没说反倒反倒是反而反手反之亦然反之则方方才方能放量非常非得分期分期分批分头奋勇愤然风雨无阻逢弗甫嘎嘎该当概赶快赶早不赶晚敢敢情敢于刚刚才刚好刚巧高低格外隔日隔夜个人各式更更加更进一步更为公然共共总够瞧的姑且古来故而故意固怪怪不得惯常光光是归根到底归根结底过于毫不毫无毫无保留地毫无例外好在何必何尝何妨何苦何乐而不为何须何止很很多很少轰然后来呼啦忽地忽然互互相哗啦话说还恍然会豁然活伙同或多或少或许基本基本上基于极极大极度极端极力极其极为急匆匆即将即刻即是说几度几番几乎几经既...又继之加上加以间或简而言之简言之简直见将才将近将要交口较比较为接连不断接下来皆可截然截至藉以借此借以届时仅仅仅谨进来进去近近几年来近来近年来尽管如此尽可能尽快尽量尽然尽如人意尽心竭力尽心尽力尽早精光经常竟竟然究竟就此就地就算居然局
外举凡据称据此据实据说据我所知据悉具体来说决不决非绝绝不绝顶绝对绝非均喀看看来看起来看上去看样子可好可能恐怕快快要来不及来得及来讲来看拦腰牢牢老老大老老实实老是累次累年理当理该理应历立立地立刻立马立时联袂连连连日连日来连声连袂临到另方面另行另一个路经屡屡次屡次三番屡屡缕缕率尔率然略略加略微略为论说马上蛮满没没有每逢每每每时每刻猛然猛然间莫莫不莫非莫如默默地默然呐那末奈难道难得难怪难说内年复一年凝神偶而偶尔怕砰碰巧譬如偏偏乒平素颇迫于扑通其后其实奇齐起初起来起首起头起先岂岂非岂止迄恰逢恰好恰恰恰巧恰如恰似千千万千万千万切切不可切莫切切切勿窃亲口亲身亲手亲眼亲自顷顷刻顷刻间顷刻之间请勿穷年累月取道去权时全都全力全年全然全身心然人人仍仍旧仍然日复一日日见日渐日益日臻如常如此等等如次如今如期如前所述如上如下汝三番两次三番五次三天两头瑟瑟沙沙上上来上去 w e r t y u i o p s d f g h j k l z x c v b n m “ ” 恩 " ' ( ) * A 白 -- .. >> [ ] < > / \ | - _ + = & ^ % # @ ` （） —— — ￥ · ... ‘ ’ 〉〈 … ＞＜＠＃＄％︿＆＊＋～｜［］｛｝ ! # % & ' ( ) * + , - . / 100% 100％ 10元 : ; = ? @ [ \ ] ^ _ ` a amp b c cm d e f g gt h i j k l ldquo love lt m mdash middot mm n no o quot r rarr rdquo s sect t times v w x y z { | } ~ 　、。～ ‖ “ ” 「」『』〖〗【】 ⊙ ≮ ≯ ☆ ★ ● ◎ ◇ ◆ ■ ▲ ※ → 〓！￥＆（）＊＋，－．／：；＞？［＼］｛｝の ◢ ◣ ◤ ◥ ㊣ " “ ” " " ‘ ’ ' ' 〇－ – — ― ︱゛＂＃＄＆︶＊ ﹐ ﹑ ．／ ﹕ ；＠［＼］＾＿﹍﹎﹏｛｜｝～ ¨ ˉ ˇ ˙ ‖ ‘ ’ ′ ″ ﹉﹊﹋﹌︴〈︿〉﹀《》「」『﹃』【︻】〔〕〖〗〝〞〃〆＋ ∕ ⊙ ＜＝＞ ± × ÷ ∈ ∏ ∑ √ ∝ ∟ ∠ ∣ ∧ ∨ ∩ ∪ ∫ ∮ ∴ ∵ ∶ ∷ ∽ ≈ ≌ ≒ ≠ ≡ ≤ ≥ ≦ ≮ ≯ ⊥ ⊿ ⌒ □ △ ▼ ▽ ◇ ○ ◎ ◢ ◣ ◤ ◥ ↑ ↗ → ↘ ↓ ↙ ← ↖ ─ ━ ┄ ┅ ┈ ┉ ═ │ ┃ ┆ ┇ ┊ ┋ ║ ┌ ┍ ┎ ┏ ╒ ╓ ╔ ╭ ┐ ┑ ┒ ┓ ╕ ╖ ╗ ╮ └ ┕ ┖ ┗ ╘ ╙ ╚ ╰ ┘ ┙ ┚ ┛ ╛ ╜ ╝ ╯ ├ ┝ ┞ ┟ ┠ ┡ ┢ ┣ ╞ ╟ ╠ ┤ ┥ ┦ ┧ ┨ ┩ ┪ ┫ ╡ ╢ ╣ ┬ ┭ ┮ ┯ ┰ ┱ ┲ ┳ ╤ ╥ ╦ ┴ ┵ ┶ ┷ ┸ ┹ ┺ ┻ ╧ ╨ ╩ ┼ ┽ ┾ ┿ ╀ ╁ ╂ ╄ ╅ ╆ ╇ ╈ ╉ ╊ ╋ ╪ ╫ ╬ ╱ ╲ ╳ ▁ ▏ ▔ ▕ ▂ ▎ ▃ ▍ ▄ ▌ ▅ ▋ ▆ ▇ ▉ █ ▓ ￠￡ ¤ ￥ § ° · … ‰ ※ 〓 ☆ ♀ ♂ ================================================ FILE: src/DictBuilder.py ================================================ #!/usr/bin/python # -*-coding:utf8-*- ''' Created on 2013-10-12 @author: zyy_max @brief: build word, idf dict from input_folder @modified: 2013-10-15 ==> check whether input a folder or a file @modified: 2013-11-06 ==> build dict from token list, load ori_dict ''' from collections import defaultdict import os import sys class WordDictBuilder: def __init__(self, ori_path='', filelist=[], tokenlist=[]): self.word_dict = defaultdict(int) if ori_path != '' and os.path.exists(ori_path): with open(ori_path) as ins: for line in ins.readlines(): 
                    self.word_dict[line.split('\t')[1]] = int(line.split('\t')[2])
        self.filelist = filelist
        self.tokenlist = tokenlist

    def run(self):
        for filepath in self.filelist:
            self._updateDict(filepath)
        self._updateDictByTokenList()
        return self

    def _updateDict(self, filepath):
        # Count every whitespace-separated token in the file
        with open(filepath, 'r') as ins:
            for line in ins.readlines():
                for word in line.rstrip().split():
                    self.word_dict[word] += 1

    def _updateDictByTokenList(self):
        # Keys are stored as utf8 byte strings
        for token in self.tokenlist:
            if isinstance(token, unicode):
                token = token.encode('utf8')
            self.word_dict[token] += 1

    def save(self, filepath):
        # One line per word, sorted by descending frequency: id, word, frequency
        l = [(value, key) for key, value in self.word_dict.items()]
        l = sorted(l, reverse=True)
        result_lines = []
        for idx, (value, key) in enumerate(l):
            result_lines.append('%s\t%s\t%s%s' % (idx, key, value, os.linesep))
        with open(filepath, 'w') as outs:
            outs.writelines(result_lines)


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage:\tWordDictBuilder.py "
        exit(-1)
    # The first argument is either a single token file or a folder of token files
    if not os.path.isfile(sys.argv[1]):
        filelist = [sys.argv[1] + os.sep + f for f in os.listdir(sys.argv[1])]
    else:
        filelist = [sys.argv[1]]
    builder = WordDictBuilder(filelist=filelist)
    builder.run()
    builder.save(sys.argv[2])

================================================
FILE: src/DictUtils.py
================================================
#!/usr/bin/env python
'''
Created on 2013-11-14
@author zyy_max
@brief utils for word dictionary
'''


class WordDict(dict):
    """
    @brief init, update and save word dictionary
    """
    def __init__(self, dict_path=None):
        # Always define dict_path so that __del__ can check it safely
        self.dict_path = dict_path
        if dict_path is not None:
            self.load_dict(dict_path)

    def load_dict(self, dict_path):
        self.dict_path = dict_path
        print 'Loading word dictionary from %s...' % dict_path
        self.clear()
        with open(dict_path, 'r') as ins:
            for line in ins.readlines():
                wordid, word = line.strip().split()
                if isinstance(word, str):
                    word = word.decode('utf8')
                self[word] = int(wordid)
        return self

    def add_one(self, word):
        if isinstance(word, str):
            word = word.decode('utf8')
        if not word in self:
            max_id = max([0] + self.values())
            self[word] = max_id+1
        return self

    def save_dict(self, dict_path):
        print 'Saving word dictionary to %s...' % dict_path
        word_list = self.items()
        with open(dict_path, 'w') as outs:
            for word, wordid in sorted(word_list):
                # Keys are unicode; encode them before writing to a byte file
                if isinstance(word, unicode):
                    word = word.encode('utf8')
                outs.write('%s\t%s\n' % (wordid, word))

    def __del__(self):
        # Only save when a dictionary path is known
        if self.dict_path is not None:
            self.save_dict(self.dict_path)

================================================
FILE: src/DocUtils.py
================================================
#!/usr/bin/env python
'''
Created on 2013-11-14
@author zyy_max
@brief DocDict for loading docs from db or file, update and save them
'''


class DocDict(dict):
    """
    @brief load docs, update and save them
    """
    def __init__(self, fpath=None):
        self.fpath = fpath
        if fpath is not None:
            self.load_from_file(fpath)

    def load_from_db(self):
        print 'Loading from db'
        self.clear()

    def load_from_file(self, fpath):
        print 'Loading documents from file:',fpath
        self.fpath = fpath
        self.clear()
        with open(fpath, 'r') as ins:
            for line in ins.readlines():
                docid, doc_str = line.strip().split('\t')
                self[int(docid)] = doc_str
        return self

    def update(self, docid, doc_str):
        if not docid in self:
            self[docid] = doc_str
        return self

    def save_to_file(self, fpath):
        with open(fpath, 'w') as outs:
            for key in sorted(self.keys()):
                outs.write('%s\t%s\n' %(key, self[key]))

    def __del__(self):
        # Only save when a file path is known
        if self.fpath is not None:
            self.save_to_file(self.fpath)

================================================
FILE: src/Utils.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
@Created on 2013-10-21
@author zyy_max
@brief utils of common methods
@modified on 2013-10-23 ==> change break condition of cosine(euclidean)_distance_nonzero
'''
import math


def norm_vector_nonzero(ori_vec):
    # ori_vec is a sparse vector: [(idx, value), ...]
    ori_sum = math.sqrt(sum([math.pow(float(value),2) for (idx,value) in ori_vec]))
    if ori_sum < 1e-6:
        return ori_vec
    result_vec = []
    for idx, ori_value in ori_vec:
        result_vec.append((idx, float(ori_value)/ori_sum))
    #print ori_sum
    return result_vec


def cosine_distance_nonzero(feat_vec1, feat_vec2, norm=True):
    # Both sparse vectors must be sorted by ascending index
    if True == norm:
        feat_vec1 = norm_vector_nonzero(feat_vec1)
        feat_vec2 = norm_vector_nonzero(feat_vec2)
    dist = 0
    idx1 = 0
    idx2 = 0
    while idx1 < len(feat_vec1) and idx2 < len(feat_vec2):
        if feat_vec1[idx1][0] == feat_vec2[idx2][0]:
            dist += float(feat_vec1[idx1][1])*float(feat_vec2[idx2][1])
            idx1 += 1
            idx2 += 1
        elif feat_vec1[idx1][0] > feat_vec2[idx2][0]:
            idx2 += 1
        else:
            idx1 += 1
    return dist


def euclidean_distance_nonzero(feat_vec1, feat_vec2, norm=True):
    # Both sparse vectors must be sorted by ascending index
    if True == norm:
        feat_vec1 = norm_vector_nonzero(feat_vec1)
        feat_vec2 = norm_vector_nonzero(feat_vec2)
    dist = 0
    idx1 = 0
    idx2 = 0
    while idx1 < len(feat_vec1) and idx2 < len(feat_vec2):
        if feat_vec1[idx1][0] > feat_vec2[idx2][0]:
            dist += math.pow(float(feat_vec2[idx2][1]), 2)
            idx2 += 1
        elif feat_vec1[idx1][0] < feat_vec2[idx2][0]:
            dist += math.pow(float(feat_vec1[idx1][1]), 2)
            idx1 += 1
        else:
            dist += math.pow(float(feat_vec1[idx1][1])-float(feat_vec2[idx2][1]), 2)
            idx2 += 1
            idx1 += 1
    # Entries left over in the longer vector have no counterpart,
    # so each one contributes its own squared value
    for idx, value in feat_vec1[idx1:]:
        dist += math.pow(float(value), 2)
    for idx, value in feat_vec2[idx2:]:
        dist += math.pow(float(value), 2)
    return math.sqrt(dist)


def norm_vector(ori_vec):
    ori_sum = math.sqrt(sum([math.pow(float(x),2) for x in ori_vec]))
    if ori_sum < 1e-6:
        return ori_vec
    result_vec = []
    for ori_value in ori_vec:
        result_vec.append(float(ori_value)/ori_sum)
    #print ori_sum
    return result_vec


def cosine_distance(feat_vec1, feat_vec2, norm=True):
    dist = 0
    if True == norm:
        feat_vec1 = norm_vector(feat_vec1)
        feat_vec2 = norm_vector(feat_vec2)
    for idx, feat1 in enumerate(feat_vec1):
        if idx >= len(feat_vec2):
            break
        if abs(float(feat1)) < 1e-6 or abs(float(feat_vec2[idx])) < 1e-6:
            continue
        dist += float(feat1)*float(feat_vec2[idx])
    #print dist
    return dist


def euclidean_distance(feat_vec1, feat_vec2, norm=True):
    dist = 0
    if True == norm:
        feat_vec1 = norm_vector(feat_vec1)
        feat_vec2 = norm_vector(feat_vec2)
    len1 = len(feat_vec1)
    len2 = len(feat_vec2)
    # Compare the overlapping prefix of the two dense vectors
    for idx in xrange(min(len1,len2)):
        dist += math.pow(float(feat_vec1[idx])-float(feat_vec2[idx]),2)
    # Add the tail of whichever vector is longer
    if len1 < len2:
        dist += sum([math.pow(float(feat),2) for feat in feat_vec2[len1-len2:]])
    if len1 > len2:
        dist += sum([math.pow(float(feat),2) for feat in feat_vec1[len2-len1:]])
    return math.sqrt(dist)

================================================
FILE: src/__init__.py
================================================
__author__ = 'max.zhang'

================================================
FILE: src/features.py
================================================
#!/usr/bin/python
#-*-coding:utf8-*-
'''
Created on 2013-10-13
@author: zyy_max
@brief: build feature vector with word_dict and token_list
@modified: 2013-10-15 ==> add update_words for FeatureBuilder
@modified: 2013-11-06 ==> add feature_nonzero
@modified: 2013-11-15 ==> add FeatureBuilderUpdate
    word_dict is WordDict in DictUtils
'''
import os,sys


class FeatureBuilder:
    def __init__(self, word_dict):
        self.word_dict = word_dict

    def compute(self, token_list):
        # Every token must already be in word_dict, otherwise a KeyError is raised
        feature = [0]*len(self.word_dict)
        for token in token_list:
            feature[self.word_dict[token]] += 1
        # Keep only the nonzero entries as (idx, term frequency) pairs
        feature_nonzero = [(idx,value) for idx, value in enumerate(feature) if value > 0]
        return feature_nonzero

    def _add_word(self, word):
        if not word in self.word_dict:
            self.word_dict[word] = len(self.word_dict)

    def update_words(self, word_list=[]):
        for word in word_list:
            self._add_word(word)


class FeatureBuilderUpdate(FeatureBuilder):
    def _add_word(self, word):
        self.word_dict.add_one(word)


def feature_single(inputfile, outputfile):
    print inputfile,outputfile
    result_lines = []
    with open(inputfile, 'r') as ins:
        for lineidx, line in enumerate(ins.readlines()):
            feature = fb.compute([token.decode('utf8') for token in line.strip().split()])
            l = []
            for idx,f in feature:
                if f > 1e-6:
                    l.append('%s:%s' %(idx,f))
            result_lines.append(' '.join(l) + os.linesep)
            print 'Finished\r', lineidx,
    with open(outputfile, 'w') as outs:
        outs.writelines(result_lines)
    print 'Wrote to ', outputfile


if __name__=="__main__":
    if len(sys.argv) < 5:
        print "Usage:\tfeatures.py -s/-m "
        exit(-1)
    word_dict = {}
    with open(sys.argv[2], 'r') as ins:
        for line in ins.readlines():
            l = line.split()
            word_dict[l[1].decode('utf8')] = int(l[0])
    fb = FeatureBuilder(word_dict)
    print 'Loaded', len(word_dict), 'words'
    if sys.argv[1] == '-s':
        feature_single(sys.argv[3], sys.argv[4])
    elif sys.argv[1] == '-m':
        for inputfile in os.listdir(sys.argv[3]):
            feature_single(os.path.join(sys.argv[3],inputfile),
                           os.path.join(sys.argv[4],inputfile.replace('.token','.feat')))

================================================
FILE: src/isSimilar.py
================================================
#!/usr/bin/env python
# -*-coding:utf8-*-
'''
Created on 2013-11-06
@author zyy_max
@brief check the similarity of 2 documents by VSM+cosine distance or simhash+hamming distance
'''
import sys
from simhash_imp import SimhashBuilder, hamming_distance
from tokens import JiebaTokenizer
from features import FeatureBuilder
from Utils import norm_vector_nonzero, cosine_distance_nonzero


class DocFeatLoader:
    def __init__(self, simhash_builder, feat_nonzero):
        self.feat_vec = feat_nonzero
        self.feat_vec = norm_vector_nonzero(self.feat_vec)
        self.fingerprint = simhash_builder.sim_hash_nonzero(self.feat_vec)


if __name__ == "__main__":
    if len(sys.argv) < 7:
        print "Usage:\tisSimilar.py <-c/-s> "
        exit(-1)
    doc_path_1, doc_path_2, stopword_path, word_dict, mode, threshold = sys.argv[1:]
    print 'Arguments:', sys.argv[1:]
    with open(doc_path_1) as ins:
        doc_data_1 = ins.read().decode('utf8')
        print 'Loaded', doc_path_1
    with open(doc_path_2) as ins:
        doc_data_2 = ins.read().decode('utf8')
        print 'Loaded', doc_path_2
    # Init tokenizer
    jt = JiebaTokenizer(stopword_path, 'c')
    # Tokenization
    doc_token_1 = jt.tokens(doc_data_1)
    doc_token_2 = jt.tokens(doc_data_2)
    print 'Loading word dict...'
    # Load word list from word_dict
    word_list = []
    with open(word_dict, 'r') as ins:
        for line in ins.readlines():
            word_list.append(line.split()[1])
    # Build unicode string word dict
    word_dict = {}
    for idx, ascword in enumerate(word_list):
        word_dict[ascword.decode('utf8')] = idx
    # Build nonzero-feature
    fb = FeatureBuilder(word_dict)
    doc_feat_1 = fb.compute(doc_token_1)
    doc_feat_2 = fb.compute(doc_token_2)
    # Init simhash_builder
    smb = SimhashBuilder(word_list)
    doc_fl_1 = DocFeatLoader(smb, doc_feat_1)
    doc_fl_2 = DocFeatLoader(smb, doc_feat_2)
    if mode == '-c':
        # Cosine mode: documents match when the similarity exceeds the threshold
        print 'Matching by VSM + cosine distance'
        dist = cosine_distance_nonzero(doc_fl_1.feat_vec, doc_fl_2.feat_vec, norm=False)
        if dist > float(threshold):
            print 'Matching Result:\t<True:%s>' % dist
        else:
            print 'Matching Result:\t<False:%s>' % dist
    elif mode == '-s':
        # Simhash mode: documents match when the Hamming distance is below the threshold
        print 'Matching by Simhash + hamming distance'
        dist = hamming_distance(doc_fl_1.fingerprint, doc_fl_2.fingerprint)
        if dist < float(threshold):
            print 'Matching Result:\t<True:%s>' % dist
        else:
            print 'Matching Result:\t<False:%s>' % dist

================================================
FILE: src/launch.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
Created on 2013-10-14
@author: zyy_max
@brief: launch entry of near-duplicate detection system
'''
import os
import sys
from tokens import JiebaTokenizer
from simhash_imp import SimhashBuilder, hamming_distance
from features import FeatureBuilder

if __name__=="__main__":
    if len(sys.argv) < 7:
        print "Usage:\tlaunch.py word_dict_path stop_words_path fingerprint_path documents_path test_path result_path"
        exit(-1)
    # Load word list
    word_list = []
    with open(sys.argv[1], 'r') as ins:
        for line in ins.readlines():
            word_list.append(line.split()[1])
    # Init tokenizer
    jt = JiebaTokenizer(sys.argv[2], 'c')
    # Init feature_builder
    word_dict = {}
    for idx, ascword in enumerate(word_list):
        word_dict[ascword.decode('utf8')] = idx
    fb = FeatureBuilder(word_dict)
    # Init simhash_builder
    smb = SimhashBuilder(word_list)
    # Load fingerprint list
    fingerprint_list = []
    with open(sys.argv[3], 'r') as ins:
        for line in ins.readlines():
            fingerprint_list.append(int(line))
    # For exp: load document content
    doc_list = []
    with open(sys.argv[4], 'r') as ins:
        for line in ins.readlines():
            doc_list.append(line.strip())
    # Detection process begins
    min_sim = 64
    min_docid = 0
    with open(sys.argv[5], 'r') as ins:
        for lineidx, line in enumerate(ins.readlines()):
            if lineidx != 642:
                continue
            # Tokenize
            tokens = jt.tokens(line.strip().decode('utf8'))
            # Compute text feature
            feature = fb.compute(tokens)
            # Compute simhash; fb.compute returns (idx, value) pairs,
            # so the nonzero variant is the one that applies
            fingerprint = smb.sim_hash_nonzero(feature)
            result_list = []
            for idx, fp in enumerate(fingerprint_list):
                sim = hamming_distance(fingerprint, fp, 64)
                result_list.append((sim, idx))
            result_list = sorted(result_list, cmp=lambda x,y: cmp(x[0],y[0]))
            if result_list[0][0] < min_sim:
                min_sim, min_docid = result_list[0][0], lineidx
            #'''
            with open(sys.argv[6], 'w') as outs:
                outs.write(line.strip()+os.linesep)
                for sim, idx in result_list:
                    outs.write('%s\t%s%s' %(sim, doc_list[idx], os.linesep))
            #'''
            #if lineidx == 2:
            #    break
    print min_sim, min_docid

================================================
FILE: src/launch_incre.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
Created on 2013-10-15
@author: zyy_max
@brief: incremental-version launch entry of near-duplicate detection system
'''
import os
import sys
from tokens import JiebaTokenizer
from simhash_imp import SimhashBuilder, hamming_distance
from features import FeatureBuilder


class FeatureContainer:
    def __init__(self, word_dict_path):
        # Load word list
        self.word_dict_path = word_dict_path
        self.word_list = []
        with open(word_dict_path, 'r') as ins:
            for line in ins.readlines():
                self.word_list.append(line.split()[1])
        self.word_dict = {}
        for idx, ascword in enumerate(self.word_list):
            self.word_dict[ascword.decode('utf8')] = idx
        self.fb = FeatureBuilder(self.word_dict)
        self.smb = SimhashBuilder(self.word_list)
        print 'Loaded ', len(self.word_list), 'words'

    def compute_feature(self, token_list):
        new_words = []
        for token in token_list:
            if not token in self.word_dict:
                new_words.append(token)
        if len(new_words) != 0:
            # Update word_list and word_dict
            self.fb.update_words(new_words)
            self.smb.update_words([word.encode('utf8') for word in new_words])
            self.word_dict = self.fb.word_dict
            self.word_list.extend([word.encode('utf8') for word in new_words])
        feature_vec = self.fb.compute(token_list)
        # fb.compute returns (idx, value) pairs, so use the nonzero variant
        return feature_vec, self.smb.sim_hash_nonzero(feature_vec)
    '''
    def __del__(self):
        with open(self.word_dict_path, 'w') as outs:
            for idx, word in enumerate(self.word_list):
                outs.write('%s\t%s%s'%(idx, word, os.linesep))
    '''


if __name__=="__main__":
    if len(sys.argv) < 7:
        print "Usage:\tlaunch_inc.py "
        exit(-1)
    # Init tokenizer
    jt = JiebaTokenizer(sys.argv[2], 'c')
    # Init feature_builder and simhash_builder
    fc = FeatureContainer(sys.argv[1])
    # Load fingerprint list
    fingerprint_list = []
    with open(sys.argv[3], 'r') as ins:
        for line in ins.readlines():
            fingerprint_list.append(int(line))
    # For exp: load document content
    doc_list = []
    with open(sys.argv[4], 'r') as ins:
        for line in ins.readlines():
            doc_list.append(line.strip())
    # Detection process begins
    min_sim = 64
    min_docid = 0
    with open(sys.argv[5], 'r') as ins:
        for lineidx, line in enumerate(ins.readlines()):
            # Tokenize
            tokens = jt.tokens(line.strip().decode('utf8'))
            feature, fingerprint = fc.compute_feature(tokens)
            result_list = []
            for idx, fp in enumerate(fingerprint_list):
                sim = hamming_distance(fingerprint, fp, 64)
                result_list.append((sim, idx))
            result_list = sorted(result_list, cmp=lambda x,y: cmp(x[0],y[0]))
            if result_list[0][0] < min_sim:
                min_sim, min_docid = result_list[0][0], lineidx
            #'''
            with open(sys.argv[6], 'w') as outs:
                outs.write(line.strip()+os.linesep)
                for sim, idx in result_list:
                    outs.write('%s\t%s%s' %(sim, doc_list[idx], os.linesep))
            #'''
            #if lineidx == 2:
            #    break
    with open('word_dict_new.txt', 'w') as outs:
        for idx, word in enumerate(fc.word_list):
            outs.write('%s\t%s%s'%(idx, word, os.linesep))

================================================
FILE: src/preprocess.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
Created on 2013-11-06
@author zyy_max
@brief update word_dict by token result of document
'''
import os
import sys
import time
from tokens import JiebaTokenizer
from DictBuilder import WordDictBuilder

if __name__=="__main__":
    if len(sys.argv) < 4:
        print "Usage:\tpreprocess.py "
        exit(-1)
    doc_path, stopword_path, worddict_path = sys.argv[1:]
    print 'Arguments:',sys.argv[1:]

    # Init tokenizer
    jt = JiebaTokenizer(stopword_path, 'c')
    # Load doc data
    with open(doc_path) as ins:
        doc_data = ins.read().decode('utf8')
    # Tokenization
    doc_tokens = jt.tokens(doc_data)
    # Write to token file
    with open(doc_path[:doc_path.rfind('.')]+'.token', 'w') as outs:
        outs.write('/'.join([token.encode('utf8') for token in doc_tokens]))
    # Load original word dict, update and save
    wdb = WordDictBuilder(worddict_path, tokenlist=doc_tokens)
    wdb.run()
    wdb.save(worddict_path)
    print 'Totally', len(wdb.word_dict), 'words'

================================================
FILE: src/simhash_imp.py
================================================
#!/usr/bin/env python
# -*- coding=utf-8 -*-
'''
Created on 2013-10-13
@author: zyy_max
@brief: build simhash and compute hamming_distance
@modified: 2013-10-15 ==> add update_word for SimhashBuilder
'''
# Implementation of Charikar simhashes in Python
# See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1
import os, sys


def hamming_distance(hash_a, hash_b, hashbits=128):
    # Count the differing bits among the lowest `hashbits` bits
    x = (hash_a ^ hash_b) & ((1 << hashbits) - 1)
    tot = 0
    while x:
        tot += 1
        x &= x-1
    return tot


class SimhashBuilder:
    def __init__(self, word_list=[], hashbits=128):
        self.hashbits = hashbits
        self.hashval_list = [self._string_hash(word) for word in word_list]
        print 'Totally: %s words' %(len(self.hashval_list),)
        """
        with open('word_hash.txt', 'w') as outs:
            for word in word_list:
                outs.write(word+'\t'+str(self._string_hash(word))+os.linesep)
        """

    def _string_hash(self, word):
        # A variable-length version of Python's builtin hash
        if word == "":
            return 0
        else:
            x = ord(word[0])<<7
            m = 1000003
            mask = 2**self.hashbits-1
            for c in word:
                x = ((x*m)^ord(c)) & mask
            x ^= len(word)
            if x == -1:
                x = -2
            return x

    def sim_hash_nonzero(self, feature_vec):
        finger_vec = [0]*self.hashbits
        # Feature_vec is like [(idx,nonzero-value),(idx,nonzero-value)...]
        for idx, feature in feature_vec:
            hashval = self.hashval_list[int(idx)]
            for i in range(self.hashbits):
                bitmask = 1 << i
                # Add the weight where the word's hash bit is 1, subtract it otherwise
                if bitmask & hashval != 0:
                    finger_vec[i] += float(feature)
                else:
                    finger_vec[i] -= float(feature)
        fingerprint = 0
        for i in range(self.hashbits):
            if finger_vec[i] >= 0:
                fingerprint += 1 << i
        # The document fingerprint is the sum of the bits whose accumulated weight is >= 0
        return fingerprint

    def sim_hash(self, feature_vec):
        # feature_vec is a dense vector indexed by word id
        finger_vec = [0]*self.hashbits
        for idx, feature in enumerate(feature_vec):
            if float(feature) < 1e-6:
                continue
            hashval = self.hashval_list[idx]
            for i in range(self.hashbits):
                bitmask = 1 << i
                if bitmask & hashval != 0:
                    finger_vec[i] += float(feature)
                else:
                    finger_vec[i] -= float(feature)
        fingerprint = 0
        for i in range(self.hashbits):
            if finger_vec[i] >= 0:
                fingerprint += 1 << i
        # The document fingerprint is the sum of the bits whose accumulated weight is >= 0
        return fingerprint

    def _add_word(self, word):
        self.hashval_list.append(self._string_hash(word))

    def update_words(self, word_list=[]):
        for word in word_list:
            self._add_word(word)


class simhash():
    def __init__(self, tokens='', hashbits=128):
        self.hashbits = hashbits
        self.hash = self.simhash(tokens)

    def __str__(self):
        return str(self.hash)

    def __long__(self):
        return long(self.hash)

    def __float__(self):
        return float(self.hash)

    def simhash(self, tokens):
        # Returns a Charikar simhash with appropriate bitlength
        v = [0]*self.hashbits
        for t in [self._string_hash(x) for x in tokens]:
            bitmask = 0
            #print (t)
            for i in range(self.hashbits):
                bitmask = 1 << i
                #print(t,bitmask, t & bitmask)
                if t & bitmask:
                    v[i] += 1 # If the current bit is 1, add 1 to that position
                else:
                    v[i] += -1 # Otherwise subtract 1 from that position
        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i
        # The document fingerprint is the sum of the bits whose accumulated weight is >= 0
        return fingerprint

    def _string_hash(self, v):
        # A variable-length version of Python's builtin hash
        if v == "":
            return 0
        else:
            x = ord(v[0])<<7
            m = 1000003
            mask = 2**self.hashbits-1
            for c in v:
                x = ((x*m)^ord(c)) & mask
            x ^= len(v)
            if x == -1:
                x = -2
            return x

    def hamming_distance(self, other_hash):
        x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
        tot = 0
        while x:
            tot += 1
            x &= x-1
        return tot

    def similarity(self, other_hash):
        a = float(self.hash)
        b = float(other_hash)
        if a>b:
            return b/a
        return a/b


if __name__ == '__main__':
    # See what Google values most? Punctuation?
    #s = '看看哪些东西google最看重？标点？'
    #hash1 =simhash(s.split())
    #print("0x%x" % hash1)
    #print ("%s\t0x%x" % (s, hash1))

    #s = '看看哪些东西google最看重！标点！'
    #hash2 = simhash(s.split())
    #print ("%s\t[simhash = 0x%x]" % (s, hash2))

    #print '%f%% percent similarity on hash' %(100*(hash1.similarity(hash2)))
    #print hash1.hamming_distance(hash2),"bits differ out of", hash1.hashbits

    if len(sys.argv) < 4:
        print "Usage:\tsimhash_imp.py "
        exit(-1)
    word_list = []
    with open(sys.argv[1], 'r') as ins:
        for idx, line in enumerate(ins.readlines()):
            word_list.append(line.split()[1])
            print '\rloading word', idx,
    sim_b = SimhashBuilder(word_list)
    result_lines = []
    print ''
    with open(sys.argv[2], 'r') as ins:
        for idx, line in enumerate(ins.readlines()):
            print '\rprocessing doc', idx,
            feature_vec = line.strip().split()
            feature_vec = [(int(item.split(':')[0]),float(item.split(':')[1])) for item in feature_vec]
            fingerprint = sim_b.sim_hash_nonzero(feature_vec)
            result_lines.append(str(fingerprint)+os.linesep)
    with open(sys.argv[3], 'w') as outs:
        outs.writelines(result_lines)

================================================
FILE: src/tokens.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 20131012
@author: zyy_max
@brief: get tokens from input file by jieba
'''
import jieba
import os
import sys


class JiebaTokenizer:
    def __init__(self, stop_words_path, mode='s'):
        self.stopword_set = set()
        # load stopwords
        with open(stop_words_path) as ins:
            for line in ins:
                self.stopword_set.add(line.strip().decode('utf8'))
        self.mode = mode

    def tokens(self, intext):
        # Collapse every run of whitespace into a single space before cutting
        intext = u' '.join(intext.split())
        if self.mode == 's':
            token_list = jieba.cut_for_search(intext)
        else:
            token_list = jieba.cut(intext)
        return [token for token in token_list if token.strip() != u'' and not token in self.stopword_set]


def token_single_file(input_fname, output_fname):
    result_lines = []
    with open(input_fname) as ins:
        for line in ins:
            line = line.strip().decode('utf8')
            tokens = jt.tokens(line)
            result_lines.append(u' '.join(tokens).encode('utf8'))
    with open(output_fname, 'w') as outs:
        outs.write(os.linesep.join(result_lines))
    print 'Wrote to ', output_fname


if __name__ == "__main__":
    if len(sys.argv) < 6 or sys.argv[1] not in ['-s', '-m'] or sys.argv[4] not in ['c', 's']:
        print "Usage:\ttokens.py " \
            " "
        print "file_mode:\t-s:\tsingle file"
        print "\t\t-m:\tmultiple files"
        print "cut_mode:\tc:\tnormal mode of Jieba"
        print "\t\ts:\tcut_for_search mode of Jieba"
        exit(-1)
    file_mode, input_filepath, output_filepath, cut_mode, stopword_file = sys.argv[1:]
    jt = JiebaTokenizer(stopword_file, cut_mode)
    # extract tokens and filter by stopwords
    if file_mode == '-s':
        token_single_file(input_filepath, output_filepath)
    elif file_mode == '-m':
        for input_file in os.listdir(input_filepath):
            # Strip the extension (e.g. .ori) so that doc.ori becomes doc.token
            prefix = os.path.splitext(input_file)[0]
            token_single_file(os.path.join(input_filepath, input_file),
                              os.path.join(output_filepath, prefix+'.token'))

================================================
FILE: src/webcontent_filter.sh
================================================
#!/bin/bash
# Delete nonprint characters
# Delete 0-9a-zA-Z and some useless characters
# Turn sequence of empty char to single one
# Delete empty lines
sed 's/[^[:print:]]//g' "$1" \
    | sed 's/[0-9a-zA-Z+=\./:\"<>|_&#]/ /g' \
    | sed 's/  */ /g' > "$2"
# sed '/^ *$/d' > $2

================================================
FILE: test/test_token.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 20150825
@author: zyy_max
@brief: unit test of src/tokens.py
'''
import unittest
import sys
sys.path.append('..')
from src.tokens import JiebaTokenizer


class JiebaTokenizerTestCase(unittest.TestCase):
    def setUp(self):
        self.jt = JiebaTokenizer("../data/stopwords.txt")

    def testTokens(self):
        in_text = u"完整的单元测试很少只执行一个测试用例，" \
            u"开发人员通常都需要编写多个测试用例才能" \
            u"对某一软件功能进行比较完整的测试，这些" \
            u"相关的测试用例称为一个测试用例集，在" \
            u"PyUnit中是用TestSuite类来表示的。"
        tokens_text = u"完整/单元/测试/单元测试/只/执行/" \
            u"一个/测试/试用/测试用例/开发/发人/" \
            u"人员/开发人员/通常/需要/编写/多个/" \
            u"测试/试用/测试用例/软件/功能/进行/" \
            u"比较/完整/测试/相关/测试/试用/测试用例/" \
            u"称为/一个/测试/试用/测试用例/集/PyUnit/" \
            u"中是/TestSuite/类来/表示"
        self.assertEqual(tokens_text, u'/'.join(self.jt.tokens(in_text)),
                         "Tokenization Results differ")


if __name__ == "__main__":
    unittest.main()
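
The two matching modes in this repository (VSM + cosine and Simhash + Hamming distance) can be illustrated with a small standalone sketch. It is not the repository's code: it is written for Python 3, it takes plain token-to-frequency dicts instead of the word.dict / .feat files, and it hashes tokens with MD5 (`token_hash` is a hypothetical helper, used here in place of `_string_hash`) with an assumed 64-bit fingerprint.

```python
import hashlib
import math


def token_hash(token, hashbits=64):
    # Hypothetical helper: MD5-based token hash (an assumption, not the repo's _string_hash)
    digest = hashlib.md5(token.encode("utf8")).hexdigest()
    return int(digest, 16) & ((1 << hashbits) - 1)


def simhash(weights, hashbits=64):
    # weights: dict mapping token -> term frequency
    votes = [0.0] * hashbits
    for token, weight in weights.items():
        h = token_hash(token, hashbits)
        for i in range(hashbits):
            # Add the weight where the token's hash bit is 1, subtract it otherwise
            if h & (1 << i):
                votes[i] += weight
            else:
                votes[i] -= weight
    fingerprint = 0
    for i in range(hashbits):
        # A bit of the fingerprint is set when its accumulated vote is >= 0
        if votes[i] >= 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(hash_a, hash_b, hashbits=64):
    # Number of differing bits among the lowest `hashbits` bits
    x = (hash_a ^ hash_b) & ((1 << hashbits) - 1)
    count = 0
    while x:
        count += 1
        x &= x - 1  # clear the lowest set bit
    return count


def cosine_similarity(weights_a, weights_b):
    # Cosine of the angle between two sparse term-frequency vectors
    dot = sum(v * weights_b.get(k, 0) for k, v in weights_a.items())
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = {"测试": 3, "用例": 2, "软件": 1}
    doc_b = {"测试": 3, "用例": 2, "功能": 1}
    print(cosine_similarity(doc_a, doc_b))
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

As in isSimilar.py, a higher cosine value means more similar documents, while a lower Hamming distance means more similar fingerprints, so the threshold comparison points in opposite directions for the two modes.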