Full Code of zyymax/text-similarity for AI

Repository: zyymax/text-similarity
Branch: master
Commit: e2c01e426cc4
Files: 17
Total size: 38.4 KB

Directory structure:
gitextract_wu61udf5/

├── .gitignore
├── README.md
├── data/
│   └── stopwords.txt
├── src/
│   ├── DictBuilder.py
│   ├── DictUtils.py
│   ├── DocUtils.py
│   ├── Utils.py
│   ├── __init__.py
│   ├── features.py
│   ├── isSimilar.py
│   ├── launch.py
│   ├── launch_incre.py
│   ├── preprocess.py
│   ├── simhash_imp.py
│   ├── tokens.py
│   └── webcontent_filter.sh
└── test/
    └── test_token.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Rope
.ropeproject

# Django stuff:
*.log
*.pot

# Sphinx documentation
docs/_build/



================================================
FILE: README.md
================================================
text-similarity
===============
By max.zhang@2013-11-06

Description: a text-similarity detection tool implemented in Python.

# Dependencies
*	python
*	python-jieba
*	bash

# Directory layout
data folder

	-stopwords.txt (stopword list)

data/temp folder (holds intermediate result files and folders; each line in these files represents one document)

	-*.content	raw text parsed from web pages (noisy)

	-*.ori		preprocessed raw text, ready for detection (denoised)

	-*.token		Chinese word-segmentation results

	-word.dict	feature dictionary built from the segmentation results

	-*.feat		feature-vector files

	-*.fprint		Simhash fingerprint files

src/ folder

	source code


# Usage

## Checking whether two documents are duplicates (integrated)

### Building the feature dictionary (preprocess.py)

brief: tokenize the raw text and add the tokens to the feature dictionary

INPUT: raw text + stopword list + feature dictionary

OUTPUT: saves the tokens to a .token file and updates the feature dictionary file

usage:

	src/preprocess.py <*.ori> <stopword_path> <word_dict>

e.g.

	src/preprocess.py data/temp/doc1.ori data/stopwords.txt data/word.dict

{Note: run this once for each of the two documents to be compared, i.e. the tokens of both documents must be added to the feature dictionary}


### Checking document duplication (isSimilar.py)

brief: decide whether two documents are duplicates

INPUT: document 1 + document 2 + stopword list + feature dictionary + mode + threshold

OUTPUT: prints whether the two documents are duplicates, together with their similarity

usage:

	src/isSimilar.py <doc1> <doc2> <stopword_path> <word_dict> <-c/-s> <threshold>

	-c/-s	choose VSM+CosineDistance or Simhash+HammingDistance for the duplicate check

e.g.

	src/isSimilar.py data/temp/doc1.ori data/temp/doc2.ori data/stopwords.txt data/word.dict -c 0.8
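
The two modes apply their thresholds in opposite directions, mirroring the checks in src/isSimilar.py. A minimal sketch of the decision rule (variable names and threshold values are illustrative):

	# -c: cosine similarity of the two feature vectors; HIGHER means more similar
	is_dup = cosine_distance_nonzero(feat1, feat2, norm=False) > threshold  # e.g. 0.8
	# -s: hamming distance between the two simhash fingerprints; LOWER means more similar
	is_dup = hamming_distance(fp1, fp2) < threshold                         # e.g. 8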


## Detailed processing pipeline (step by step)

### Denoising (webcontent_filter.sh)

brief: initial denoising of the raw text (removes special symbols, Latin letters, digits, ...), collapses consecutive spaces, and deletes blank lines

INPUT: text to be denoised (.content)

OUTPUT: denoised text (.ori)

usage:

	src/webcontent_filter.sh <*.content> <*.ori>

e.g.

	src/webcontent_filter.sh data/temp/all.content data/temp/all.ori
	

### Preprocessing

#### Chinese word segmentation (tokens.py)

brief: segment the denoised raw text into Chinese words with the Jieba tokenizer

INPUT: denoised text (.ori)

OUTPUT: Chinese word-segmentation results (.token)

usage:

	./tokens.py  -s/-m <*.ori/inputfolder> <*.token/outputfolder> c/s[mode] <stopword.list>

	-s[single]/-m[multiple]  segment a single text file (*.ori) or a whole folder of text files

		-s <*.ori> <*.token>

		-m <inputfolder> <outputfolder> {Note: in -m mode, source file names should preferably end in .ori}

	c/s[mode]	Jieba tokenizer mode

		c mode	jieba.cut(...)

		s mode	jieba.cut_for_search()

e.g.

	src/tokens.py  -s  data/temp/all.ori data/temp/all.token c data/stopwords.txt


#### Building the feature dictionary (DictBuilder.py)

brief: build a feature dictionary, sorted by descending word frequency, from a token file or folder

INPUT: Chinese word-segmentation results (.token)

OUTPUT: the generated feature dictionary, one entry per line: ID + feature word + frequency

usage:

	src/DictBuilder.py <input_folder/*.token> <output_file>

e.g.

	src/DictBuilder.py data/temp/all.token data/temp/word.dict
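
For example, a word.dict built this way contains tab-separated lines like the following (entries are illustrative):

	0	文本	37
	1	相似	19
	2	检测	8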


#### Building feature vectors (features.py)

brief: build feature-vector files from the segmentation results and the feature dictionary

INPUT: the tokenized text from the first step + the feature dictionary from the second step

OUTPUT: one feature vector per document, one document per line: id1:nonzero-tf id2:nonzero-tf ...

usage:

	src/features.py -s/-m <word_dict_path> <tokens_file/tokens_folder> <feature_file/feature_folder>

	-s[single]/-m[multiple]  build feature vectors for a single token file (*.token) or for a whole folder of token files

e.g.

	src/features.py -s data/temp/word.dict data/temp/all.token data/temp/all.feat
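
Each line of the .feat file is a sparse term-frequency vector; a document containing the word with ID 0 twice and the word with ID 7 once would be written as (illustrative):

	0:2 7:1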


#### Building Simhash fingerprints (simhash_imp.py)

brief: build the fingerprint file from the feature vectors and the feature dictionary

INPUT: feature dictionary + feature-vector file

OUTPUT: fingerprint file

usage:

	src/simhash_imp.py <word_dict_path> <*.feat> <*.fprint>

e.g.

	src/simhash_imp.py data/temp/word.dict data/temp/all.feat data/temp/all.fprint
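
The .fprint file holds one integer fingerprint per line, one per document. Two documents can then be compared with hamming_distance from src/simhash_imp.py; a minimal sketch, assuming Python 2 as in the rest of the project:

	from simhash_imp import hamming_distance
	# read the first two fingerprints back from the .fprint file
	fp1, fp2 = [int(line) for line in open('data/temp/all.fprint')][:2]
	print hamming_distance(fp1, fp2)  # small distance => near-duplicates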

## Unit tests

    cd test
    python test_token.py


================================================
FILE: data/stopwords.txt
================================================
,
?
、
。
《
》
!
,
:
;
?
人民
末##末
啊
阿
哎
哎呀
哎哟
唉
俺
俺们
按
按照
吧
吧哒
把
罢了
被
本
本着
比
比方
比如
鄙人
彼
彼此
边
别
别的
别说
并
并且
不比
不成
不单
不但
不独
不管
不光
不过
不仅
不拘
不论
不怕
不然
不如
不特
不惟
不问
不只
朝
朝着
趁
趁着
乘
冲
除
除此之外
除非
除了
此
此间
此外
从
从而
打
待
但
但是
当
当着
到
得
的
的话
等
等等
地
第
叮咚
对
对于
多
多少
而
而况
而且
而是
而外
而言
而已
尔后
反过来
反过来说
反之
非但
非徒
否则
嘎
嘎登
该
赶
个
各
各个
各位
各种
各自
给
根据
跟
故
故此
固然
关于
管
归
果然
果真
过
哈
哈哈
呵
和
何
何处
何况
何时
嘿
哼
哼唷
呼哧
乎
哗
还是
还有
换句话说
换言之
或
或是
或者
极了
及
及其
及至
即
即便
即或
即令
即若
即使
几
几时
己
既
既然
既是
继而
加之
假如
假若
假使
鉴于
将
较
较之
叫
接着
结果
借
紧接着
进而
尽
尽管
经
经过
就
就是
就是说
据
具体地说
具体说来
开始
开外
靠
咳
可
可见
可是
可以
况且
啦
来
来着
离
例如
哩
连
连同
两者
了
临
另
另外
另一方面
论
嘛
吗
慢说
漫说
冒
么
每
每当
们
莫若
某
某个
某些
拿
哪
哪边
哪儿
哪个
哪里
哪年
哪怕
哪天
哪些
哪样
那
那边
那儿
那个
那会儿
那里
那么
那么些
那么样
那时
那些
那样
乃
乃至
呢
能
你
你们
您
宁
宁可
宁肯
宁愿
哦
呕
啪达
旁人
呸
凭
凭借
其
其次
其二
其他
其它
其一
其余
其中
起
起见
岂但
恰恰相反
前后
前者
且
然而
然后
然则
让
人家
任
任何
任凭
如
如此
如果
如何
如其
如若
如上所述
若
若非
若是
啥
上下
尚且
设若
设使
甚而
甚么
甚至
省得
时候
什么
什么样
使得
是
是的
首先
谁
谁知
顺
顺着
似的
虽
虽然
虽说
虽则
随
随着
所
所以
他
他们
他人
它
它们
她
她们
倘
倘或
倘然
倘若
倘使
腾
替
通过
同
同时
哇
万一
往
望
为
为何
为了
为什么
为着
喂
嗡嗡
我
我们
呜
呜呼
乌乎
无论
无宁
毋宁
嘻
吓
相对而言
像
向
向着
嘘
呀
焉
沿
沿着
要
要不
要不然
要不是
要么
要是
也
也罢
也好
一
一般
一旦
一方面
一来
一切
一样
一则
依
依照
矣
以
以便
以及
以免
以至
以至于
以致
抑或
因
因此
因而
因为
哟
用
由
由此可见
由于
有
有的
有关
有些
又
于
于是
于是乎
与
与此同时
与否
与其
越是
云云
哉
再说
再者
在
在下
咱
咱们
则
怎
怎么
怎么办
怎么样
怎样
咋
照
照着
者
这
这边
这儿
这个
这会儿
这就是说
这里
这么
这么点儿
这么些
这么样
这时
这些
这样
正如
吱
之
之类
之所以
之一
只是
只限
只要
只有
至
至于
诸位
着
着呢
自
自从
自个儿
自各儿
自己
自家
自身
综上所述
总的来看
总的来说
总的说来
总而言之
总之
纵
纵令
纵然
纵使
遵照
作为
兮
呃
呗
咚
咦
喏
啐
喔唷
嗬
嗯
嗳
~
!
.
:
(
)
*
A
白
社会主义
--
..
>>
 [
 ]

<
>
/
\
|
-
_
+
=
&
^
%
#
@
`
;
$
(
)
——
—
¥
·
...
〉
〈
…
 
0
1
2
3
4
5
6
7
8
9
二
三
四
五
六
七
八
九
零
>
<
@
#
$
%
︿
&
*
+
~
|
[
]
{
}
啊哈
啊呀
啊哟
挨次
挨个
挨家挨户
挨门挨户
挨门逐户
挨着
按理
按期
按时
按说
暗地里
暗中
暗自
昂然
八成
白白
半
梆
保管
保险
饱
背地里
背靠背
倍感
倍加
本人
本身
甭
比起
比如说
比照
毕竟
必
必定
必将
必须
便
别人
并非
并肩
并没
并没有
并排
并无
勃然
不
不必
不常
不大
不但...而且
不得
不得不
不得了
不得已
不迭
不定
不对
不妨
不管怎样
不会
不仅...而且
不仅仅
不仅仅是
不经意
不可开交
不可抗拒
不力
不了
不料
不满
不免
不能不
不起
不巧
不然的话
不日
不少
不胜
不时
不是
不同
不能
不要
不外
不外乎
不下
不限
不消
不已
不亦乐乎
不由得
不再
不择手段
不怎么
不曾
不知不觉
不止
不止一次
不至于
才
才能
策略地
差不多
差一点
常
常常
常言道
常言说
常言说得好
长此下去
长话短说
长期以来
长线
敞开儿
彻夜
陈年
趁便
趁机
趁热
趁势
趁早
成年
成年累月
成心
乘机
乘胜
乘势
乘隙
乘虚
诚然
迟早
充分
充其极
充其量
抽冷子
臭
初
出
出来
出去
除此
除此而外
除此以外
除开
除去
除却
除外
处处
川流不息
传
传说
传闻
串行
纯
纯粹
此后
此中
次第
匆匆
从不
从此
从此以后
从古到今
从古至今
从今以后
从宽
从来
从轻
从速
从头
从未
从无到有
从小
从新
从严
从优
从早到晚
从中
从重
凑巧
粗
存心
达旦
打从
打开天窗说亮话
大
大不了
大大
大抵
大都
大多
大凡
大概
大家
大举
大略
大面儿上
大事
大体
大体上
大约
大张旗鼓
大致
呆呆地
带
殆
待到
单
单纯
单单
但愿
弹指之间
当场
当儿
当即
当口儿
当然
当庭
当头
当下
当真
当中
倒不如
倒不如说
倒是
到处
到底
到了儿
到目前为止
到头
到头来
得起
得天独厚
的确
等到
叮当
顶多
定
动不动
动辄
陡然
都
独
独自
断然
顿时
多次
多多
多多少少
多多益善
多亏
多年来
多年前
而后
而论
而又
尔等
二话不说
二话没说
反倒
反倒是
反而
反手
反之亦然
反之则
方
方才
方能
放量
非常
非得
分期
分期分批
分头
奋勇
愤然
风雨无阻
逢
弗
甫
嘎嘎
该当
概
赶快
赶早不赶晚
敢
敢情
敢于
刚
刚才
刚好
刚巧
高低
格外
隔日
隔夜
个人
各式
更
更加
更进一步
更为
公然
共
共总
够瞧的
姑且
古来
故而
故意
固
怪
怪不得
惯常
光
光是
归根到底
归根结底
过于
毫不
毫无
毫无保留地
毫无例外
好在
何必
何尝
何妨
何苦
何乐而不为
何须
何止
很
很多
很少
轰然
后来
呼啦
忽地
忽然
互
互相
哗啦
话说
还
恍然
会
豁然
活
伙同
或多或少
或许
基本
基本上
基于
极
极大
极度
极端
极力
极其
极为
急匆匆
即将
即刻
即是说
几度
几番
几乎
几经
既...又
继之
加上
加以
间或
简而言之
简言之
简直
见
将才
将近
将要
交口
较比
较为
接连不断
接下来
皆可
截然
截至
藉以
借此
借以
届时
仅
仅仅
谨
进来
进去
近
近几年来
近来
近年来
尽管如此
尽可能
尽快
尽量
尽然
尽如人意
尽心竭力
尽心尽力
尽早
精光
经常
竟
竟然
究竟
就此
就地
就算
居然
局外
举凡
据称
据此
据实
据说
据我所知
据悉
具体来说
决不
决非
绝
绝不
绝顶
绝对
绝非
均
喀
看
看来
看起来
看上去
看样子
可好
可能
恐怕
快
快要
来不及
来得及
来讲
来看
拦腰
牢牢
老
老大
老老实实
老是
累次
累年
理当
理该
理应
历
立
立地
立刻
立马
立时
联袂
连连
连日
连日来
连声
连袂
临到
另方面
另行
另一个
路经
屡
屡次
屡次三番
屡屡
缕缕
率尔
率然
略
略加
略微
略为
论说
马上
蛮
满
没
没有
每逢
每每
每时每刻
猛然
猛然间
莫
莫不
莫非
莫如
默默地
默然
呐
那末
奈
难道
难得
难怪
难说
内
年复一年
凝神
偶而
偶尔
怕
砰
碰巧
譬如
偏偏
乒
平素
颇
迫于
扑通
其后
其实
奇
齐
起初
起来
起首
起头
起先
岂
岂非
岂止
迄
恰逢
恰好
恰恰
恰巧
恰如
恰似
千
千万
千万千万
切
切不可
切莫
切切
切勿
窃
亲口
亲身
亲手
亲眼
亲自
顷
顷刻
顷刻间
顷刻之间
请勿
穷年累月
取道
去
权时
全都
全力
全年
全然
全身心
然
人人
仍
仍旧
仍然
日复一日
日见
日渐
日益
日臻
如常
如此等等
如次
如今
如期
如前所述
如上
如下
汝
三番两次
三番五次
三天两头
瑟瑟
沙沙
上
上来
上去
w
e
r
t
y
u
i
o
p
s
d
f
g
h
j
k
l
z
x
c
v
b
n
m
“
”
恩
"
'
(
)
*
A
白
--
..
>>
 [
 ]

<
>
/
\
|
-
_
+
=
&
^
%
#
@
`
(
)
——
—
¥
·
...
‘
’
〉
〈
…
>
<
@
#
$
%
︿
&
*
+
~
|
[
]
{
}
!
#
%
&
'
(
)
*
+
,
-
.
/
100%
100%
10元
:
;
=
?
@
[
\
]
^
_
`
a
amp
b
c
cm
d
e
f
g
gt
h
i
j
k
l
ldquo
love
lt
m
mdash
middot
mm
n
no
o
quot
r
rarr
rdquo
s
sect
t
times
v
w
x
y
z
{
|
}
~
 
、
。
~
‖
“
”
「
」
『
』
〖
〗
【
】
⊙
≮
≯
☆
★
●
◎
◇
◆
■
▲
※
→
〓
!
¥
&
(
)
*
+
,
-
.
/
:
;
>
?
[
\
]
{
}
の
◢
◣
◤
◥
㊣
"
“
”
"
"
‘
’
'
'
〇

-
–
—
―
︱
゛
"
#
$
&
︶
*
﹐
﹑
.
/
﹕
;
@
[
\
]
^
_
﹍
﹎
﹏
{
|
}
~
¨
ˉ
ˇ
˙
‖
‘
’
′
″
﹉
﹊
﹋
﹌
︴
〈
︿
〉
﹀
《
》
「
」
『
﹃
』
【
︻
】
〔
〕
〖
〗
〝
〞
〃
〆
+
∕
⊙
<
=
>
±
×
÷
∈
∏
∑
√
∝
∟
∠
∣
∧
∨
∩
∪
∫
∮
∴
∵
∶
∷
∽
≈
≌
≒
≠
≡
≤
≥
≦
≮
≯
⊥
⊿
⌒
□
△
▼
▽
◇
○
◎
◢
◣
◤
◥
↑
↗
→
↘
↓
↙
←
↖
─
━
┄
┅
┈
┉
═
│
┃
┆
┇
┊
┋
║
┌
┍
┎
┏
╒
╓
╔
╭
┐
┑
┒
┓
╕
╖
╗
╮
└
┕
┖
┗
╘
╙
╚
╰
┘
┙
┚
┛
╛
╜
╝
╯
├
┝
┞
┟
┠
┡
┢
┣
╞
╟
╠
┤
┥
┦
┧
┨
┩
┪
┫
╡
╢
╣
┬
┭
┮
┯
┰
┱
┲
┳
╤
╥
╦
┴
┵
┶
┷
┸
┹
┺
┻
╧
╨
╩
┼
┽
┾
┿
╀
╁
╂
╄
╅
╆
╇
╈
╉
╊
╋
╪
╫
╬
╱
╲
╳
▁
▏
▔
▕
▂
▎
▃
▍
▄
▌
▅
▋
▆
▇
▉
█
▓
¢
£
¤
¥
§
°
·
…
‰
※
〓
☆
♀
♂


================================================
FILE: src/DictBuilder.py
================================================
#!/usr/bin/python
# -*-coding:utf8-*-
'''
Created on 2013-10-12
@author:   zyy_max
@brief: build word, idf dict from input_folder
@modified: 2013-10-15 ==> check whether input a folder or a file
@modified: 2013-11-06 ==> build dict from token list, load ori_dict
'''
from collections import defaultdict
import os
import sys


class WordDictBuilder:
    def __init__(self, ori_path='', filelist=[], tokenlist=[]):
        self.word_dict = defaultdict(int)
        if ori_path != '' and os.path.exists(ori_path):
            with open(ori_path) as ins:
                for line in ins.readlines():
                    self.word_dict[line.split('\t')[1]] = int(line.split('\t')[2])
        self.filelist = filelist
        self.tokenlist = tokenlist

    def run(self):
        for filepath in self.filelist:
            self._updateDict(filepath)
        self._updateDictByTokenList()
        return self

    def _updateDict(self, filepath):
        with open(filepath, 'r') as ins:
            for line in ins.readlines():
                for word in line.rstrip().split():
                    self.word_dict[word] += 1

    def _updateDictByTokenList(self):
        for token in self.tokenlist:
            if isinstance(token, unicode):
                token = token.encode('utf8')
            self.word_dict[token] += 1

    def save(self, filepath):
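        # sort entries by (frequency, word) descending, so ID 0 is the most frequent word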
        l = [(value, key) for key, value in self.word_dict.items()]
        l = sorted(l, reverse=True)
        result_lines = []
        for idx, (value, key) in enumerate(l):
            result_lines.append('%s\t%s\t%s%s' % (idx, key, value, os.linesep))
        with open(filepath, 'w') as outs:
            outs.writelines(result_lines)


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print "Usage:\tWordDictBuilder.py <input_folder/file> <output_file>"
        exit(-1)
    if not os.path.isfile(sys.argv[1]):
        filelist = [sys.argv[1] + os.sep + f for f in os.listdir(sys.argv[1])]
    else:
        filelist = [sys.argv[1]]
    builder = WordDictBuilder(filelist=filelist)
    builder.run()
    builder.save(sys.argv[2])


================================================
FILE: src/DictUtils.py
================================================
#!/usr/bin/env python
'''
Created on 2013-11-14
@author zyy_max
@brief utils for word dictionary
'''

class WordDict(dict):
    """
    @brief init, update and save word dictionary
    """
    def __init__(self, dict_path=None):
        self.dict_path = dict_path
        if dict_path is not None:
            self.load_dict(dict_path)
    def load_dict(self, dict_path):
        self.dict_path = dict_path
        print 'Loading word dictionary from %s...' % dict_path
        self.clear()
        with open(dict_path, 'r') as ins:
            for line in ins.readlines():
                wordid, word = line.strip().split()
                if isinstance(word, str):
                    word = word.decode('utf8')
                self[word] = int(wordid)
        return self
    def add_one(self, word):
        if isinstance(word, str):
            word = word.decode('utf8')
        if not word in self:
            max_id = max([0] + self.values())
            self[word] = max_id+1
        return self
    def save_dict(self, dict_path):
        print 'Saving word dictionary to %s...' % dict_path
        word_list = self.items()
        with open(dict_path, 'w') as outs:
            for word, wordid in sorted(word_list):
                outs.write('%s\t%s\n' % (wordid, word)) 
    def __del__(self):
        # guard: dict_path may still be None if no dictionary was ever loaded
        if self.dict_path is not None:
            self.save_dict(self.dict_path)
       


================================================
FILE: src/DocUtils.py
================================================
#!/usr/bin/env python
'''
Created on 2013-11-14
@author zyy_max
@brief DocDict for loading docs from db or file, update and save them
'''

class DocDict(dict):
    """
    @brief load docs, update and 
    """
    def __init__(self, fpath=None):
        self.fpath = fpath
        if fpath is not None:
            self.load_from_file(fpath)
    def load_from_db(self):
        print 'Loading from db' 
        self.clear()
    def load_from_file(self, fpath):
        print 'Loading documents from file:',fpath
        self.fpath = fpath
        self.clear()
        with open(fpath, 'r') as ins:
            for line in ins.readlines():
                docid, doc_str = line.strip().split('\t')
                self[int(docid)] = doc_str
        return self
    def update(self, docid, doc_str):
        if not docid in self:
            self[docid] = doc_str
        return self
    def save_to_file(self, fpath):
        with open(fpath, 'w') as outs:
            for key in sorted(self.keys()):
                outs.write('%s\t%s\n' %(key, self[key]))
    def __del__(self):
        # guard: fpath may still be None if no file was ever loaded
        if self.fpath is not None:
            self.save_to_file(self.fpath)




================================================
FILE: src/Utils.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
@Created on 2013-10-21
@author zyy_max
@brief utils of common methods
@modified on 2013-10-23 ==> change break condition of cosine(euclidean)_distance_nonzero
'''

import math

def norm_vector_nonzero(ori_vec):
    ori_sum = math.sqrt(sum([math.pow(float(value),2) for (idx,value) in ori_vec]))
    if ori_sum < 1e-6:
        return ori_vec
    result_vec = []
    for idx, ori_value in ori_vec:
        result_vec.append((idx, float(ori_value)/ori_sum))
    #print ori_sum
    return result_vec

def cosine_distance_nonzero(feat_vec1, feat_vec2, norm=True):
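    # Dot product of two index-sorted sparse vectors [(idx, value), ...] via a
    # merge-join over matching indices; with normalized inputs this is cosine similarity.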
    if True == norm:
        feat_vec1 = norm_vector_nonzero(feat_vec1)
        feat_vec2 = norm_vector_nonzero(feat_vec2)
    dist = 0
    idx1 = 0
    idx2 = 0
    while idx1 < len(feat_vec1) and idx2 < len(feat_vec2):
        if feat_vec1[idx1][0] == feat_vec2[idx2][0]:
            dist += float(feat_vec1[idx1][1])*float(feat_vec2[idx2][1])
            idx1 += 1
            idx2 += 1
        elif feat_vec1[idx1][0] > feat_vec2[idx2][0]:
            idx2 += 1
        else:
            idx1 += 1
    return dist

def euclidean_distance_nonzero(feat_vec1, feat_vec2, norm=True):
    if True == norm:
        feat_vec1 = norm_vector_nonzero(feat_vec1)
        feat_vec2 = norm_vector_nonzero(feat_vec2)
    dist = 0
    length = min(len(feat_vec1), len(feat_vec2))
    idx1 = 0
    idx2 = 0
    while idx1 < len(feat_vec1) and idx2 < len(feat_vec2):
        if feat_vec1[idx1][0] > feat_vec2[idx2][0]:
            dist += math.pow(float(feat_vec2[idx2][1]), 2)
            idx2 += 1
        elif feat_vec1[idx1][0] < feat_vec2[idx2][0]:
            dist += math.pow(float(feat_vec1[idx1][1]), 2)
            idx1 += 1
        else:
            dist += math.pow(float(feat_vec1[idx1][1])-float(feat_vec2[idx2][1]), 2)
            idx2 += 1
            idx1 += 1
    return math.sqrt(dist)

def norm_vector(ori_vec):
    ori_sum = math.sqrt(sum([math.pow(float(x),2) for x in ori_vec]))
    if ori_sum < 1e-6:
        return ori_vec
    result_vec = []
    for ori_value in ori_vec:
        result_vec.append(float(ori_value)/ori_sum)
    #print ori_sum
    return result_vec

def cosine_distance(feat_vec1, feat_vec2, norm=True):
    dist = 0
    if True == norm:
        feat_vec1 = norm_vector(feat_vec1)
        feat_vec2 = norm_vector(feat_vec2)
    for idx, feat1 in enumerate(feat_vec1):
        if idx >= len(feat_vec2):
            break
        if abs(float(feat1)) < 1e-6  or abs(float(feat_vec2[idx])) < 1e-6:
            continue
        dist += float(feat1)*float(feat_vec2[idx])
        #print dist
    return dist

def euclidean_distance(feat_vec1, feat_vec2, norm=True):
    dist = 0
    if True == norm:
        feat_vec1 = norm_vector(feat_vec1)
        feat_vec2 = norm_vector(feat_vec2)
    len1 = len(feat_vec1)
    len2 = len(feat_vec2)
    for idx in xrange(min(len1, len2)):
        dist += math.pow(float(feat_vec1[idx])-float(feat_vec2[idx]),2)
    if len1 < len2:
        dist += sum([math.pow(float(feat),2) for feat in feat_vec2[len1-len2:]])
    if len1 > len2:
        dist += sum([math.pow(float(feat),2) for feat in feat_vec1[len2-len1:]])
    return math.sqrt(dist)




================================================
FILE: src/__init__.py
================================================
__author__ = 'max.zhang'


================================================
FILE: src/features.py
================================================
#!/usr/bin/python
#-*-coding:utf8-*-
'''
Created on 2013-10-13
@author: zyy_max
@brief: build feature vector with word_dict and token_list
@modified: 2013-10-15 ==> add update_words for FeatureBuilder
@modified: 2013-11-06 ==> add feature_nonzero
@modified: 2013-11-15 ==> add FeatureBuilderUpdate
                          word_dict is WordDict in DictUtils
'''
import os,sys
class FeatureBuilder:
    def __init__(self, word_dict):
        self.word_dict = word_dict
    
    def compute(self, token_list):
        feature = [0]*len(self.word_dict)
        for token in token_list:
            feature[self.word_dict[token]] += 1
        feature_nonzero = [(idx,value) for idx, value in enumerate(feature) if value > 0]
        return feature_nonzero

    def _add_word(self, word):
        if not word in self.word_dict:
            self.word_dict[word] = len(self.word_dict)

    def update_words(self, word_list=[]):
        for word in word_list:
            self._add_word(word)

class FeatureBuilderUpdate(FeatureBuilder):
    def _add_word(self, word):
        self.word_dict.add_one(word)


def feature_single(inputfile, outputfile):
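    # NOTE: relies on the module-level FeatureBuilder 'fb' created in __main__ below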
    print inputfile,outputfile
    result_lines = []
    with open(inputfile, 'r') as ins:
        for lineidx, line in enumerate(ins.readlines()):
            feature = fb.compute([token.decode('utf8') for token in line.strip().split()])
            l = []
            for idx,f in feature:
                if f > 1e-6:
                    l.append('%s:%s' %(idx,f))
            result_lines.append(' '.join(l) + os.linesep)
            print 'Finished\r', lineidx,
    with open(outputfile, 'w') as outs:
        outs.writelines(result_lines)
    print 'Wrote to ', outputfile

if __name__=="__main__":
    if len(sys.argv) < 5:
        print "Usage:\tfeature.py -s/-m <word_dict_path> <tokens_file/tokens_folder> <feature_file/feature_folder>"
        exit(-1)
    word_dict = {}
    with open(sys.argv[2], 'r') as ins:
        for line in ins.readlines():
            l = line.split()
            word_dict[l[1].decode('utf8')] = int(l[0])
    fb = FeatureBuilder(word_dict)
    print 'Loaded', len(word_dict), 'words'
    if sys.argv[1] == '-s':
        feature_single(sys.argv[3], sys.argv[4])
    elif sys.argv[1] == '-m':
        for inputfile in os.listdir(sys.argv[3]):
            feature_single(os.path.join(sys.argv[3],inputfile), os.path.join(sys.argv[4],inputfile.replace('.token','.feat')))


================================================
FILE: src/isSimilar.py
================================================
#!/usr/bin/env python
# -*-coding:utf8-*-
'''
Created on 2013-11-06
@author zyy_max
@brief check the similarity of 2 documents by VSM+cosine distance or simhash+hamming distance
'''
import sys
from simhash_imp import SimhashBuilder, hamming_distance
from tokens import JiebaTokenizer
from features import FeatureBuilder
from Utils import norm_vector_nonzero, cosine_distance_nonzero


class DocFeatLoader:
    def __init__(self, simhash_builder, feat_nonzero):
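        # normalize the sparse feature vector, then derive its simhash fingerprint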
        self.feat_vec = feat_nonzero
        self.feat_vec = norm_vector_nonzero(self.feat_vec)
        self.fingerprint = simhash_builder.sim_hash_nonzero(self.feat_vec)


if __name__ == "__main__":
    if len(sys.argv) < 7:
        print "Usage:\tisSimilar.py <doc1> <doc2> <stopword_path> <word_dict> <-c/-s> <threshold>"
        exit(-1)
    doc_path_1, doc_path_2, stopword_path, word_dict, mode, threshold = sys.argv[1:]
    print 'Arguments:', sys.argv[1:]
    with open(doc_path_1) as ins:
        doc_data_1 = ins.read().decode('utf8')
        print 'Loaded', doc_path_1
    with open(doc_path_2) as ins:
        doc_data_2 = ins.read().decode('utf8')
        print 'Loaded', doc_path_2

    # Init tokenizer
    jt = JiebaTokenizer(stopword_path, 'c')

    # Tokenization
    doc_token_1 = jt.tokens(doc_data_1)
    doc_token_2 = jt.tokens(doc_data_2)

    print 'Loading word dict...'
    # Load word list from word_dict
    word_list = []
    with open(word_dict, 'r') as ins:
        for line in ins.readlines():
            word_list.append(line.split()[1])

    # Build unicode string word dict
    word_dict = {}
    for idx, ascword in enumerate(word_list):
        word_dict[ascword.decode('utf8')] = idx
    # Build nonzero-feature
    fb = FeatureBuilder(word_dict)
    doc_feat_1 = fb.compute(doc_token_1)
    doc_feat_2 = fb.compute(doc_token_2)

    # Init simhash_builder
    smb = SimhashBuilder(word_list)

    doc_fl_1 = DocFeatLoader(smb, doc_feat_1)
    doc_fl_2 = DocFeatLoader(smb, doc_feat_2)

    if mode == '-c':
        print 'Matching by VSM + cosine distance'
        dist = cosine_distance_nonzero(doc_fl_1.feat_vec, doc_fl_2.feat_vec, norm=False)
        if dist > float(threshold):
            print 'Matching Result:\t<True:%s>' % dist
        else:
            print 'Matching Result:\t<False:%s>' % dist
    elif mode == '-s':
        print 'Matching by Simhash + hamming distance'
        dist = hamming_distance(doc_fl_1.fingerprint, doc_fl_2.fingerprint)
        if dist < float(threshold):
            print 'Matching Result:\t<True:%s>' % dist
        else:
            print 'Matching Result:\t<False:%s>' % dist


================================================
FILE: src/launch.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
Created on 2013-10-14
@author: zyy_max
@brief: launch entry of near-duplicate detection system
'''

import os
import sys
from tokens import JiebaTokenizer
from simhash_imp import SimhashBuilder, hamming_distance
from features import FeatureBuilder

if __name__=="__main__":
    if len(sys.argv) < 7:
        print "Usage:\tlaunch.py word_dict_path stop_words_path fingerprint_path documents_path test_path result_path"
        exit(-1)
    # Load word list
    word_list = []
    with open(sys.argv[1], 'r') as ins:
        for line in ins.readlines():
            word_list.append(line.split()[1])
    # Init tokenizer
    jt = JiebaTokenizer(sys.argv[2], 'c')
    # Init feature_builder
    word_dict = {}
    for idx, ascword in enumerate(word_list):
        word_dict[ascword.decode('utf8')] = idx
    fb = FeatureBuilder(word_dict)
    # Init simhash_builder
    smb = SimhashBuilder(word_list)
    # Load fingerprint list
    fingerprint_list = []
    with open(sys.argv[3], 'r') as ins:
        for line in ins.readlines():
            fingerprint_list.append(int(line))
    # For exp: load document content
    doc_list = []
    with open(sys.argv[4], 'r') as ins:
        for line in ins.readlines():
            doc_list.append(line.strip())
    # Detection process begins
    min_sim = 64
    min_docid = 0
    with open(sys.argv[5], 'r') as ins:
        for lineidx, line in enumerate(ins.readlines()):
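            # NOTE: hard-coded debug filter left in by the author; only line 642
            # of the test file is actually processed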
            if lineidx != 642:
                continue
            # Tokenize
            tokens = jt.tokens(line.strip().decode('utf8'))
            # Compute text feature
            feature = fb.compute(tokens)
            # Compute simhash
            fingerprint = smb.sim_hash(feature)
            result_list = []
            for idx, fp in enumerate(fingerprint_list):
                sim = hamming_distance(fingerprint, fp, 64)
                result_list.append((sim, idx))
            result_list = sorted(result_list, cmp=lambda x,y: cmp(x[0],y[0]))
            if result_list[0][0] < min_sim:
                min_sim, min_docid = result_list[0][0], lineidx
            #'''
            with open(sys.argv[6], 'w') as outs:
                outs.write(line.strip()+os.linesep)
                for sim, idx in result_list:
                    outs.write('%s\t%s%s' %(sim, doc_list[idx], os.linesep)) 
            #'''
            #if lineidx == 2:
            #    break           
    print min_sim, min_docid



================================================
FILE: src/launch_incre.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
Created on 2013-10-15
@author: zyy_max
@brief: incremental-version launch entry of near-duplicate detection system
'''

import os
import sys
from tokens import JiebaTokenizer
from simhash_imp import SimhashBuilder, hamming_distance
from features import FeatureBuilder


class FeatureContainer:
    def __init__(self, word_dict_path):
        # Load word list
        self.word_dict_path = word_dict_path
        self.word_list = []
        with open(word_dict_path, 'r') as ins:
            for line in ins.readlines():
                self.word_list.append(line.split()[1])
        self.word_dict = {}
        for idx, ascword in enumerate(self.word_list):
            self.word_dict[ascword.decode('utf8')] = idx
        self.fb = FeatureBuilder(self.word_dict)
        self.smb = SimhashBuilder(self.word_list)
        print 'Loaded ', len(self.word_list), 'words'

    def compute_feature(self, token_list):
        new_words = []
        for token in token_list:
            if not token in self.word_dict:
                new_words.append(token)
        if len(new_words) != 0:
            # Update word_list and word_dict
            self.fb.update_words(new_words)
            self.smb.update_words([word.encode('utf8') for word in new_words])
            self.word_dict = self.fb.word_dict
            self.word_list.extend([word.encode('utf8') for word in new_words])
        feature_vec = self.fb.compute(token_list)
        return feature_vec, self.smb.sim_hash(feature_vec)
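# NOTE: the __del__ below is disabled: wrapped in a module-level string literal,
# it is a no-op expression rather than a method of FeatureContainer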
'''
    def __del__(self):
        with open(self.word_dict_path, 'w') as outs:
            for idx, word in enumerate(self.word_list):
                outs.write('%s\t%s%s'%(idx, word, os.linesep))
'''
if __name__=="__main__":
    if len(sys.argv) < 7:
        print "Usage:\tlaunch_inc.py <word_dict_path> <stop_words_path> <fingerprint_path> <documents_path> <test_path> <result_path>"
        exit(-1)
    # Init tokenizer
    jt = JiebaTokenizer(sys.argv[2], 'c')
    # Init feature_builder and simhash_builder 
    fc = FeatureContainer(sys.argv[1])
    # Load fingerprint list
    fingerprint_list = []
    with open(sys.argv[3], 'r') as ins:
        for line in ins.readlines():
            fingerprint_list.append(int(line))
    # For exp: load document content
    doc_list = []
    with open(sys.argv[4], 'r') as ins:
        for line in ins.readlines():
            doc_list.append(line.strip())
    # Detection process begins
    min_sim = 64
    min_docid = 0
    with open(sys.argv[5], 'r') as ins:
        for lineidx, line in enumerate(ins.readlines()):
            # Tokenize
            tokens = jt.tokens(line.strip().decode('utf8'))
            feature, fingerprint = fc.compute_feature(tokens)
            result_list = []
            for idx, fp in enumerate(fingerprint_list):
                sim = hamming_distance(fingerprint, fp, 64)
                result_list.append((sim, idx))
            result_list = sorted(result_list, cmp=lambda x,y: cmp(x[0],y[0]))
            if result_list[0][0] < min_sim:
                min_sim, min_docid = result_list[0][0], lineidx
            #'''
            with open(sys.argv[6], 'w') as outs:
                outs.write(line.strip()+os.linesep)
                for sim, idx in result_list:
                    outs.write('%s\t%s%s' %(sim, doc_list[idx], os.linesep)) 
            #'''
            #if lineidx == 2:
            #    break   
    with open('word_dict_new.txt', 'w') as outs:
        for idx, word in enumerate(fc.word_list):
            outs.write('%s\t%s%s'%(idx, word, os.linesep))
            


================================================
FILE: src/preprocess.py
================================================
#!/usr/bin/env python
#-*-coding:utf8-*-
'''
Created on 2013-11-06
@author zyy_max
@brief update word_dict by token result of document
'''
import os
import sys
import time
from tokens import JiebaTokenizer
from DictBuilder import WordDictBuilder

if __name__=="__main__":
    if len(sys.argv) < 4:
        print "Usage:\tpreprocess.py <docpath> <stopword_path> <worddict_path>"
        exit(-1)
    doc_path, stopword_path, worddict_path = sys.argv[1:]
    print 'Arguments:',sys.argv[1:]
    
    # Init tokenizer
    jt = JiebaTokenizer(stopword_path, 'c')
    # Load doc data
    with open(doc_path) as ins:
        doc_data = ins.read().decode('utf8')
    # Tokenization
    doc_tokens = jt.tokens(doc_data)
    # Write to token file
    with open(doc_path[:doc_path.rfind('.')]+'.token', 'w') as outs:
        outs.write('/'.join([token.encode('utf8') for token in doc_tokens]))
    
    # Load original word dict, update and save
    wdb = WordDictBuilder(worddict_path, tokenlist=doc_tokens)
    wdb.run()
    wdb.save(worddict_path)
    print 'Totally', len(wdb.word_dict), 'words'
    


================================================
FILE: src/simhash_imp.py
================================================
#!/usr/bin/env python
# -*- coding=utf-8 -*-
'''
Created on 2013-10-13
@author: zyy_max
@brief: build simhash and compute hamming_distance
@modified: 2013-10-15 ==> add update_words for SimhashBuilder
'''

# Implementation of Charikar simhashes in Python
# See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1

import os, sys

def hamming_distance(hash_a, hash_b, hashbits=128):
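    # XOR the fingerprints, mask to hashbits, then count the set bits with
    # Kernighan's trick (x &= x-1 clears the lowest set bit)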
    x = (hash_a ^ hash_b) & ((1 << hashbits) - 1)
    tot = 0
    while x:
        tot += 1
        x &= x-1
    return tot
class SimhashBuilder:
    def __init__(self, word_list=[], hashbits=128):
        self.hashbits = hashbits
        self.hashval_list = [self._string_hash(word) for word in word_list]
        print 'Totally: %s words' %(len(self.hashval_list),)
        """
        with open('word_hash.txt', 'w') as outs:
            for word in word_list:
                outs.write(word+'\t'+str(self._string_hash(word))+os.linesep)
        """

    def _string_hash(self, word):
        # A variable-length version of Python's builtin hash
        if word == "":
            return 0
        else:
            x = ord(word[0])<<7
            m = 1000003
            mask = 2**self.hashbits-1
            for c in word:
                x = ((x*m)^ord(c)) & mask
            x ^= len(word)
            if x == -1:
                x = -2
            return x

    def sim_hash_nonzero(self, feature_vec):
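        # classic Charikar simhash: each feature votes +weight on bit i if that
        # bit is set in its word hash, and -weight otherwise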
        finger_vec = [0]*self.hashbits
        # Feature_vec is like [(idx,nonzero-value),(idx,nonzero-value)...]
        for idx, feature in feature_vec:
            hashval = self.hashval_list[int(idx)]
            for i in range(self.hashbits):
                bitmask = 1<<i
                if bitmask&hashval != 0:
                    finger_vec[i] += float(feature)
                else:
                    finger_vec[i] -= float(feature)
        #print finger_vec
        fingerprint = 0
        for i in range(self.hashbits):
            if finger_vec[i] >= 0:
                fingerprint += 1 << i
        # the document fingerprint has bit i set wherever finger_vec[i] >= 0
        return fingerprint
    
    def sim_hash(self, feature_vec):
        finger_vec = [0]*self.hashbits
        for idx, feature in enumerate(feature_vec):
            if float(feature) < 1e-6:
                continue
            hashval = self.hashval_list[idx]
            for i in range(self.hashbits):
                bitmask = 1<<i
                if bitmask&hashval != 0:
                    finger_vec[i] += float(feature)
                else:
                    finger_vec[i] -= float(feature)
        #print finger_vec
        fingerprint = 0
        for i in range(self.hashbits):
            if finger_vec[i] >= 0:
                fingerprint += 1 << i
        # the document fingerprint has bit i set wherever finger_vec[i] >= 0
        return fingerprint

    def _add_word(self, word):
        self.hashval_list.append(self._string_hash(word))

    def update_words(self, word_list=[]):
        for word in word_list:
            self._add_word(word)

class simhash():
    def __init__(self, tokens='', hashbits=128):
        self.hashbits = hashbits
        self.hash = self.simhash(tokens)

    def __str__(self):
        return str(self.hash)

    def __long__(self):
        return long(self.hash)

    def __float__(self):
        return float(self.hash)

    def simhash(self, tokens):
        # Returns a Charikar simhash with appropriate bitlength
        v = [0]*self.hashbits

        for t in [self._string_hash(x) for x in tokens]:
            bitmask = 0
            #print (t)
            for i in range(self.hashbits):
                bitmask = 1 << i
                #print(t,bitmask, t & bitmask)
                if t & bitmask:
                    v[i] += 1  # bit i of the token hash is set: vote this bit up
                else:
                    v[i] -= 1  # bit i is clear: vote this bit down

        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i
        # the document fingerprint has bit i set wherever v[i] >= 0
        return fingerprint

    def _string_hash(self, v):
        # A variable-length version of Python's builtin hash
        if v == "":
            return 0
        else:
            x = ord(v[0])<<7
            m = 1000003
            mask = 2**self.hashbits-1
            for c in v:
                x = ((x*m)^ord(c)) & mask
            x ^= len(v)
            if x == -1:
                x = -2
            return x

    def hamming_distance(self, other_hash):
        x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
        tot = 0
        while x:
            tot += 1
            x &= x-1
        return tot

    def similarity(self, other_hash):
        a = float(self.hash)
        b = float(other_hash)
        if a>b: return b/a
        return a/b

if __name__ == '__main__':
    # demo (commented out): simhash two nearly identical Chinese sentences
    #s = '看看哪些东西google最看重?标点?'
    #hash1 =simhash(s.split())
    #print("0x%x" % hash1)
    #print ("%s\t0x%x" % (s, hash1))

    #s = '看看哪些东西google最看重!标点!'
    #hash2 = simhash(s.split())
    #print ("%s\t[simhash = 0x%x]" % (s, hash2))

    #print '%f%% percent similarity on hash' %(100*(hash1.similarity(hash2)))
    #print hash1.hamming_distance(hash2),"bits differ out of", hash1.hashbits

    if len(sys.argv) < 4:
        print "Usage:\tsimhash_imp.py <word_dict_path> <feature_file> <finger_print_file>"
        exit(-1)
    word_list = []
    with open(sys.argv[1], 'r') as ins:
        for idx, line in enumerate(ins.readlines()):
            word_list.append(line.split()[1])
            print '\rloading word', idx,
    sim_b = SimhashBuilder(word_list)
    result_lines = []
    print ''
    with open(sys.argv[2], 'r') as ins:
        for idx, line in enumerate(ins.readlines()):
            print '\rprocessing doc', idx,
            feature_vec = line.strip().split()
            feature_vec = [(int(item.split(':')[0]),float(item.split(':')[1])) for item in feature_vec]
            fingerprint = sim_b.sim_hash_nonzero(feature_vec)
            result_lines.append(str(fingerprint)+os.linesep)
    with open(sys.argv[3], 'w') as outs:
        outs.writelines(result_lines)





================================================
FILE: src/tokens.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 20131012
@author:    zyy_max

@brief: get tokens from input file by jieba
'''
import jieba
import os
import sys


class JiebaTokenizer:
    def __init__(self, stop_words_path, mode='s'):
        self.stopword_set = set()
        # load stopwords
        with open(stop_words_path) as ins:
            for line in ins:
                self.stopword_set.add(line.strip().decode('utf8'))
        self.mode = mode

    def tokens(self, intext):
        intext = u' '.join(intext.split())
        if self.mode == 's':
            token_list = jieba.cut_for_search(intext)
        else:
            token_list = jieba.cut(intext)
        return [token for token in token_list if token.strip() != u'' and not token in self.stopword_set]


def token_single_file(input_fname, output_fname):
    result_lines = []
    with open(input_fname) as ins:
        for line in ins:
            line = line.strip().decode('utf8')
            tokens = jt.tokens(line)
            result_lines.append(u' '.join(tokens).encode('utf8'))
    open(output_fname, 'w').write(os.linesep.join(result_lines))
    print 'Wrote to ', output_fname


if __name__ == "__main__":
    if len(sys.argv) < 6 or sys.argv[1] not in ['-s', '-m'] or sys.argv[4] not in ['c', 's']:
        print "Usage:\ttokens.py <file_mode(-s/-m)> <input_file/input_folder> " \
              "<output_file/output_folder> <cut_mode(c/s)> <stopword.list>"
        print "file_mode:\t-s:\tsingle file"
        print "\t\t-m:\tmultiple files"
        print "cut_mode:\tc:\tnormal mode of Jieba"
        print "\t\ts:\tcut_for_search mode of Jieba"
        exit(-1)
    file_mode, input_filepath, output_filepath, cut_mode, stopword_file = sys.argv[1:]
    jt = JiebaTokenizer(stopword_file, cut_mode)
    # extract tokens and filter by stopwords
    if file_mode == '-s':
        token_single_file(input_filepath, output_filepath)
    elif file_mode == '-m':
        for input_file in os.listdir(input_filepath):
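            # os.listdir() yields bare file names (no os.sep), so this rsplit keeps
            # the full name; the output file is '<input name>.token'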
            prefix = input_file.rsplit(os.sep, 1)[0]
            token_single_file(os.path.join(input_filepath, input_file),
                              os.path.join(output_filepath, prefix+'.token'))


================================================
FILE: src/webcontent_filter.sh
================================================
#!/bin/bash
# Delete non-printable characters
# Delete 0-9a-zA-Z and some other useless characters
# Collapse runs of spaces into a single space
# Delete empty lines
sed 's/[^[:print:]]//g' $1 \
| sed 's/[0-9a-zA-Z+=\./:\"<>|_&#]/ /g' \
| sed 's/  */ /g' \
| sed '/^ *$/d' > $2


================================================
FILE: test/test_token.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 20150825
@author:    zyy_max

@brief: unit test of src/tokens.py
'''
import unittest
import sys
sys.path.append('..')
from src.tokens import JiebaTokenizer


class JiebaTokenizerTestCase(unittest.TestCase):

    def setUp(self):
        self.jt = JiebaTokenizer("../data/stopwords.txt")

    def testTokens(self):
        in_text = u"完整的单元测试很少只执行一个测试用例," \
                  u"开发人员通常都需要编写多个测试用例才能" \
                  u"对某一软件功能进行比较完整的测试,这些" \
                  u"相关的测试用例称为一个测试用例集,在" \
                  u"PyUnit中是用TestSuite类来表示的。"
        tokens_text = u"完整/单元/测试/单元测试/只/执行/" \
                      u"一个/测试/试用/测试用例/开发/发人/" \
                      u"人员/开发人员/通常/需要/编写/多个/" \
                      u"测试/试用/测试用例/软件/功能/进行/" \
                      u"比较/完整/测试/相关/测试/试用/测试用例/" \
                      u"称为/一个/测试/试用/测试用例/集/PyUnit/" \
                      u"中是/TestSuite/类来/表示"
        self.assertEqual(tokens_text, u'/'.join(self.jt.tokens(in_text)), "Tokenization Results differ")

if __name__ == "__main__":
    unittest.main()
SYMBOL INDEX (62 symbols across 10 files)

FILE: src/DictBuilder.py
  class WordDictBuilder (line 15) | class WordDictBuilder:
    method __init__ (line 16) | def __init__(self, ori_path='', filelist=[], tokenlist=[]):
    method run (line 25) | def run(self):
    method _updateDict (line 31) | def _updateDict(self, filepath):
    method _updateDictByTokenList (line 37) | def _updateDictByTokenList(self):
    method save (line 43) | def save(self, filepath):

FILE: src/DictUtils.py
  class WordDict (line 8) | class WordDict(dict):
    method __init__ (line 12) | def __init__(self, dict_path=None):
    method load_dict (line 15) | def load_dict(self, dict_path):
    method add_one (line 26) | def add_one(self, word):
    method save_dict (line 33) | def save_dict(self, dict_path):
    method __del__ (line 39) | def __del__(self):

FILE: src/DocUtils.py
  class DocDict (line 8) | class DocDict(dict):
    method __init__ (line 12) | def __init__(self, fpath=None):
    method load_from_db (line 16) | def load_from_db(self):
    method load_from_file (line 19) | def load_from_file(self, fpath):
    method update (line 28) | def update(self, docid, doc_str):
    method save_to_file (line 32) | def save_to_file(self, fpath):
    method __del__ (line 36) | def __del__(self):

FILE: src/Utils.py
  function norm_vector_nonzero (line 12) | def norm_vector_nonzero(ori_vec):
  function cosine_distance_nonzero (line 22) | def cosine_distance_nonzero(feat_vec1, feat_vec2, norm=True):
  function euclidean_distance_nonzero (line 40) | def euclidean_distance_nonzero(feat_vec1, feat_vec2, norm=True):
  function norm_vector (line 61) | def norm_vector(ori_vec):
  function cosine_distance (line 71) | def cosine_distance(feat_vec1, feat_vec2, norm=True):
  function euclidean_distance (line 85) | def euclidean_distance(feat_vec1, feat_vec2, norm=True):

FILE: src/features.py
  class FeatureBuilder (line 13) | class FeatureBuilder:
    method __init__ (line 14) | def __init__(self, word_dict):
    method compute (line 17) | def compute(self, token_list):
    method _add_word (line 24) | def _add_word(self, word):
    method update_words (line 28) | def update_words(self, word_list=[]):
  class FeatureBuilderUpdate (line 32) | class FeatureBuilderUpdate(FeatureBuilder):
    method _add_word (line 33) | def _add_word(self, word):
  function feature_single (line 37) | def feature_single(inputfile, outputfile):

FILE: src/isSimilar.py
  class DocFeatLoader (line 15) | class DocFeatLoader:
    method __init__ (line 16) | def __init__(self, simhash_builder, feat_nonzero):

FILE: src/launch_incre.py
  class FeatureContainer (line 16) | class FeatureContainer:
    method __init__ (line 17) | def __init__(self, word_dict_path):
    method compute_feature (line 31) | def compute_feature(self, token_list):

FILE: src/simhash_imp.py
  function hamming_distance (line 15) | def hamming_distance(hash_a, hash_b, hashbits=128):
  class SimhashBuilder (line 22) | class SimhashBuilder:
    method __init__ (line 23) | def __init__(self, word_list=[], hashbits=128):
    method _string_hash (line 33) | def _string_hash(self, word):
    method sim_hash_nonzero (line 48) | def sim_hash_nonzero(self, feature_vec):
    method sim_hash (line 67) | def sim_hash(self, feature_vec):
    method _add_word (line 87) | def _add_word(self, word):
    method update_words (line 90) | def update_words(self, word_list=[]):
  class simhash (line 94) | class simhash():
    method __init__ (line 95) | def __init__(self, tokens='', hashbits=128):
    method __str__ (line 99) | def __str__(self):
    method __long__ (line 102) | def __long__(self):
    method __float__ (line 105) | def __float__(self):
    method simhash (line 108) | def simhash(self, tokens):
    method _string_hash (line 130) | def _string_hash(self, v):
    method hamming_distance (line 145) | def hamming_distance(self, other_hash):
    method similarity (line 153) | def similarity(self, other_hash):

FILE: src/tokens.py
  class JiebaTokenizer (line 14) | class JiebaTokenizer:
    method __init__ (line 15) | def __init__(self, stop_words_path, mode='s'):
    method tokens (line 23) | def tokens(self, intext):
  function token_single_file (line 32) | def token_single_file(input_fname, output_fname):

FILE: test/test_token.py
  class JiebaTokenizerTestCase (line 15) | class JiebaTokenizerTestCase(unittest.TestCase):
    method setUp (line 17) | def setUp(self):
    method testTokens (line 20) | def testTokens(self):
