Repository: someus/TextRank4ZH
Branch: master
Commit: 78c9b1165787
Files: 24
Total size: 42.3 KB

Directory structure:
gitextract_tdn32462/

├── .gitignore
├── HISTORY.md
├── LICENSE
├── README.md
├── example/
│   ├── example01.py
│   └── example02.py
├── setup.py
├── test/
│   ├── Segmentation_test.py
│   ├── TextRank4Keyword_test.py
│   ├── TextRank4Sentence_test.py
│   ├── codecs_test.py
│   ├── doc/
│   │   ├── 01.txt
│   │   ├── 02.txt
│   │   ├── 03.txt
│   │   ├── 04.txt
│   │   └── 05.txt
│   ├── jieba_test.py
│   └── util_test.py
└── textrank4zh/
    ├── Segmentation.py
    ├── TextRank4Keyword.py
    ├── TextRank4Sentence.py
    ├── __init__.py
    ├── stopwords.txt
    └── util.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
build
dist
MANIFEST
textrank4zh/__pycache__
*.pyc
test.py

================================================
FILE: HISTORY.md
================================================
### 2014

Implemented the main features.

### 2015-12

Updated to v0.2.

The interface has changed.



================================================
FILE: LICENSE
================================================
The MIT License (MIT)

Copyright (c) 2015 Letian Sun

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.



================================================
FILE: README.md
================================================
# TextRank4ZH

The TextRank algorithm can extract keywords and a summary (the most important sentences) from a text. TextRank4ZH is a Python implementation of TextRank for Chinese text.

## Installation

Option 1:
```
$ python setup.py install --user
```

Option 2:
```
$ sudo python setup.py install
```

Option 3:
```
$ pip install textrank4zh --user
```

Option 4:
```
$ sudo pip install textrank4zh
```

Under Python 3, replace `python` with `python3` and `pip` with `pip3` in the commands above.


## Uninstallation
```plain
$ pip uninstall textrank4zh
```

## Dependencies
jieba >= 0.35  
numpy >= 1.7.1  
networkx >= 1.9.1  

## Compatibility
Tested with Python 2.7.9 and Python 3.4.3.


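## How it works

For the details of TextRank, see:

> Mihalcea R, Tarau P. TextRank: Bringing order into texts[C]. Association for Computational Linguistics, 2004.

For an introduction to how TextRank4ZH works and how to use it, see [使用TextRank算法为文本生成关键字和摘要](https://www.letiantian.xyz/p/101666.html) ("Generating keywords and summaries for text with the TextRank algorithm").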

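### Keyword extraction
The original text is split into sentences. In each sentence, stop words are filtered out (optional) and only words with the specified parts of speech are kept (optional). This yields a set of sentences and a set of words.

Each word is a node in the PageRank graph. Let the window size be k, and suppose a sentence consists of the following words in order:
```
w1, w2, w3, w4, w5, ..., wn
```
Then `w1, w2, ..., wk`, `w2, w3, ..., wk+1`, `w3, w4, ..., wk+2`, and so on are each a window. The nodes of any two words in the same window are joined by an undirected, unweighted edge.

The importance of each word node is computed on this graph, and the most important words are taken as keywords.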
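
The window-based graph and the ranking step can be sketched in plain Python. This is a minimal illustration, not the package's own code: the package imports networkx, while this sketch uses a hand-written power iteration, and the names `build_cooccurrence_edges` and `pagerank` as well as the toy word lists are assumptions made for the example.

```python
from itertools import combinations


def build_cooccurrence_edges(sentences, window=2):
    """Collect undirected, unweighted edges between words that share a window."""
    edges = set()
    for words in sentences:
        # Slide a window of `window` words over the sentence.
        for start in range(max(len(words) - window + 1, 1)):
            for a, b in combinations(words[start:start + window], 2):
                if a != b:
                    edges.add(frozenset((a, b)))
    return edges


def pagerank(edges, alpha=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph (illustrative only)."""
    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if not neighbours:
        return {}
    n = len(neighbours)
    score = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbour's current score.
        score = {
            node: (1 - alpha) / n
            + alpha * sum(score[m] / len(neighbours[m]) for m in neighbours[node])
            for node in neighbours
        }
    return score


# Hand-written toy segmentation (hypothetical, not jieba output).
sentences = [["酒店", "位于", "北京", "东三环"], ["北京", "酒店", "答谢"]]
scores = pagerank(build_cooccurrence_edges(sentences, window=2))
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph `酒店` and `北京` each have three neighbours and therefore rank highest; on real text, the highest-ranked words would be the keyword candidates.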


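### Keyphrase extraction
First extract a number of keywords as described under keyword extraction. If several of these keywords appear next to each other in the original text, they can be joined into a keyphrase.

For example, in an article about `支持向量机` (support vector machine), the keywords `支持`, `向量`, and `机` may be found; keyphrase extraction then yields `支持向量机`.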
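
Joining adjacent keywords can be sketched as below. The logic loosely follows `get_keyphrases` in `textrank4zh/TextRank4Keyword.py`; the function name `merge_adjacent_keywords`, the sample text, and the hand-written segmentation are assumptions made for the example.

```python
def merge_adjacent_keywords(sentences, keywords, text, min_occur_num=2):
    """Join runs of adjacent keywords into phrases that occur often enough in text."""
    keywords = set(keywords)
    phrases = set()
    for words in sentences:
        run = []
        # None is a sentinel that flushes the run left at the end of a sentence.
        for word in list(words) + [None]:
            if word in keywords:
                run.append(word)
            else:
                if len(run) > 1:
                    phrases.add("".join(run))
                run = []
    return sorted(p for p in phrases if text.count(p) >= min_occur_num)


# Hypothetical input: the segmentation below is written by hand.
text = "支持向量机是一种分类器。支持向量机也被称为最大边缘区分类器。"
sentences = [
    ["支持", "向量", "机", "是", "一种", "分类器"],
    ["支持", "向量", "机", "也", "被", "称为", "最大", "边缘", "区", "分类器"],
]
phrases = merge_adjacent_keywords(sentences, ["支持", "向量", "机"], text)
```

The `min_occur_num` threshold drops phrases that appear only by chance; a single keyword on its own never forms a phrase.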

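### Summary generation
Each sentence is treated as a node in the graph. If two sentences are similar, their nodes are joined by an undirected, weighted edge whose weight is the similarity.

The sentences that PageRank ranks as most important can be used as the summary.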
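
A rough sketch of sentence ranking follows. It assumes the overlap-based similarity from the cited TextRank paper (shared words divided by the sum of the log sentence lengths); the package's own `util.get_similarity` is not shown here and may differ. The names `similarity` and `rank_sentences` and the toy sentences are illustrative.

```python
import math


def similarity(words_a, words_b):
    """Overlap similarity as in the TextRank paper (assumed, see lead-in)."""
    if not words_a or not words_b:
        return 0.0
    common = len(set(words_a) & set(words_b))
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if common == 0 or denominator <= 0:
        return 0.0
    return common / denominator


def rank_sentences(sentences, alpha=0.85, iterations=50):
    """Weighted PageRank by power iteration over the sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    score = [1.0 / n] * n
    for _ in range(iterations):
        # A sentence with no similar neighbour passes nothing on (simplification).
        score = [
            (1 - alpha) / n
            + alpha * sum(
                score[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return score


# Hand-written toy segmentation (hypothetical).
sentences = [
    ["高圆圆", "赵又廷", "答谢", "宴"],
    ["高圆圆", "答谢", "宴", "宾客"],
    ["酒店", "位于", "北京"],
]
scores = rank_sentences(sentences)
```

The first two sentences share three words and reinforce each other, so they outrank the third, which has no similar neighbour.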


## Examples
See [example](./example) and [test](./test).

example/example01.py:

```python
#-*- encoding:utf-8 -*-
from __future__ import print_function

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

text = codecs.open('../test/doc/01.txt', 'r', 'utf-8').read()
tr4w = TextRank4Keyword()

tr4w.analyze(text=text, lower=True, window=2)  # In py2, text must be a utf8-encoded str or a unicode object; in py3, utf8-encoded bytes or a str object

print( '关键词：' )
for item in tr4w.get_keywords(20, word_min_len=1):
    print(item.word, item.weight)

print()
print( '关键短语：' )
for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num= 2):
    print(phrase)

tr4s = TextRank4Sentence()
tr4s.analyze(text=text, lower=True, source = 'all_filters')

print()
print( '摘要：' )
for item in tr4s.get_key_sentences(num=3):
    print(item.index, item.weight, item.sentence)  # index is the sentence's position in the text; weight is its score
```

The output is:
```plain
关键词：
媒体 0.02155864734852778
高圆圆 0.020220281898126486
微 0.01671909730824073
宾客 0.014328439104001788
赵又廷 0.014035488254875914
答谢 0.013759845912857732
谢娜 0.013361244496632448
现身 0.012724133346018603
记者 0.01227742092899235
新人 0.01183128428494362
北京 0.011686712993089671
博 0.011447168887452668
展示 0.010889176260920504
捧场 0.010507502237123278
礼物 0.010447275379792245
张杰 0.009558332870902892
当晚 0.009137982757893915
戴 0.008915271161035208
酒店 0.00883521621207796
外套 0.008822082954131174

关键短语：
微博

摘要：
0 0.0709719557171 中新网北京12月1日电(记者 张曦) 30日晚，高圆圆和赵又廷在京举行答谢宴，诸多明星现身捧场，其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等
6 0.0541037236415 高圆圆身穿粉色外套，看到大批记者在场露出娇羞神色，赵又廷则戴着鸭舌帽，十分淡定，两人快步走进电梯，未接受媒体采访
27 0.0490428312984 记者了解到，出席高圆圆、赵又廷答谢宴的宾客近百人，其中不少都是女方的高中同学

```

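## Usage

When processing a text, the classes TextRank4Keyword and TextRank4Sentence split it into four forms:

* sentences: a list of sentences.
* words_no_filter: a two-level list obtained by segmenting each sentence in sentences into words.
* words_no_stop_words: a two-level list obtained by removing stop words from words_no_filter.
* words_all_filters: a two-level list obtained by keeping only the words in words_no_stop_words that have the specified parts of speech.

For example, for the text: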
```
这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。
```

```python
#-*- encoding:utf-8 -*-
from __future__ import print_function
import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
tr4w = TextRank4Keyword()

tr4w.analyze(text=text, lower=True, window=2)

print()
print('sentences:')
for s in tr4w.sentences:
    print(s)                 # unicode in py2, str in py3

print()
print('words_no_filter')
for words in tr4w.words_no_filter:
    print('/'.join(words))   # unicode in py2, str in py3

print()
print('words_no_stop_words')
for words in tr4w.words_no_stop_words:
    print('/'.join(words))   # unicode in py2, str in py3

print()
print('words_all_filters')
for words in tr4w.words_all_filters:
    print('/'.join(words))   # unicode in py2, str in py3
```

The output is:
```plain
sentences:
这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足
答谢宴于晚上8点开始

words_no_filter
这/间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点/开始

words_no_stop_words
间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
答谢/宴于/晚上/8/点

words_all_filters
酒店/位于/北京/东三环/摆放/雕塑/文艺/气息
答谢/宴于/晚上

```
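
The first two steps (sentence splitting and stop-word removal) can be approximated without jieba. This is a simplified sketch: the default delimiter string is taken from the source docstrings, the two-word stop set is a toy subset inferred from the output above, and the function names are assumptions made for the example.

```python
def split_sentences(text, delimiters="?!;？！。；…\n"):
    """Split text on every delimiter and drop empty pieces."""
    pieces = [text]
    for sep in delimiters:
        pieces = [part for piece in pieces for part in piece.split(sep)]
    return [p.strip() for p in pieces if p.strip()]


def drop_stop_words(words, stop_words):
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words if w not in stop_words]


text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
sentences = split_sentences(text)

# Toy stop-word set (assumption); the package ships a much longer stopwords.txt.
stop_words = {"这", "开始"}
filtered = drop_stop_words(["答谢", "宴于", "晚上", "8", "点", "开始"], stop_words)
```

Commas are not in the delimiter list, which is why the first sentence above keeps its clauses together.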


## API
TODO.

For class implementations and function parameters, see the comments in the source code.

## License
[MIT](./LICENSE)











================================================
FILE: example/example01.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

text = codecs.open('../test/doc/01.txt', 'r', 'utf-8').read()
tr4w = TextRank4Keyword()

tr4w.analyze(text=text, lower=True, window=2)   # In py2, text must be a utf8-encoded str or a unicode object; in py3, utf8-encoded bytes or a str object

print( '关键词：' )
for item in tr4w.get_keywords(20, word_min_len=1):
    print(item.word, item.weight)

print()
print( '关键短语：' )
for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num= 2):
    print(phrase)

tr4s = TextRank4Sentence()
tr4s.analyze(text=text, lower=True, source = 'all_filters')

print()
print( '摘要：' )
for item in tr4s.get_key_sentences(num=3):
    print(item.index, item.weight, item.sentence)

================================================
FILE: example/example02.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function
import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
tr4w = TextRank4Keyword()

tr4w.analyze(text=text, lower=True, window=2)

print()
print('sentences:')
for s in tr4w.sentences:
    print(s)                 # unicode in py2, str in py3

print()
print('words_no_filter')
for words in tr4w.words_no_filter:
    print('/'.join(words))   # unicode in py2, str in py3

print()
print('words_no_stop_words')
for words in tr4w.words_no_stop_words:
    print('/'.join(words))   # unicode in py2, str in py3

print()
print('words_all_filters')
for words in tr4w.words_all_filters:
    print('/'.join(words))   # unicode in py2, str in py3

================================================
FILE: setup.py
================================================
# -*- coding: utf-8 -*-
from distutils.core import setup
LONGDOC = """
Please go to https://github.com/someus/TextRank4ZH for more info.
"""

setup(
    name='textrank4zh',
    version='0.3',
    description='Extract keywords and summaries from Chinese articles',
    long_description=LONGDOC,
    author='Letian Sun',
    author_email='sunlt1699@gmail.com',
    url='https://github.com/someus/TextRank4ZH',
    license="MIT",
    classifiers=[
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Natural Language :: Chinese (Simplified)',
        'Natural Language :: Chinese (Traditional)',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Topic :: Text Processing',
        'Topic :: Text Processing :: Linguistic',
    ],
    keywords='NLP,Chinese,Keywords extraction, Abstract extraction',
    install_requires=['jieba >= 0.35', 'numpy >= 1.7.1', 'networkx >= 1.9.1'],
    packages=['textrank4zh'],
    package_dir={'textrank4zh':'textrank4zh'},
    package_data={'textrank4zh':['*.txt',]},
)

================================================
FILE: test/Segmentation_test.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

import codecs
from textrank4zh import Segmentation

seg = Segmentation.Segmentation()

text = codecs.open('./doc/01.txt', 'r', 'utf-8', 'ignore').read()
text = "视频里，我们的杰宝热情地用英文和全场观众打招呼并清唱了一段《Heal The World》。我们的世界充满了未知数。"

result = seg.segment(text=text, lower=True)

for key in result:
    print(key)

print(20*'#')
for s in result['sentences']:
    print(s)

print(20*'*')
for s in result.sentences:
    print (s)

print()  # with print_function imported, a bare `print` is a no-op expression
for ss in result.words_no_filter:
    print( '  '.join(ss) )

print()
for ss in result.words_no_stop_words:
    print( ' / '.join(ss) )

print()
for ss in result.words_all_filters:
    print (' | '.join(ss) )


================================================
FILE: test/TextRank4Keyword_test.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

import codecs
from textrank4zh import TextRank4Keyword

text = codecs.open('./doc/02.txt', 'r', 'utf-8').read()
# text = "世界的美好。世界美国英国。 世界和平。"

tr4w = TextRank4Keyword()
tr4w.analyze(text=text,lower=True, window=3, pagerank_config={'alpha':0.85})

for item in tr4w.get_keywords(30, word_min_len=2):
    print(item.word, item.weight, type(item.word))

print('--phrase--')

for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num = 0):
    print(phrase, type(phrase))

================================================
FILE: test/TextRank4Sentence_test.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

import codecs
from textrank4zh import TextRank4Sentence

text = codecs.open('./doc/03.txt', 'r', 'utf-8').read()
text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
tr4s = TextRank4Sentence()
tr4s.analyze(text=text, lower=True, source = 'all_filters')

for st in tr4s.sentences:
    print(type(st), st)

print(20*'*')
for item in tr4s.get_key_sentences(num=4):
    print(item.weight, item.sentence, type(item.sentence))

================================================
FILE: test/codecs_test.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function


import codecs
text = codecs.open('./doc/01.txt', 'r', 'utf-8', 'ignore').read()
print( type(text) )  # in py2 is unicode, py3 is str

================================================
FILE: test/doc/01.txt
================================================
中新网北京12月1日电(记者 张曦) 30日晚，高圆圆和赵又廷在京举行答谢宴，诸多明星现身捧场，其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等。

30日中午，有媒体曝光高圆圆和赵又廷现身台北桃园机场的照片，照片中两人小动作不断，尽显恩爱。事实上，夫妻俩此行是回女方老家北京举办答谢宴。

群星捧场 谢娜张杰亮相

当晚不到7点，两人十指紧扣率先抵达酒店。这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。

高圆圆身穿粉色外套，看到大批记者在场露出娇羞神色，赵又廷则戴着鸭舌帽，十分淡定，两人快步走进电梯，未接受媒体采访。

随后，谢娜、何炅也一前一后到场庆贺，并对一对新人表示恭喜。接着蔡康永满脸笑容现身，他直言：“我没有参加台湾婚礼，所以这次觉得蛮开心。”

曾与赵又廷合作《狄仁杰之神都龙王》的导演徐克则携女助理亮相，面对媒体的长枪短炮，他只大呼“恭喜！恭喜！”

作为高圆圆的好友，黄轩虽然拍杂志收工较晚，但也赶过来参加答谢宴。问到给新人带什么礼物，他大方拉开外套，展示藏在包里厚厚的红包，并笑言：“封红包吧！”但不愿透露具体数额。

值得一提的是，当晚10点，张杰压轴抵达酒店，他戴着黑色口罩，透露因刚下飞机所以未和妻子谢娜同行。虽然他没有接受采访，但在进电梯后大方向媒体挥手致意。

《我们结婚吧》主创捧场

黄海波(微博)获释仍未出席

在电视剧《咱们结婚吧》里，饰演高圆圆母亲的张凯丽，当晚身穿黄色大衣出席，但只待了一个小时就匆忙离去。

同样有份参演该剧，并扮演高圆圆男闺蜜的大左(微信号：dazuozone) 也到场助阵，28日，他已在台湾参加两人的盛大婚礼。大左30日晚接受采访时直言当时场面感人，“每个人都哭得稀里哗啦，晚上是吴宗宪(微博)(微信号：wushowzongxian) 主持，现场欢声笑语，讲了好多不能播的事，新人都非常开心”。

最令人关注的是在这部剧里和高圆圆出演夫妻的黄海波。巧合的是，他刚好于30日收容教育期满，解除收容教育。

答谢宴细节

宾客近百人，获赠礼物

记者了解到，出席高圆圆、赵又廷答谢宴的宾客近百人，其中不少都是女方的高中同学。

答谢宴位于酒店地下一层，现场安保森严，大批媒体只好在酒店大堂等待。期间有工作人员上来送上喜糖，代两位新人向媒体问好。

记者注意到，虽然答谢宴于晚上8点开始，但从9点开始就陆续有宾客离开，每个宾客都手持礼物，有宾客大方展示礼盒，只见礼盒上印有两只正在接吻的烫金兔子，不过工作人员迅速赶来，拒绝宾客继续展示。

================================================
FILE: test/doc/02.txt
================================================
如何在美国把贪官送进监狱——
法律用美国的：被用来修理黑帮的美国联邦法律也能“顺便”对付中国贪官

在美国起诉中国贪官，不可能适用中国自己的法律。所以，必须得搞清楚外逃贪官触犯美国法律的证据。贪官明明是在中国国内贪腐，还能触犯到美国的法律？没错。

有先例可循。1994年—2001年，原中国银行开平支行三任行长许超凡、余振东和许国俊勾结贪污、挪用了4.85亿美元巨资。他们都逃向了美国。后来，余振东和二许分别在美国被起诉。以“二许案”为例，这两个巨贪触犯了多项美国联邦刑法，首当其冲的是《反勒索及受贿组织法》。该法律是美国在上个世纪70年代通过的，当时的立意是对付各种黑帮。由于黑帮犯罪常常是一套完整的步骤，所以这个法案把有组织犯罪作为一条完整的“产业链”做考虑。具体来说，二许在中国国内贪污后，后续有一系列涉及到美国的行为——通过各种办法洗钱；把赃款转移到美国；为转移非法所得开设空壳公司……这三个人用了拉斯维加斯的赌场洗钱。所以，最后都是在拉斯维加斯所在的内华达州被美联邦法院审判。

除了“有组织犯罪”相关法条外，洗钱、伪造签证等贪官可能涉及到的触及美国法律行为都能被提起控诉。这里的美国法律主要指的是美国联邦法律，而不是州法律，所以这些贪官也是被联邦警察给抓获的。

总之，在国内的贪污行为是“上游”，中国的检察官们没可能因为这些发生在中国的“上游”事件要求美国法院给中国贪官定罪；而美国法官也不可能运用中国的法律来做判决。不过，把赃款和人转移出去这个“下游”过程是有很大部分是在美国发生的，可能触犯到各种美国法律。严格说起来，要想在美国对贪官们治罪，那么得找到他们的贪污关联行为触犯到美国法律的证据。
在美国坐牢的贪官许超凡和许国俊曾经通过拉斯维加斯的赌场洗钱在美国坐牢的贪官许超凡和许国俊曾经通过拉斯维加斯的赌场洗钱
一般也在美国蹲监狱：美国法律对付中国贪官并不手软，他们可能被判得很重

既然是运用美国法律判的，也得在美国服刑。而大家会担心，会不会“有组织犯罪”等罪名对付中国贪官太过温和、间接，对他们下手轻呢？其实不会。还是说“二许案”，他们一个判25年，另一个是22年。因为所涉及的犯罪基本在美国都是重罪。2009年，法制日报的报道《中行开平案八年追诉始末》分析道，“‘二许’此次在美国所获刑期，均已经超出我国刑法有期徒刑的最高量刑标准。”

在二许坐满牢之后，他们面临着被美国驱逐出境。他们都是通过欺诈的手段获得了美国的签证。
不过也有办法把贪官“换回”中国坐牢：中、美和嫌犯三方达成协议
余振东被遣返回中国后被判处12年有期徒刑。余振东被遣返回中国后被判处12年有期徒刑。

前文提到的余振东案里，余在美国被判入监144个月，但他现在处于中国的监狱中。这又是怎么回事呢？原来，余振东表示自愿接受遣返。余振东向美国方面递交《递解出境司法命令和放弃听证约定申请书》，承认自己在美所犯罪行应导致递解出境的法律后果，并且明确指定中国为其递解出境的接收国。当然，这种“自愿”是有前提条件的。因为中国也向美国的司法机关作出承诺，余振东在中国国内被宣判的刑期不会长于美国。而也因为余振东的自愿认罪，美国司法机关对他的判罚从轻。
还可以追究共同犯罪的贪官家属刑责：贪官的家属倘若一起触犯了美国法律，也得受罚
贪官背后往往有“贪内助”，而参与了犯罪的贪官家属也可能在美国被起诉贪官背后往往有“贪内助”，而参与了犯罪的贪官家属也可能在美国被起诉

“二许案”中一共有四个人被追究刑责。因为，两个贪官的太太也没少参与触犯到美国法律的洗钱等行为。她们分别被判处监禁8年。除了大家能够想到的洗钱等常规动作外，这两对巨贪夫妻很“奇葩”的一点是，两位妻子先通过和美国人假结婚获得了美国公民资格。没有后顾之忧，真丈夫们开始疯狂地转移资金。等到逃跑时候，男人们也运用了假结婚的方式。所以在“二许案”的指控中，有一项是“护照、签证欺诈”。
要做到以上这些，重要的还是中国官方的努力，争取美国的积极合作

看起来，好像动用美国的司法体系来追究中国贪官的刑责并不难，也只需要美国方面的努力。那么，这是一条追诉逃美贪官的康庄大道？当然不是这样的。否则不会在2009年“二许案”宣判之后，暂时没有再出现过这样的案例。余振东案和“二许案”在当年都轰动一时，关系重大。因此是被当作大案要案在办。其时，恰逢中国和美国签署了《刑事司法互助协定》不久。所以这三个金融系统大蛀虫首当其冲被起诉了——几个巨贪被美国联邦警察逮捕就是中方努力的结果。中国也向美方提供了大量的证据，证明钱财是非法所得。找出财产转移链、挖出洗钱的细节……种种犯罪事实都需要经过繁琐的查证。另一方面，美国办案子也需要付出大量的司法成本，所以不能希冀美国的司法部门多么主动地去发现中国外逃贪官。

当然，时代在前行。随着国内反腐的高涨，海外追逃也越来越得到重视。这次外交部条约法律司司长徐宏的发言，也给了大家一个期许。
美国司法部关于“二许案”的通告美国司法部关于“二许案”的通告
如何在美国打官司，向贪官要回钱——
拿着刑事判决的结果来打民事官司追款相对容易

对于民事诉讼追赃，《联合国反腐败公约》里有制度支持。而相对容易的一种形式就是拿着刑事判决去追赃。刑事判决对于财产的非法性是强有力的证明。所以，“二许案”后，中国银行在美国当地提起诉讼，追回了一些财产。

尽管“二许案”的许多赃款并没有转移到美国，而是在加拿大，美国的这份刑事判决也有助于“苦主”在加拿大追偿。就在今年11月24日，加拿大的大不列颠哥伦比亚BC省法院正式开庭审理中国银行向许超凡妻子和母亲追赃的民事诉讼。

一些学者认为，中国国内的刑事判决也是有助于发起民事诉讼法追赃的。不过一个现实是，中国的刑法不允许“缺席审判”，贪官不到位就没法动了，因此许多学者也提出中国应该建立起相关的制度来。
不管刑事，直接打民事官司也可以，就是费时、费力、费钱

民事诉讼相对于刑事诉讼来说要容易得多。所以被认为是一个非常好的向外逃贪官追责的路径。追回贪官的赃款，既挽回损失，还能够断了贪官的财源。

当然，以上都是最理想的说法。实际情况难多了。所以公开报道的海外成功追赃的民事诉讼案例真是屈指可数。在美国，目前唯一公开的一起是前述的中国银行向“二许”追赃。但是情况特殊，并且真正的大头在加拿大，所以参考性不强。倒是有一起不在美国，在澳大利亚的民事诉讼追赃可以做参考。被诉方是原北京市城乡建设集团副总经理李化学。只是过程非常艰辛曲折，为了顺利起诉，中方不得不聘请了一名当地律师。付出和回报存疑。办案人员彭唯良检察官的原话是：“在国外打官司，经济上必须有坚强的后盾来支持。另外，由于语言上的障碍，有些我们想通过律师要达到的目的，律师不太了解，返工的次数比较多。”

当然，这里需要说明的一点是，民事诉讼的主体也不宜是中国政府，而是具体的单位。所以在北京城乡集团的案子里，尽管检察官们为民事诉讼付出了大量的努力，但还得找来单位做原告。
目前经验看，最省事、有效的是争取到美国司法部的最大限度合作
陈水扁用非法所得在美国购买的房产陈水扁用非法所得在美国购买的房产

说一个台湾地区的例子。陈水扁弊案爆发后，被美国司法部发现，陈家用“不法所得”在美国购买了两处房产。后来，由美国司法部出面进行没收。美国司法部也需要向法院提出诉讼。这其实是属于美国的一个“腐败政府国家资产追回”计划。这个案子最后的结果是，法院支持了美国司法部的请求，陈家房产被拍卖。而根据相关法律，拍卖所得美国是有权分得一部分的。

由美国司法部出面提起诉讼，恐怕是最好的办法了。而这需要两点：第一，还是追赃国的申请和完整的证据；第二，则涉及到一个积极性问题。对追缴财产进行分享也是国际上一个比较流行的做法，可以大限度地调动赃款流向国的积极性。不失为一个参考。
结语
看来，在美国起诉贪官确实可行。但是，不管是追人还是追钱，都存在一个和美国的紧密合作问题，不然也是白搭。


================================================
FILE: test/doc/03.txt
================================================
据BI消息，Netflix 正准备在本月上线其最新的原创剧集《马可波罗》。而据纽约时报报道，《马可波罗》第一季 10 集的总投资高达 9000 万美元，这不仅创下了 Netflix 的最高电视剧投资记录，在全球电视剧制作成本的排名中也是数一数二的，仅次于 HBO 原创的《权利的游戏》。

《马可波罗》在意大利、哈萨克斯坦、马来西亚等多国取景拍摄，数百名演员来自多个国家，电视剧把传奇冒险、战争、武术、性诱惑、政治阴谋等元素都融了进去，看起来会包含不少大家喜闻乐见的题材。Netflix 也为《马可波罗》的播出制定了庞大的市场营销计划。比如，Netflix 将携主要演员参加巴西的圣地亚哥国际动漫展，另外也会在墨西哥的一个大型购物中心展示《马可波罗》演出所用的服装和道具。

Netflix 怎么会如此大手笔？毫无疑问 Netflix 对这部剧寄予了重望——Netflix 的海外市场。Netflix 现在已经进入全球 50 多个国家和地区，付费订户高达 5000 万人。由于在美国本土的增长开始下滑，寻求海外增长机会成为 Netflix 的当务之急。除了四处购买电影电视剧的全球版权等海外市场豪赌，Netflix 在影视内容上也有一场豪赌——鸿篇巨制的《马可波罗》。此剧由独立片商威斯坦公司制作，Netflix 掌握全球版权，将从 12 月 12 日开始在 Netflix 面向全球订户提供点播。

Netflix 早期靠《纸牌屋》和《女子监狱》等原创剧一鸣惊人，但可惜的是，Netflix 并没有掌握《纸牌屋》等剧的海外版权。比如在德国和法国，观众可以在电视频道上收看该剧。不过，《纸牌屋》的成功仍然帮助 Netflix 提升了知名度。目前，Netflix 正在筹拍中的原创剧有不少，《马可波罗》成功与否可以在一定程度上验证 Netflix 的原创剧战略是否在海外市场是否有效。

不过，一些媒体行业分析师预测，Netflix 的国际化将会遇到各国本地视频网站的狙击。此外 HBO 也是其最强劲的竞争对手。众所周知，HBO 在国际化方面已经先行一步。比如在中国市场，HBO 就刚刚与腾讯视频签订了独家播放权。而迄今为止，Netflix 尚未在亚洲任何一个国家展开业务。


================================================
FILE: test/doc/04.txt
================================================
京菜擅长烤、爆、烧、焖、涮，听起来豪爽，吃起来痛快。北京烤鸭是来京游玩必食的美味；西四缸瓦市一家名叫砂锅居的老店所烧的砂锅白肉名满京城，相传他们用的原汤已有二三百年历史；涮羊肉是最受北京人欢迎的冬令美食，其中阳坊涮肉连锁店以价格便宜，味道正宗而倍受青睐。可以以半份起卖，在全市有许多分店。除此之外，还有东来顺、又一顺、能仁居的涮羊肉名气也很大。
北京风味小吃有600多年历史，包括汉民风味小吃、回民风味小吃和宫廷风味小吃等300多种。
北京的各大饭店历来是名厨荟萃，如北京饭店的谭家菜、建国饭店的法式西餐都是别处不易享用到的佳肴；北京还有正宗的法式、美式、意式、俄式餐厅和日本料理、韩国烧烤以及越南、印尼、泰国风味的菜馆。若为省时实惠，还可以光顾街头小店，这里不乏北京特有的包子、饺子、面条及家常炒菜，当然，环境就不如大餐馆讲究了。 
东直门内大街原是北京最富特色的餐饮一条街，大街南北两侧云集了各种风味的餐馆，多为24小时营业。但现在因为拆迁，这条路上的餐饮店多已搬迁。 

================================================
FILE: test/doc/05.txt
================================================
支持向量机（英语：Support Vector Machine，常简称为SVM）是一种监督式学习的方法，可广泛地应用于统计分类以及回归分析。

支持向量机属于一般化线性分类器，也可以被认为是提克洛夫规范化（Tikhonov Regularization）方法的一个特例。这族分类器的特点是他们能够同时最小化经验误差与最大化几何边缘区，因此支持向量机也被称为最大边缘区分类器。

支持向量机构造一个超平面或者多个超平面，这些超平面可能是高维的，甚至可能是无限多维的。在分类任务中，它的原理是，将决策面（超平面）放置在这样的一个位置，两类中接近这个位置的点距离的都最远。我们来考虑两类线性可分问题，如果要在两个类之间画一条线，那么按照支持向量机的原理，我们会先找两类之间最大的空白间隔，然后在空白间隔的中点画一条线，这条线平行于空白间隔。通过核函数，可以使得支持向量机对非线性可分的任务进行分类。一个极好的指南是C.J.C Burges的《模式识别支持向量机指南》。van der Walt和Barnard将支持向量机和其他分类器进行了比较。

================================================
FILE: test/jieba_test.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function

import sys
try:
    reload(sys)
    sys.setdefaultencoding('utf-8')
except:
    pass

import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门.。；‘你的#")
for w in words:
    # print(w.word)
    print('{0} {1}'.format(w.word, w.flag))
    print(type(w.word))  # in py2 is unicode, py3 is str



================================================
FILE: test/util_test.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import print_function

from textrank4zh import util

def testAttrDict():
    r = util.AttrDict(a=2)
    print( r )
    print( r.a )
    print( r['a'] )

def testCombine():
    print(20*'*')
    for item in util.combine(['a', 'b', 'c', 'd'], 2):
        print(item)
    print()
    for item in util.combine(['a', 'b', 'c', 'd'], 3):
        print (item)

def testDebug():
    import sys
    print(sys.getdefaultencoding())
    util.debug('你好')
    util.debug(u'世界')


if __name__ == "__main__":
    testAttrDict()
    testCombine()
    testDebug()


================================================
FILE: textrank4zh/Segmentation.py
================================================
#-*- encoding:utf-8 -*-
"""
@author:   letian
@homepage: http://www.letiantian.me
@github:   https://github.com/someus/
"""
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import jieba.posseg as pseg
import codecs
import os

from . import util

def get_default_stop_words_file():
    d = os.path.dirname(os.path.realpath(__file__))
    return os.path.join(d, 'stopwords.txt')

class WordSegmentation(object):
    """ Word segmentation """

    def __init__(self, stop_words_file = None, allow_speech_tags = util.allow_speech_tags):
        """
        Keyword arguments:
        stop_words_file    -- path to the stop-word file, utf8-encoded, one stop word per line. If not of type str, the default stop words are used
        allow_speech_tags  -- list of part-of-speech tags used for filtering
        """
        
        allow_speech_tags = [util.as_text(item) for item in allow_speech_tags]

        self.default_speech_tag_filter = allow_speech_tags
        self.stop_words = set()
        self.stop_words_file = get_default_stop_words_file()
        if type(stop_words_file) is str:
            self.stop_words_file = stop_words_file
        for word in codecs.open(self.stop_words_file, 'r', 'utf-8', 'ignore'):
            self.stop_words.add(word.strip())
    
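    def segment(self, text, lower = True, use_stop_words = True, use_speech_tags_filter = False):
        """Segment a piece of text into words and return the result as a list.

        Keyword arguments:
        lower                  -- whether to lowercase words (for English)
        use_stop_words         -- if True, filter with the stop-word set (remove stop words)
        use_speech_tags_filter -- whether to filter by part of speech. If True, filter with self.default_speech_tag_filter; otherwise, do not filter.
        """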
        text = util.as_text(text)
        jieba_result = pseg.cut(text)
        
        if use_speech_tags_filter == True:
            jieba_result = [w for w in jieba_result if w.flag in self.default_speech_tag_filter]
        else:
            jieba_result = [w for w in jieba_result]

        # Remove special symbols (jieba tags them with the flag 'x')
        word_list = [w.word.strip() for w in jieba_result if w.flag!='x']
        word_list = [word for word in word_list if len(word)>0]
        
        if lower:
            word_list = [word.lower() for word in word_list]

        if use_stop_words:
            word_list = [word.strip() for word in word_list if word.strip() not in self.stop_words]

        return word_list
        
    def segment_sentences(self, sentences, lower=True, use_stop_words=True, use_speech_tags_filter=False):
        """Convert each element (sentence) of the list `sentences` into a list of words.

        sentences -- list in which each element is a sentence (a string)
        """
        
        res = []
        for sentence in sentences:
            res.append(self.segment(text=sentence, 
                                    lower=lower, 
                                    use_stop_words=use_stop_words, 
                                    use_speech_tags_filter=use_speech_tags_filter))
        return res
        
class SentenceSegmentation(object):
    """ Sentence segmentation """

    def __init__(self, delimiters=util.sentence_delimiters):
        """
        Keyword arguments:
        delimiters -- iterable of delimiters used to split the text into sentences
        """
        self.delimiters = set([util.as_text(item) for item in delimiters])
    
    def segment(self, text):
        res = [util.as_text(text)]
        
        util.debug(res)
        util.debug(self.delimiters)

        for sep in self.delimiters:
            text, res = res, []
            for seq in text:
                res += seq.split(sep)
        res = [s.strip() for s in res if len(s.strip()) > 0]
        return res 
        
class Segmentation(object):
    
    def __init__(self, stop_words_file = None, 
                    allow_speech_tags = util.allow_speech_tags,
                    delimiters = util.sentence_delimiters):
        """
        Keyword arguments:
        stop_words_file -- stop-word file
        delimiters      -- set of symbols used to split the text into sentences
        """
        self.ws = WordSegmentation(stop_words_file=stop_words_file, allow_speech_tags=allow_speech_tags)
        self.ss = SentenceSegmentation(delimiters=delimiters)
        
    def segment(self, text, lower = False):
        text = util.as_text(text)
        sentences = self.ss.segment(text)
        words_no_filter = self.ws.segment_sentences(sentences=sentences, 
                                                    lower = lower, 
                                                    use_stop_words = False,
                                                    use_speech_tags_filter = False)
        words_no_stop_words = self.ws.segment_sentences(sentences=sentences, 
                                                    lower = lower, 
                                                    use_stop_words = True,
                                                    use_speech_tags_filter = False)

        words_all_filters = self.ws.segment_sentences(sentences=sentences, 
                                                    lower = lower, 
                                                    use_stop_words = True,
                                                    use_speech_tags_filter = True)

        return util.AttrDict(
                    sentences           = sentences, 
                    words_no_filter     = words_no_filter, 
                    words_no_stop_words = words_no_stop_words, 
                    words_all_filters   = words_all_filters
                )
    
        

if __name__ == '__main__':
    pass

================================================
FILE: textrank4zh/TextRank4Keyword.py
================================================
#-*- encoding:utf-8 -*-
"""
@author:   letian
@homepage: http://www.letiantian.me
@github:   https://github.com/someus/
"""
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import networkx as nx
import numpy as np

from . import util
from .Segmentation import Segmentation

class TextRank4Keyword(object):
    
    def __init__(self, stop_words_file = None, 
                 allow_speech_tags = util.allow_speech_tags, 
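                 delimiters = util.sentence_delimiters):
        """
        Keyword arguments:
        stop_words_file  --  str, path to the stop-word file (one stop word per line); for any other type, the default stop-word file is used
        delimiters       --  defaults to `?!;？！。；…\n`; used to split the text into sentences.

        Object Var:
        self.words_no_filter      --  two-level list obtained by segmenting each sentence in sentences into words.
        self.words_no_stop_words  --  two-level list obtained by removing stop words from words_no_filter.
        self.words_all_filters    --  two-level list obtained by keeping only the words in words_no_stop_words that have the specified parts of speech.
        """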
        self.text = ''
        self.keywords = None
        
        self.seg = Segmentation(stop_words_file=stop_words_file, 
                                allow_speech_tags=allow_speech_tags, 
                                delimiters=delimiters)

        self.sentences = None
        self.words_no_filter = None     # two-level list
        self.words_no_stop_words = None
        self.words_all_filters = None
        
    def analyze(self, text, 
                window = 2, 
                lower = False,
                vertex_source = 'all_filters',
                edge_source = 'no_stop_words',
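                pagerank_config = {'alpha': 0.85,}):
        """Analyze the text.

        Keyword arguments:
        text       --  the text content, a string.
        window     --  window size, int, used to build the edges between words. Defaults to 2.
        lower      --  whether to convert the text to lowercase. Defaults to False.
        vertex_source   --  which of words_no_filter, words_no_stop_words, words_all_filters is used to build the nodes of the PageRank graph.
                            Defaults to `'all_filters'`; allowed values are `'no_filter', 'no_stop_words', 'all_filters'`. Keywords are also drawn from `vertex_source`.
        edge_source     --  which of words_no_filter, words_no_stop_words, words_all_filters is used to build the edges between nodes of the PageRank graph.
                            Defaults to `'no_stop_words'`; allowed values are `'no_filter', 'no_stop_words', 'all_filters'`. Edges are built in combination with the `window` parameter.
        """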
        
        # self.text = util.as_text(text)
        self.text = text
        self.word_index = {}
        self.index_word = {}
        self.keywords = []
        self.graph = None
        
        result = self.seg.segment(text=text, lower=lower)
        self.sentences = result.sentences
        self.words_no_filter = result.words_no_filter
        self.words_no_stop_words = result.words_no_stop_words
        self.words_all_filters   = result.words_all_filters

        util.debug(20*'*')
        util.debug('self.sentences in TextRank4Keyword:\n', ' || '.join(self.sentences))
        util.debug('self.words_no_filter in TextRank4Keyword:\n', self.words_no_filter)
        util.debug('self.words_no_stop_words in TextRank4Keyword:\n', self.words_no_stop_words)
        util.debug('self.words_all_filters in TextRank4Keyword:\n', self.words_all_filters)


        options = ['no_filter', 'no_stop_words', 'all_filters']

        if vertex_source in options:
            _vertex_source = result['words_'+vertex_source]
        else:
            _vertex_source = result['words_all_filters']

        if edge_source in options:
            _edge_source   = result['words_'+edge_source]
        else:
            _edge_source   = result['words_no_stop_words']

        self.keywords = util.sort_words(_vertex_source, _edge_source, window = window, pagerank_config = pagerank_config)

    def get_keywords(self, num = 6, word_min_len = 1):
        """Return the num most important keywords whose length is at least word_min_len.

        Return:
        List of keywords.
        """
        result = []
        count = 0
        for item in self.keywords:
            if count >= num:
                break
            if len(item.word) >= word_min_len:
                result.append(item)
                count += 1
        return result
    
    def get_keyphrases(self, keywords_num = 12, min_occur_num = 2):
        """Return keyphrases.
        Builds candidate phrases from the top keywords_num keywords and keeps those that occur at least min_occur_num times in the original text.

        Return:
        List of keyphrases.
        """
        keywords_set = set([ item.word for item in self.get_keywords(num=keywords_num, word_min_len = 1)])
        keyphrases = set()
        for sentence in self.words_no_filter:
            one = []
            for word in sentence:
                if word in keywords_set:
                    one.append(word)
                else:
                    if len(one) >  1:
                        keyphrases.add(''.join(one))
                    if len(one) == 0:
                        continue
                    else:
                        one = []
            # Flush the run left over at the end of the sentence
            if len(one) >  1:
                keyphrases.add(''.join(one))

        return [phrase for phrase in keyphrases 
                if self.text.count(phrase) >= min_occur_num]

if __name__ == '__main__':
    pass

================================================
FILE: textrank4zh/TextRank4Sentence.py
================================================
#-*- encoding:utf-8 -*-
"""
@author:   letian
@homepage: http://www.letiantian.me
@github:   https://github.com/someus/
"""
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import networkx as nx
import numpy as np

from . import util
from .Segmentation import Segmentation

class TextRank4Sentence(object):
    
    def __init__(self, stop_words_file = None, 
                 allow_speech_tags = util.allow_speech_tags,
                 delimiters = util.sentence_delimiters):
        """
        Keyword arguments:
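        stop_words_file    --  str, path to the stop-word file; if it is not a str, the default stop-word file is used.
        allow_speech_tags  --  list of part-of-speech tags to keep when filtering words.
        delimiters         --  characters used to split the text into sentences; defaults to `?!;？！。；…\n`.
        
        Object Var:
        self.sentences               --  list of sentences.
        self.words_no_filter         --  two-level list obtained by segmenting each sentence in sentences into words.
        self.words_no_stop_words     --  two-level list obtained by removing the stop words from words_no_filter.
        self.words_all_filters       --  two-level list obtained by keeping only the words in words_no_stop_words that have an allowed part-of-speech tag.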
        """
        self.seg = Segmentation(stop_words_file=stop_words_file,
                                allow_speech_tags=allow_speech_tags,
                                delimiters=delimiters)
        
        self.sentences = None
        self.words_no_filter = None     # 2-D list
        self.words_no_stop_words = None
        self.words_all_filters = None
        
        self.key_sentences = None
        
    def analyze(self, text, lower = False, 
              source = 'no_stop_words', 
              sim_func = util.get_similarity,
              pagerank_config = {'alpha': 0.85,}):
        """
        Keyword arguments:
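        text                 --  the text to analyze, a string.
        lower                --  whether to convert the text to lowercase. Defaults to False.
        source               --  which of words_no_filter, words_no_stop_words, words_all_filters is used to compute the similarity between sentences.
                                 Defaults to `'no_stop_words'`; valid values are `'no_filter', 'no_stop_words', 'all_filters'`. Any other value falls back to `'no_stop_words'`.
        sim_func             --  the function used to compute sentence similarity.
        pagerank_config      --  settings passed to PageRank.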
        """
        
        self.key_sentences = []
        
        result = self.seg.segment(text=text, lower=lower)
        self.sentences = result.sentences
        self.words_no_filter = result.words_no_filter
        self.words_no_stop_words = result.words_no_stop_words
        self.words_all_filters   = result.words_all_filters

        options = ['no_filter', 'no_stop_words', 'all_filters']
        if source in options:
            _source = result['words_'+source]
        else:
            _source = result['words_no_stop_words']

        self.key_sentences = util.sort_sentences(sentences = self.sentences,
                                                 words     = _source,
                                                 sim_func  = sim_func,
                                                 pagerank_config = pagerank_config)

            
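    def get_key_sentences(self, num = 6, sentence_min_len = 6):
        """Return the top `num` sentences whose length is at least `sentence_min_len`, for building a summary.

        Return:
        A list of sentence items, each with `index`, `sentence` and `weight`.
        """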
        result = []
        count = 0
        for item in self.key_sentences:
            if count >= num:
                break
            if len(item['sentence']) >= sentence_min_len:
                result.append(item)
                count += 1
        return result
    

if __name__ == '__main__':
    pass
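
What `analyze` delegates to `util.sort_sentences` can be illustrated with a dependency-free sketch: score every pair of sentences, then run a PageRank-style power iteration over the resulting weighted graph. This is not the package's implementation, which calls `networkx.pagerank`. The names `similarity` and `rank_sentences`, the fixed iteration count, and the toy word lists are assumptions, and unlike the package code the sketch does not score a sentence against itself.

```python
import math


def similarity(words1, words2):
    """Distinct shared words divided by log(len1) + log(len2)."""
    shared = len(set(words1) & set(words2))
    if shared == 0:
        return 0.0
    denominator = math.log(len(words1)) + math.log(len(words2))
    if abs(denominator) < 1e-12:
        return 0.0
    return shared / denominator


def rank_sentences(word_lists, alpha=0.85, iterations=50):
    """Return one PageRank-style score per sentence (hypothetical helper)."""
    n = len(word_lists)
    if n == 0:
        return []
    # Symmetric similarity matrix with a zero diagonal.
    weights = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            weights[x][y] = weights[y][x] = similarity(word_lists[x], word_lists[y])
    strength = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences with no similar neighbour spread their score uniformly.
        dangling = sum(scores[j] for j in range(n) if strength[j] == 0)
        base = (1.0 - alpha) / n + alpha * dangling / n
        scores = [base + alpha * sum(weights[j][i] / strength[j] * scores[j]
                                     for j in range(n) if strength[j] > 0)
                  for i in range(n)]
    return scores


word_lists = [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['x', 'y', 'z']]
for index, score in enumerate(rank_sentences(word_lists)):
    print(index, round(score, 4))
```

In this toy input the second sentence overlaps most with the others and ends up with the highest score, while the last sentence shares no words with any other and ends up with the lowest.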

================================================
FILE: textrank4zh/__init__.py
================================================
#-*- encoding:utf-8 -*-
from __future__ import absolute_import
from .TextRank4Keyword import TextRank4Keyword
from .TextRank4Sentence import TextRank4Sentence
from . import Segmentation
from . import util

version = '0.2'

================================================
FILE: textrank4zh/stopwords.txt
================================================
?
、
。
“
”
《
》
！
，
：
；
？
啊
阿
哎
哎呀
哎哟
唉
俺
俺们
按
按照
吧
吧哒
把
罢了
被
本
本着
比
比方
比如
鄙人
彼
彼此
边
别
别的
别说
并
并且
不比
不成
不单
不但
不独
不管
不光
不过
不仅
不拘
不论
不怕
不然
不如
不特
不惟
不问
不只
朝
朝着
趁
趁着
乘
冲
除
除此之外
除非
除了
此
此间
此外
从
从而
打
待
但
但是
当
当着
到
得
的
的话
等
等等
地
第
叮咚
对
对于
多
多少
而
而况
而且
而是
而外
而言
而已
尔后
反过来
反过来说
反之
非但
非徒
否则
嘎
嘎登
该
赶
个
各
各个
各位
各种
各自
给
根据
跟
故
故此
固然
关于
管
归
果然
果真
过
哈
哈哈
呵
和
何
何处
何况
何时
嘿
哼
哼唷
呼哧
乎
哗
还是
还有
换句话说
换言之
或
或是
或者
极了
及
及其
及至
即
即便
即或
即令
即若
即使
几
几时
己
既
既然
既是
继而
加之
假如
假若
假使
鉴于
将
较
较之
叫
接着
结果
借
紧接着
进而
尽
尽管
经
经过
就
就是
就是说
据
具体地说
具体说来
开始
开外
靠
咳
可
可见
可是
可以
况且
啦
来
来着
离
例如
哩
连
连同
两者
了
临
另
另外
另一方面
论
嘛
吗
慢说
漫说
冒
么
每
每当
们
莫若
某
某个
某些
拿
哪
哪边
哪儿
哪个
哪里
哪年
哪怕
哪天
哪些
哪样
那
那边
那儿
那个
那会儿
那里
那么
那么些
那么样
那时
那些
那样
乃
乃至
呢
能
你
你们
您
宁
宁可
宁肯
宁愿
哦
呕
啪达
旁人
呸
凭
凭借
其
其次
其二
其他
其它
其一
其余
其中
起
起见
起见
岂但
恰恰相反
前后
前者
且
然而
然后
然则
让
人家
任
任何
任凭
如
如此
如果
如何
如其
如若
如上所述
若
若非
若是
啥
上下
尚且
设若
设使
甚而
甚么
甚至
省得
时候
什么
什么样
使得
是
是的
首先
谁
谁知
顺
顺着
似的
虽
虽然
虽说
虽则
随
随着
所
所以
他
他们
他人
它
它们
她
她们
倘
倘或
倘然
倘若
倘使
腾
替
通过
同
同时
哇
万一
往
望
为
为何
为了
为什么
为着
喂
嗡嗡
我
我们
呜
呜呼
乌乎
无论
无宁
毋宁
嘻
吓
相对而言
像
向
向着
嘘
呀
焉
沿
沿着
要
要不
要不然
要不是
要么
要是
也
也罢
也好
一
一般
一旦
一方面
一来
一切
一样
一则
依
依照
矣
以
以便
以及
以免
以至
以至于
以致
抑或
因
因此
因而
因为
哟
用
由
由此可见
由于
有
有的
有关
有些
又
于
于是
于是乎
与
与此同时
与否
与其
越是
云云
哉
再说
再者
在
在下
咱
咱们
则
怎
怎么
怎么办
怎么样
怎样
咋
照
照着
者
这
这边
这儿
这个
这会儿
这就是说
这里
这么
这么点儿
这么些
这么样
这时
这些
这样
正如
吱
之
之类
之所以
之一
只是
只限
只要
只有
至
至于
诸位
着
着呢
自
自从
自个儿
自各儿
自己
自家
自身
综上所述
总的来看
总的来说
总的说来
总而言之
总之
纵
纵令
纵然
纵使
遵照
作为
兮
呃
呗
咚
咦
喏
啐
喔唷
嗬
嗯
嗳
a
able
about
above
abroad
according
accordingly
across
actually
adj
after
afterwards
again
against
ago
ahead
ain't
all
allow
allows
almost
alone
along
alongside
already
also
although
always
am
amid
amidst
among
amongst
an
and
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
around
as
a's
aside
ask
asking
associated
at
available
away
awfully
b
back
backward
backwards
be
became
because
become
becomes
becoming
been
before
beforehand
begin
behind
being
believe
below
beside
besides
best
better
between
beyond
both
brief
but
by
c
came
can
cannot
cant
can't
caption
cause
causes
certain
certainly
changes
clearly
c'mon
co
co.
com
come
comes
concerning
consequently
consider
considering
contain
containing
contains
corresponding
could
couldn't
course
c's
currently
d
dare
daren't
definitely
described
despite
did
didn't
different
directly
do
does
doesn't
doing
done
don't
down
downwards
during
e
each
edu
eg
eight
eighty
either
else
elsewhere
end
ending
enough
entirely
especially
et
etc
even
ever
evermore
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
f
fairly
far
farther
few
fewer
fifth
first
five
followed
following
follows
for
forever
former
formerly
forth
forward
found
four
from
further
furthermore
g
get
gets
getting
given
gives
go
goes
going
gone
got
gotten
greetings
h
had
hadn't
half
happens
hardly
has
hasn't
have
haven't
having
he
he'd
he'll
hello
help
hence
her
here
hereafter
hereby
herein
here's
hereupon
hers
herself
he's
hi
him
himself
his
hither
hopefully
how
howbeit
however
hundred
i
i'd
ie
if
ignored
i'll
i'm
immediate
in
inasmuch
inc
inc.
indeed
indicate
indicated
indicates
inner
inside
insofar
instead
into
inward
is
isn't
it
it'd
it'll
its
it's
itself
i've
j
just
k
keep
keeps
kept
know
known
knows
l
last
lately
later
latter
latterly
least
less
lest
let
let's
like
liked
likely
likewise
little
look
looking
looks
low
lower
ltd
m
made
mainly
make
makes
many
may
maybe
mayn't
me
mean
meantime
meanwhile
merely
might
mightn't
mine
minus
miss
more
moreover
most
mostly
mr
mrs
much
must
mustn't
my
myself
n
name
namely
nd
near
nearly
necessary
need
needn't
needs
neither
never
neverf
neverless
nevertheless
new
next
nine
ninety
no
nobody
non
none
nonetheless
noone
no-one
nor
normally
not
nothing
notwithstanding
novel
now
nowhere
o
obviously
of
off
often
oh
ok
okay
old
on
once
one
ones
one's
only
onto
opposite
or
other
others
otherwise
ought
oughtn't
our
ours
ourselves
out
outside
over
overall
own
p
particular
particularly
past
per
perhaps
placed
please
plus
possible
presumably
probably
provided
provides
q
que
quite
qv
r
rather
rd
re
really
reasonably
recent
recently
regarding
regardless
regards
relatively
respectively
right
round
s
said
same
saw
say
saying
says
second
secondly
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sensible
sent
serious
seriously
seven
several
shall
shan't
she
she'd
she'll
she's
should
shouldn't
since
six
so
some
somebody
someday
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specified
specify
specifying
still
sub
such
sup
sure
t
take
taken
taking
tell
tends
th
than
thank
thanks
thanx
that
that'll
thats
that's
that've
the
their
theirs
them
themselves
then
thence
there
thereafter
thereby
there'd
therefore
therein
there'll
there're
theres
there's
thereupon
there've
these
they
they'd
they'll
they're
they've
thing
things
think
third
thirty
this
thorough
thoroughly
those
though
three
through
throughout
thru
thus
till
to
together
too
took
toward
towards
tried
tries
truly
try
trying
t's
twice
two
u
un
under
underneath
undoing
unfortunately
unless
unlike
unlikely
until
unto
up
upon
upwards
us
use
used
useful
uses
using
usually
v
value
various
versus
very
via
viz
vs
w
want
wants
was
wasn't
way
we
we'd
welcome
well
we'll
went
were
we're
weren't
we've
what
whatever
what'll
what's
what've
when
whence
whenever
where
whereafter
whereas
whereby
wherein
where's
whereupon
wherever
whether
which
whichever
while
whilst
whither
who
who'd
whoever
whole
who'll
whom
whomever
who's
whose
why
will
willing
wish
with
within
without
wonder
won't
would
wouldn't
x
y
yes
yet
you
you'd
you'll
your
you're
yours
yourself
yourselves
you've
z
zero

================================================
FILE: textrank4zh/util.py
================================================
#-*- encoding:utf-8 -*-
"""
@author:   letian
@homepage: http://www.letiantian.me
@github:   https://github.com/someus/
"""
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import os
import math
import networkx as nx
import numpy as np
import sys

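try:
    # Python 2 only: make utf-8 the default encoding.
    # On Python 3 `reload` is not a builtin, so this block is skipped.
    reload(sys)
    sys.setdefaultencoding('utf-8')
except Exception:
    pass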
    
sentence_delimiters = ['?', '!', ';', '？', '！', '。', '；', '……', '…', '\n']
allow_speech_tags = ['an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns', 'nt', 'nz', 't', 'v', 'vd', 'vn', 'eng']

PY2 = sys.version_info[0] == 2
if not PY2:
    # Python 3.x and up
    text_type    = str
    string_types = (str,)
    xrange       = range

    def as_text(v):  # Convert to a unicode string
        if v is None:
            return None
        elif isinstance(v, bytes):
            return v.decode('utf-8', errors='ignore')
        elif isinstance(v, str):
            return v
        else:
            raise ValueError('Unknown type %r' % type(v))

    def is_text(v):
        return isinstance(v, text_type)

else:
    # Python 2.x
    text_type    = unicode
    string_types = (str, unicode)
    xrange       = xrange

    def as_text(v):
        if v is None:
            return None
        elif isinstance(v, unicode):
            return v
        elif isinstance(v, str):
            return v.decode('utf-8', errors='ignore')
        else:
            raise ValueError('Invalid type %r' % type(v))

    def is_text(v):
        return isinstance(v, text_type)

__DEBUG = None

def debug(*args):
    """Print the arguments when the DEBUG environment variable is set to '1'."""
    global __DEBUG
    if __DEBUG is None:
        # Read the environment once and cache the result.
        __DEBUG = os.environ.get('DEBUG') == '1'
    if __DEBUG:
        print(' '.join([str(arg) for arg in args]))

class AttrDict(dict):
    """Dict that can get attribute by dot"""
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self


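def combine(word_list, window = 2):
    """Yield the word pairs that co-occur within `window`; the pairs are used to build edges between words.
    
    Keyword arguments:
    word_list  --  list of str, the words of one sentence.
    window     --  int, window size; values below 2 are treated as 2.
    """
    if window < 2:
        window = 2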
    for x in xrange(1, window):
        if x >= len(word_list):
            break
        word_list2 = word_list[x:]
        res = zip(word_list, word_list2)
        for r in res:
            yield r

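def get_similarity(word_list1, word_list2):
    """Default function for computing the similarity of two sentences.

    The score is the number of distinct words the two sentences share,
    divided by log(len(word_list1)) + log(len(word_list2)).

    Keyword arguments:
    word_list1, word_list2  --  the two sentences, each a list of words
    """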
    words   = list(set(word_list1 + word_list2))        
    vector1 = [float(word_list1.count(word)) for word in words]
    vector2 = [float(word_list2.count(word)) for word in words]
    
    vector3 = [vector1[x]*vector2[x]  for x in xrange(len(vector1))]
    vector4 = [1 for num in vector3 if num > 0.]
    co_occur_num = sum(vector4)

    if abs(co_occur_num) <= 1e-12:
        return 0.
    
    denominator = math.log(float(len(word_list1))) + math.log(float(len(word_list2))) # denominator
    
    if abs(denominator) < 1e-12:
        return 0.
    
    return co_occur_num / denominator

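def sort_words(vertex_source, edge_source, window = 2, pagerank_config = {'alpha': 0.85,}):
    """Sort words by importance, from most to least important.

    Keyword arguments:
    vertex_source   --  2-D list; each sub-list is a sentence and its elements are words. These words become the PageRank nodes.
    edge_source     --  2-D list; each sub-list is a sentence and its elements are words. Edges are built from the positions of the words.
    window          --  any two words within `window` consecutive words of a sentence are connected by an edge.
    pagerank_config --  settings passed to PageRank.
    """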
    sorted_words   = []
    word_index     = {}
    index_word     = {}
    _vertex_source = vertex_source
    _edge_source   = edge_source
    words_number   = 0
    for word_list in _vertex_source:
        for word in word_list:
            if not word in word_index:
                word_index[word] = words_number
                index_word[words_number] = word
                words_number += 1

    graph = np.zeros((words_number, words_number))
    
    for word_list in _edge_source:
        for w1, w2 in combine(word_list, window):
            if w1 in word_index and w2 in word_index:
                index1 = word_index[w1]
                index2 = word_index[w2]
                graph[index1][index2] = 1.0
                graph[index2][index1] = 1.0

    debug('graph:\n', graph)
    
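    # networkx 3.0 removed from_numpy_matrix; from_numpy_array (networkx >= 2.0) replaces it.
    build_graph = getattr(nx, 'from_numpy_array', None) or nx.from_numpy_matrix
    nx_graph = build_graph(graph)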
    scores = nx.pagerank(nx_graph, **pagerank_config)          # this is a dict
    sorted_scores = sorted(scores.items(), key = lambda item: item[1], reverse=True)
    for index, score in sorted_scores:
        item = AttrDict(word=index_word[index], weight=score)
        sorted_words.append(item)

    return sorted_words

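def sort_sentences(sentences, words, sim_func = get_similarity, pagerank_config = {'alpha': 0.85,}):
    """Sort sentences by importance, from most to least important.

    Keyword arguments:
    sentences         --  list of sentences.
    words             --  2-D list; each sub-list holds the words of the corresponding sentence in sentences.
    sim_func          --  function that computes the similarity of two sentences; it takes two lists of words.
    pagerank_config   --  settings passed to PageRank.
    """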
    sorted_sentences = []
    _source = words
    sentences_num = len(_source)        
    graph = np.zeros((sentences_num, sentences_num))
    
    for x in xrange(sentences_num):
        for y in xrange(x, sentences_num):
            similarity = sim_func( _source[x], _source[y] )
            graph[x, y] = similarity
            graph[y, x] = similarity
            
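    # networkx 3.0 removed from_numpy_matrix; from_numpy_array (networkx >= 2.0) replaces it.
    build_graph = getattr(nx, 'from_numpy_array', None) or nx.from_numpy_matrix
    nx_graph = build_graph(graph)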
    scores = nx.pagerank(nx_graph, **pagerank_config)              # this is a dict
    sorted_scores = sorted(scores.items(), key = lambda item: item[1], reverse=True)

    for index, score in sorted_scores:
        item = AttrDict(index=index, sentence=sentences[index], weight=score)
        sorted_sentences.append(item)

    return sorted_sentences

if __name__ == '__main__':
    pass
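
The windowing done by `combine`, and the way `sort_words` turns the resulting pairs into undirected edges, can be sketched in a few lines. The names `window_pairs` and `cooccurrence_edges` and the toy sentences are assumptions for illustration; the package itself stores the edges in a NumPy matrix and restricts them to words present in `vertex_source`.

```python
def window_pairs(words, window=2):
    """Yield (w1, w2) for words 1 .. window-1 positions apart, as `combine` does."""
    window = max(window, 2)
    for offset in range(1, window):
        if offset >= len(words):
            break
        # Pair each word with the word `offset` positions after it.
        for pair in zip(words, words[offset:]):
            yield pair


def cooccurrence_edges(sentences, window=2):
    """Collect undirected edges as sorted word pairs (hypothetical helper)."""
    edges = set()
    for words in sentences:
        for w1, w2 in window_pairs(words, window):
            edges.add(tuple(sorted((w1, w2))))
    return edges


print(list(window_pairs(['a', 'b', 'c', 'd'], window=3)))
print(sorted(cooccurrence_edges([['b', 'a'], ['a', 'b', 'c']])))
```

A larger `window` only adds pairs that are further apart; the pairs produced by a smaller window are always included.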

SYMBOL INDEX (34 symbols across 5 files)

FILE: test/util_test.py
  function testAttrDict (line 6) | def testAttrDict():
  function testCombine (line 12) | def testCombine():
  function testDebug (line 20) | def testDebug():

FILE: textrank4zh/Segmentation.py
  function get_default_stop_words_file (line 16) | def get_default_stop_words_file():
  class WordSegmentation (line 20) | class WordSegmentation(object):
    method __init__ (line 23) | def __init__(self, stop_words_file = None, allow_speech_tags = util.al...
    method segment (line 40) | def segment(self, text, lower = True, use_stop_words = True, use_speec...
    method segment_sentences (line 68) | def segment_sentences(self, sentences, lower=True, use_stop_words=True...
  class SentenceSegmentation (line 82) | class SentenceSegmentation(object):
    method __init__ (line 85) | def __init__(self, delimiters=util.sentence_delimiters):
    method segment (line 92) | def segment(self, text):
  class Segmentation (line 105) | class Segmentation(object):
    method __init__ (line 107) | def __init__(self, stop_words_file = None,
    method segment (line 118) | def segment(self, text, lower = False):

FILE: textrank4zh/TextRank4Keyword.py
  class TextRank4Keyword (line 16) | class TextRank4Keyword(object):
    method __init__ (line 18) | def __init__(self, stop_words_file = None,
    method analyze (line 43) | def analyze(self, text,
    method get_keywords (line 95) | def get_keywords(self, num = 6, word_min_len = 1):
    method get_keyphrases (line 111) | def get_keyphrases(self, keywords_num = 12, min_occur_num = 2):

FILE: textrank4zh/TextRank4Sentence.py
  class TextRank4Sentence (line 16) | class TextRank4Sentence(object):
    method __init__ (line 18) | def __init__(self, stop_words_file = None,
    method analyze (line 43) | def analyze(self, text, lower = False,
    method get_key_sentences (line 76) | def get_key_sentences(self, num = 6, sentence_min_len = 6):

FILE: textrank4zh/util.py
  function as_text (line 32) | def as_text(v):  # Convert to a unicode string
  function is_text (line 42) | def is_text(v):
  function as_text (line 51) | def as_text(v):
  function is_text (line 61) | def is_text(v):
  function debug (line 66) | def debug(*args):
  class AttrDict (line 79) | class AttrDict(dict):
    method __init__ (line 81) | def __init__(self, *args, **kwargs):
  function combine (line 86) | def combine(word_list, window = 2):
  function get_similarity (line 102) | def get_similarity(word_list1, word_list2):
  function sort_words (line 126) | def sort_words(vertex_source, edge_source, window = 2, pagerank_config =...
  function sort_sentences (line 169) | def sort_sentences(sentences, words, sim_func = get_similarity, pagerank...
