Showing preview only (2,461K chars total). Download the full file or copy to clipboard to get everything.
Repository: sloria/TextBlob
Branch: dev
Commit: a1c8944b72bb
Files: 88
Total size: 2.3 MB
Directory structure:
gitextract_d9nsa0ec/
├── .github/
│ ├── dependabot.yml
│ └── workflows/
│ └── build-release.yml
├── .gitignore
├── .konchrc
├── .pre-commit-config.yaml
├── .readthedocs.yml
├── AUTHORS.rst
├── CHANGELOG.rst
├── CONTRIBUTING.rst
├── LICENSE
├── NOTICE
├── README.rst
├── RELEASING.md
├── SECURITY.md
├── docs/
│ ├── Makefile
│ ├── _templates/
│ │ ├── side-primary.html
│ │ └── side-secondary.html
│ ├── _themes/
│ │ ├── .gitignore
│ │ ├── LICENSE
│ │ ├── flask_theme_support.py
│ │ ├── kr/
│ │ │ ├── layout.html
│ │ │ ├── relations.html
│ │ │ ├── static/
│ │ │ │ ├── flasky.css_t
│ │ │ │ └── small_flask.css
│ │ │ └── theme.conf
│ │ └── kr_small/
│ │ ├── layout.html
│ │ ├── static/
│ │ │ └── flasky.css_t
│ │ └── theme.conf
│ ├── advanced_usage.rst
│ ├── api_reference.rst
│ ├── authors.rst
│ ├── changelog.rst
│ ├── classifiers.rst
│ ├── conf.py
│ ├── contributing.rst
│ ├── extensions.rst
│ ├── index.rst
│ ├── install.rst
│ ├── license.rst
│ ├── make.bat
│ └── quickstart.rst
├── pyproject.toml
├── src/
│ └── textblob/
│ ├── __init__.py
│ ├── _text.py
│ ├── base.py
│ ├── blob.py
│ ├── classifiers.py
│ ├── decorators.py
│ ├── download_corpora.py
│ ├── en/
│ │ ├── __init__.py
│ │ ├── en-context.txt
│ │ ├── en-entities.txt
│ │ ├── en-lexicon.txt
│ │ ├── en-morphology.txt
│ │ ├── en-sentiment.xml
│ │ ├── en-spelling.txt
│ │ ├── inflect.py
│ │ ├── np_extractors.py
│ │ ├── parsers.py
│ │ ├── sentiments.py
│ │ └── taggers.py
│ ├── exceptions.py
│ ├── formats.py
│ ├── inflect.py
│ ├── mixins.py
│ ├── np_extractors.py
│ ├── parsers.py
│ ├── sentiments.py
│ ├── taggers.py
│ ├── tokenizers.py
│ ├── utils.py
│ └── wordnet.py
├── tests/
│ ├── __init__.py
│ ├── data.csv
│ ├── data.json
│ ├── data.tsv
│ ├── test_blob.py
│ ├── test_classifiers.py
│ ├── test_decorators.py
│ ├── test_formats.py
│ ├── test_inflect.py
│ ├── test_np_extractor.py
│ ├── test_parsers.py
│ ├── test_sentiments.py
│ ├── test_taggers.py
│ ├── test_tokenizers.py
│ └── test_utils.py
└── tox.ini
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: pip
directory: "/"
schedule:
interval: daily
open-pull-requests-limit: 10
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "monthly"
================================================
FILE: .github/workflows/build-release.yml
================================================
name: build
on:
push:
branches: ["dev"]
tags: ["*"]
pull_request:
jobs:
tests:
name: ${{ matrix.name }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- { name: "3.9", python: "3.9", tox: py39 }
- { name: "3.13", python: "3.13", tox: py313 }
- { name: "lowest", python: "3.9", tox: py39-lowest }
steps:
- uses: actions/checkout@v6
- uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python }}
- name: Download nltk data
run: |
pip install .
python -m textblob.download_corpora
- run: python -m pip install tox
- run: python -m tox -e${{ matrix.tox }}
build:
name: Build package
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- uses: actions/setup-python@v6
with:
python-version: "3.11"
- name: Install pypa/build
run: python -m pip install build
- name: Build a binary wheel and a source tarball
run: python -m build
- name: Install twine
run: python -m pip install twine
- name: Check build
run: python -m twine check --strict dist/*
- name: Store the distribution packages
uses: actions/upload-artifact@v7
with:
name: python-package-distributions
path: dist/
# this duplicates pre-commit.ci, so only run it on tags
# it guarantees that linting is passing prior to a release
lint-pre-release:
if: startsWith(github.ref, 'refs/tags')
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- uses: actions/setup-python@v6
with:
python-version: "3.11"
- run: python -m pip install tox
- run: python -m tox -e lint
publish-to-pypi:
name: PyPI release
if: startsWith(github.ref, 'refs/tags/')
needs: [build, tests, lint-pre-release]
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/p/textblob
permissions:
id-token: write
steps:
- name: Download all the dists
uses: actions/download-artifact@v8
with:
name: python-package-distributions
path: dist/
- name: Publish distribution to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
================================================
FILE: .gitignore
================================================
*.py[cod]
# virtualenv
.venv/
venv/
# C extensions
*.so
# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
# pip
pip-log.txt
pip-wheel-metadata
# Unit test / coverage reports
.coverage
.tox
nosetests.xml
test-output/
.pytest_cache
# Translations
*.mo
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
# Complexity
output/*.html
output/*/index.html
# Sphinx
docs/_build
README.html
# mypy
.mypy_cache
!tests/.env
# ruff
.ruff_cache
================================================
FILE: .konchrc
================================================
# -*- coding: utf-8 -*-
# vi: set ft=python :
import konch
from textblob import TextBlob, Blobber, Word, Sentence
konch.config({
'context': {
'tb': TextBlob,
'Blobber': Blobber,
'Word': Word,
'Sentence': Sentence,
},
'prompt': '>>> ',
'ipy_autoreload': True,
})
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.6
hooks:
- id: ruff
- id: ruff-format
- repo: https://github.com/python-jsonschema/check-jsonschema
rev: 0.37.0
hooks:
- id: check-github-workflows
- repo: https://github.com/asottile/blacken-docs
rev: 1.20.0
hooks:
- id: blacken-docs
additional_dependencies: [black==24.10.0]
================================================
FILE: .readthedocs.yml
================================================
version: 2
sphinx:
configuration: docs/conf.py
formats:
- pdf
build:
os: ubuntu-22.04
tools:
python: "3.11"
python:
install:
- method: pip
path: .
extra_requirements:
- docs
================================================
FILE: AUTHORS.rst
================================================
*******
Authors
*******
Development Lead
================
- Steven Loria <sloria1@gmail.com> `@sloria <https://github.com/sloria>`_
Contributors (chronological)
============================
- Pete Keen `@peterkeen <https://github.com/peterkeen>`_
- Matthew Honnibal `@syllog1sm <https://github.com/syllog1sm>`_
- Roman Yankovsky `@RomanYankovsky <https://github.com/RomanYankovsky>`_
- David Karesh `@davidnk <https://github.com/davidnk>`_
- Evan Dempsey `@evandempsey <https://github.com/evandempsey>`_
- Wesley Childs `@mrchilds <https://github.com/mrchilds>`_
- Jeff Schnurr `@jschnurr <https://github.com/jschnurr>`_
- Adel Qalieh `@adelq <https://github.com/adelq>`_
- Lage Ragnarsson `@lragnarsson <https://github.com/lragnarsson>`_
- Jonathon Coe `@jonmcoe <https://github.com/jonmcoe>`_
- Adrián López Calvo `@AdrianLC <https://github.com/AdrianLC>`_
- Nitish Kulshrestha `@nitkul <https://github.com/nitkul>`_
- Jhon Eslava `@EpicJhon <https://github.com/EpicJhon>`_
- `@jcalbert <https://github.com/jcalbert>`_
- Tyler James Harden `@tylerjharden <https://github.com/tylerjharden>`_
- `@pavelmalai <https://github.com/pavelmalai>`_
- Jeff Kolb `@jeffakolb <https://github.com/jeffakolb>`_
- Daniel Ong `@danong <https://github.com/danong>`_
- Jamie Moschella `@jammmo <https://github.com/jammmo>`_
- Roman Korolev `@roman-y-korolev <https://github.com/roman-y-korolev>`_
- Ram Rachum `@cool-RR <https://github.com/cool-RR>`_
- Romain Casati `@casatir <https://github.com/casatir>`_
- Evgeny Kemerov `@sudoguy <https://github.com/sudoguy>`_
- Karthikeyan Singaravelan `@tirkarthi <https://github.com/tirkarthi>`_
- John Franey `@johnfraney <https://github.com/johnfraney>`_
================================================
FILE: CHANGELOG.rst
================================================
Changelog
=========
0.19.0 (2025-01-13)
___________________
Bug fixes:
- Fix ``textblob.download_corpora`` script (:issue:`474`).
Thanks :user:`cagan-elden` for reporting.
Changes:
- Remove vendorized ``unicodecsv`` module, as it's no longer used.
- Support Python 3.9-3.13 and nltk>=3.9 (:pr:`486`)
Thanks :user:`johnfraney` for the PR.
0.18.0 (2024-02-15)
-------------------
Bug fixes:
- Remove usage of deprecated cElementTree (:issue:`339`).
Thanks :user:`tirkarthi` for reporting and for the PR.
- Address ``SyntaxWarning`` on Python 3.12 (:pr:`418`).
Thanks :user:`smontanaro` for the PR.
Removals:
- ``TextBlob.translate()`` and ``TextBlob.detect_language``, and ``textblob.translate``
are removed. Use the official Google Translate API instead (:issue:`215`).
- Remove ``textblob.compat``.
Support:
- Support Python 3.8-3.12. Older versions are no longer supported.
- Support nltk>=3.8.
0.17.1 (2021-10-21)
-------------------
Bug fixes:
- Fix translation and language detection (:issue:`395`).
Thanks :user:`sudoguy` for the patch.
0.17.0 (2021-02-17)
-------------------
Features:
- Performance improvement: Use ``chain.from_iterable`` in ``_text.py``
to improve runtime and memory usage (:pr:`333`). Thanks :user:`cool-RR` for the PR.
Other changes:
- Remove usage of `ctypes` (:pr:`354`). Thanks :user:`casatir`.
0.16.0 (2020-04-26)
-------------------
Deprecations:
- ``TextBlob.translate()`` and ``TextBlob.detect_language`` are deprecated. Use the official Google Translate API instead (:issue:`215`).
Other changes:
- *Backwards-incompatible*: Drop support for Python 3.4.
- Test against Python 3.7 and Python 3.8.
- Pin NLTK to ``nltk<3.5`` on Python 2 (:issue:`315`).
0.15.3 (2019-02-24)
-------------------
Bug fixes:
- Fix bug when ``Word`` string type after pos_tags is not a ``str``
(:pr:`255`). Thanks :user:`roman-y-korolev` for the patch.
0.15.2 (2018-11-21)
-------------------
Bug fixes:
- Fix bug that raised a ``RuntimeError`` when executing methods that
delegate to ``pattern.en`` (:issue:`230`). Thanks :user:`vvaezian`
for the report and thanks :user:`danong` for the fix.
- Fix methods of ``WordList`` that modified the list in-place by
removing the internal `_collection` variable (:pr:`235`). Thanks
:user:`jammmo` for the PR.
0.15.1 (2018-01-20)
-------------------
Bug fixes:
- Convert POS tags from treebank to wordnet when calling ``lemmatize``
to prevent ``MissingCorpusError`` (:issue:`160`). Thanks
:user:`jschnurr`.
0.15.0 (2017-12-02)
-------------------
Features:
- Add `TextBlob.sentiment_assessments` property which exposes pattern's
sentiment assessments (:issue:`170`). Thanks :user:`jeffakolb`.
0.14.0 (2017-11-20)
-------------------
Features:
- Use specified tokenizer when tagging (:issue:`167`). Thanks
:user:`jschnurr` for the PR.
0.13.1 (2017-11-11)
-------------------
Bug fixes:
- Avoid AttributeError when using pattern's sentiment analyzer
(:issue:`178`). Thanks :user:`tylerjharden` for the catch and patch.
- Correctly pass ``format`` argument to ``NLTKClassifier.accuracy``
(:issue:`177`). Thanks :user:`pavelmalai` for the catch and patch.
0.13.0 (2017-08-15)
-------------------
Features:
- Performance improvements to `NaiveBayesClassifier` (:issue:`63`, :issue:`77`,
:issue:`123`). Thanks :user:`jcalbert` for the PR.
0.12.0 (2017-02-27)
-------------------
Features:
- Add `Word.stem` and `WordList.stem` methods (:issue:`145`). Thanks :user:`nitkul`.
Bug fixes:
- Fix translation and language detection (:issue:`137`). Thanks :user:`EpicJhon` for the fix.
Changes:
- *Backwards-incompatible*: Remove Python 2.6 and 3.3 support.
0.11.1 (2016-02-17)
-------------------
Bug fixes:
- Fix translation and language detection (:issue:`115`, :issue:`117`, :issue:`119`). Thanks :user:`AdrianLC` and :user:`jschnurr` for the fix. Thanks :user:`AdrianLC`, :user:`edgaralts`, and :user:`pouya-cognitiv` for reporting.
0.11.0 (2015-11-01)
-------------------
Changes:
- Compatible with nltk>=3.1. NLTK versions < 3.1 are no longer supported.
- Change default tagger to NLTKTagger (uses NLTK's averaged perceptron tagger).
- Tested on Python 3.5.
Bug fixes:
- Fix singularization of a number of words. Thanks :user:`jonmcoe`.
- Fix spelling correction when nltk>=3.1 is installed (:issue:`99`). Thanks :user:`shubham12101` for reporting.
0.10.0 (2015-10-04)
-------------------
Changes:
- Unchanged text is now considered a translation error. Raises ``NotTranslated`` (:issue:`76`). Thanks :user:`jschnurr`.
Bug fixes:
- ``Translator.translate`` will detect language of input text by default (:issue:`85`). Thanks again :user:`jschnurr`.
- Fix matching of tagged phrases with CFG in ``ConllExtractor``. Thanks :user:`lragnarsson`.
- Fix inflection of a few irregular English nouns. Thanks :user:`jonmcoe`.
0.9.1 (2015-06-10)
------------------
Bug fixes:
- Fix ``DecisionTreeClassifier.pprint`` for compatibility with nltk>=3.0.2.
- Translation no longer adds erroneous whitespace around punctuation characters (:issue:`83`). Thanks :user:`AdrianLC` for reporting and thanks :user:`jschnurr` for the patch.
0.9.0 (2014-09-15)
------------------
- TextBlob now depends on NLTK 3. The vendorized version of NLTK has been removed.
- Fix bug that raised a `SyntaxError` when translating text with non-ascii characters on Python 3.
- Fix bug that showed "double-escaped" unicode characters in translator output (issue #56). Thanks Evan Dempsey.
- *Backwards-incompatible*: Completely remove ``import text.blob``. You should ``import textblob`` instead.
- *Backwards-incompatible*: Completely remove ``PerceptronTagger``. Install ``textblob-aptagger`` instead.
- *Backwards-incompatible*: Rename ``TextBlobException`` to ``TextBlobError`` and ``MissingCorpusException`` to ``MissingCorpusError``.
- *Backwards-incompatible*: ``Format`` classes are passed a file object rather than a file path.
- *Backwards-incompatible*: If training a classifier with data from a file, you must pass a file object (rather than a file path).
- Updated English sentiment corpus.
- Add ``feature_extractor`` parameter to ``NaiveBayesAnalyzer``.
- Add ``textblob.formats.get_registry()`` and ``textblob.formats.register()`` which allows users to register custom data source formats.
- Change ``BaseClassifier.detect`` from a ``staticmethod`` to a ``classmethod``.
- Improved docs.
- Tested on Python 3.4.
0.8.4 (2014-02-02)
------------------
- Fix display (``__repr__``) of WordList slices on Python 3.
- Add download_corpora module. Corpora must now be downloaded using ``python -m textblob.download_corpora``.
0.8.3 (2013-12-29)
------------------
- Sentiment analyzers return namedtuples, e.g. ``Sentiment(polarity=0.12, subjectivity=0.34)``.
- Memory usage improvements to NaiveBayesAnalyzer and basic_extractor (default feature extractor for classifiers module).
- Add ``textblob.tokenizers.sent_tokenize`` and ``textblob.tokenizers.word_tokenize`` convenience functions.
- Add ``textblob.classifiers.MaxEntClassifer``.
- Improved NLTKTagger.
0.8.2 (2013-12-21)
------------------
- Fix bug in spelling correction that stripped some punctuation (Issue #48).
- Various improvements to spelling correction: preserves whitespace characters (Issue #12); handle contractions and punctuation between words. Thanks @davidnk.
- Make ``TextBlob.words`` more memory-efficient.
- Translator now sends POST instead of GET requests. This allows for larger bodies of text to be translated (Issue #49).
- Update pattern tagger for better accuracy.
0.8.1 (2013-11-16)
------------------
- Fix bug that caused ``ValueError`` upon sentence tokenization. This removes modifications made to the NLTK sentence tokenizer.
- Add ``Word.lemmatize()`` method that allows passing in a part-of-speech argument.
- ``Word.lemma`` returns correct part of speech for Word objects that have their ``pos`` attribute set. Thanks @RomanYankovsky.
0.8.0 (2013-10-23)
------------------
- *Backwards-incompatible*: Renamed package to ``textblob``. This avoids clashes with other namespaces called `text`. TextBlob should now be imported with ``from textblob import TextBlob``.
- Update pattern resources for improved parser accuracy.
- Update NLTK.
- Allow Translator to connect to proxy server.
- PerceptronTagger completely deprecated. Install the ``textblob-aptagger`` extension instead.
0.7.1 (2013-09-30)
------------------
- Bugfix updates.
- Fix bug in feature extraction for ``NaiveBayesClassifier``.
- ``basic_extractor`` is now case-sensitive, e.g. contains(I) != contains(i)
- Fix ``repr`` output when a TextBlob contains non-ascii characters.
- Fix part-of-speech tagging with ``PatternTagger`` on Windows.
- Suppress warning about not having scikit-learn installed.
0.7.0 (2013-09-25)
------------------
- Wordnet integration. ``Word`` objects have ``synsets`` and ``definitions`` properties. The ``text.wordnet`` module allows you to create ``Synset`` and ``Lemma`` objects directly.
- Move all English-specific code to its own module, ``text.en``.
- Basic extensions framework in place. TextBlob has been refactored to make it easier to develop extensions.
- Add ``text.classifiers.PositiveNaiveBayesClassifier``.
- Update NLTK.
- ``NLTKTagger`` now working on Python 3.
- Fix ``__str__`` behavior. ``print(blob)`` should now print non-ascii text correctly in both Python 2 and 3.
- *Backwards-incompatible*: All abstract base classes have been moved to the ``text.base`` module.
- *Backwards-incompatible*: ``PerceptronTagger`` will now be maintained as an extension, ``textblob-aptagger``. Instantiating a ``text.taggers.PerceptronTagger()`` will raise a ``DeprecationWarning``.
0.6.3 (2013-09-15)
------------------
- Word tokenization fix: Words that stem from a contraction will still have an apostrophe, e.g. ``"Let's" => ["Let", "'s"]``.
- Fix bug with comparing blobs to strings.
- Add ``text.taggers.PerceptronTagger``, a fast and accurate POS tagger. Thanks `@syllog1sm <http://github.com/syllog1sm>`_.
- Note for Python 3 users: You may need to update your corpora, since NLTK master has reorganized its corpus system. Just run ``curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python`` again.
- Add ``download_corpora_lite.py`` script for getting the minimum corpora requirements for TextBlob's basic features.
0.6.2 (2013-09-05)
------------------
- Fix bug that resulted in a ``UnicodeEncodeError`` when tagging text with non-ascii characters.
- Add ``DecisionTreeClassifier``.
- Add ``labels()`` and ``train()`` methods to classifiers.
0.6.1 (2013-09-01)
------------------
- Classifiers can be trained and tested on CSV, JSON, or TSV data.
- Add basic WordNet lemmatization via the ``Word.lemma`` property.
- ``WordList.pluralize()`` and ``WordList.singularize()`` methods return ``WordList`` objects.
0.6.0 (2013-08-25)
------------------
- Add Naive Bayes classification. New ``text.classifiers`` module, ``TextBlob.classify()``, and ``Sentence.classify()`` methods.
- Add parsing functionality via the ``TextBlob.parse()`` method. The ``text.parsers`` module currently has one implementation (``PatternParser``).
- Add spelling correction. This includes the ``TextBlob.correct()`` and ``Word.spellcheck()`` methods.
- Update NLTK.
- Backwards incompatible: ``clean_html`` has been deprecated, just as it has in NLTK. Use Beautiful Soup's ``soup.get_text()`` method for HTML-cleaning instead.
- Slight API change to language translation: if ``from_lang`` isn't specified, attempts to detect the language.
- Add ``itokenize()`` method to tokenizers that returns a generator instead of a list of tokens.
0.5.3 (2013-08-21)
------------------
- Unicode fixes: This fixes a bug that sometimes raised a ``UnicodeEncodeError`` upon creating accessing ``sentences`` for TextBlobs with non-ascii characters.
- Update NLTK
0.5.2 (2013-08-14)
------------------
- `Important patch update for NLTK users`: Fix bug with importing TextBlob if local NLTK is installed.
- Fix bug with computing start and end indices of sentences.
0.5.1 (2013-08-13)
------------------
- Fix bug that disallowed display of non-ascii characters in the Python REPL.
- Backwards incompatible: Restore ``blob.json`` property for backwards compatibility with textblob<=0.3.10. Add a ``to_json()`` method that takes the same arguments as ``json.dumps``.
- Add ``WordList.append`` and ``WordList.extend`` methods that append Word objects.
0.5.0 (2013-08-10)
------------------
- Language translation and detection API!
- Add ``text.sentiments`` module. Contains the ``PatternAnalyzer`` (default implementation) as well as a ``NaiveBayesAnalyzer``.
- Part-of-speech tags can be accessed via ``TextBlob.tags`` or ``TextBlob.pos_tags``.
- Add ``polarity`` and ``subjectivity`` helper properties.
0.4.0 (2013-08-05)
------------------
- New ``text.tokenizers`` module with ``WordTokenizer`` and ``SentenceTokenizer``. Tokenizer instances (from either textblob itself or NLTK) can be passed to TextBlob's constructor. Tokens are accessed through the new ``tokens`` property.
- New ``Blobber`` class for creating TextBlobs that share the same tagger, tokenizer, and np_extractor.
- Add ``ngrams`` method.
- `Backwards-incompatible`: ``TextBlob.json()`` is now a method, not a property. This allows you to pass arguments (the same that you would pass to ``json.dumps()``).
- New home for documentation: https://textblob.readthedocs.io/
- Add parameter for cleaning HTML markup from text.
- Minor improvement to word tokenization.
- Updated NLTK.
- Fix bug with adding blobs to bytestrings.
0.3.10 (2013-08-02)
-------------------
- Bundled NLTK no longer overrides local installation.
- Fix sentiment analysis of text with non-ascii characters.
0.3.9 (2013-07-31)
------------------
- Updated nltk.
- ConllExtractor is now Python 3-compatible.
- Improved sentiment analysis.
- Blobs are equal (with `==`) to their string counterparts.
- Added instructions to install textblob without nltk bundled.
- Dropping official 3.1 and 3.2 support.
0.3.8 (2013-07-30)
------------------
- Importing TextBlob is now **much faster**. This is because the noun phrase parsers are trained only on the first call to ``noun_phrases`` (instead of training them every time you import TextBlob).
- Add text.taggers module which allows user to change which POS tagger implementation to use. Currently supports PatternTagger and NLTKTagger (NLTKTagger only works with Python 2).
- NPExtractor and Tagger objects can be passed to TextBlob's constructor.
- Fix bug with POS-tagger not tagging one-letter words.
- Rename text/np_extractor.py -> text/np_extractors.py
- Add run_tests.py script.
0.3.7 (2013-07-28)
------------------
- Every word in a ``Blob`` or ``Sentence`` is a ``Word`` instance which has methods for inflection, e.g ``word.pluralize()`` and ``word.singularize()``.
- Updated the ``np_extractor`` module. Now has an new implementation, ``ConllExtractor`` that uses the Conll2000 chunking corpus. Only works on Py2.
================================================
FILE: CONTRIBUTING.rst
================================================
Contributing guidelines
=======================
In General
----------
- `PEP 8`_, when sensible.
- Conventions *and* configuration.
- TextBlob wraps functionality in NLTK and pattern.en. Anything outside of that should be written as an extension.
- Test ruthlessly. Write docs for new features.
- Even more important than Test-Driven Development--*Human-Driven Development*.
- These guidelines may--and probably will--change.
.. _`PEP 8`: http://www.python.org/dev/peps/pep-0008/
In Particular
-------------
Questions, Feature Requests, Bug Reports, and Feedback. . .
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
. . .should all be reported on the `Github Issue Tracker`_ .
.. _`Github Issue Tracker`: https://github.com/sloria/TextBlob/issues?state=open
Setting Up for Local Development
++++++++++++++++++++++++++++++++
1. Fork TextBlob_ on Github. ::
$ git clone https://github.com/sloria/TextBlob.git
$ cd TextBlob
2. Install development requirements. It is highly recommended that you use a virtualenv. ::
# After activating your virtualenv
$ pip install -e '.[tests]'
.. _extension-development:
Developing Extensions
+++++++++++++++++++++
Extensions are packages with the name ``textblob-something``, where "something" is the name of your extension. Extensions should be imported with ``import textblob_something``.
Model Extensions
++++++++++++++++
To create a new extension for a part-of-speech tagger, sentiment analyzer, noun phrase extractor, classifier, tokenizer, or parser, simply create a module that has a class that implements the correct interface from ``textblob.base``. For example, a tagger might look like this:
.. code-block:: python
from textblob.base import BaseTagger
class MyTagger(BaseTagger):
def tag(self, text):
pass
# Your implementation goes here
Language Extensions
+++++++++++++++++++
The process for developing language extensions is the same as developing model extensions. Create your part-of-speech taggers, tokenizers, parsers, etc. in the language of your choice. Packages should be named ``textblob-xx`` where "xx" is the two- or three-letter language code (`Language code reference`_).
.. _Language code reference: http://www.loc.gov/standards/iso639-2/php/code_list.php
To see examples of existing extensions, visit the :ref:`Extensions <extensions>` page.
Check out the :ref:`API reference <api_base_classes>` for more info on the model interfaces.
Git Branch Structure
++++++++++++++++++++
TextBlob loosely follows Vincent Driessen's `Successful Git Branching Model <http://http://nvie.com/posts/a-successful-git-branching-model/>`_ . In practice, the following branch conventions are used:
``dev``
The next release branch.
``master``
Current production release on PyPI.
Pull Requests
++++++++++++++
1. Create a new local branch.
::
$ git checkout -b name-of-feature
2. Commit your changes. Write `good commit messages <http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html>`_.
::
$ git commit -m "Detailed commit message"
$ git push origin name-of-feature
3. Before submitting a pull request, check the following:
- If the pull request adds functionality, it is tested and the docs are updated.
- If you've developed an extension, it is on the :ref:`Extensions List <extensions>`.
- You've added yourself to ``AUTHORS.rst``.
4. Submit a pull request to the ``sloria:dev`` branch.
Running tests
+++++++++++++
To run all the tests: ::
$ pytest
To skip slow tests: ::
$ pytest -m 'not slow'
Documentation
+++++++++++++
Contributions to the documentation are welcome. Documentation is written in `reStructuredText`_ (rST). A quick rST reference can be found `here <https://docutils.sourceforge.io/docs/user/rst/quickref.html>`_. Builds are powered by Sphinx_.
To build docs and run in watch mode: ::
$ tox -e docs-serve
.. _Sphinx: http://sphinx.pocoo.org/
.. _`reStructuredText`: https://docutils.sourceforge.io/rst.html
.. _TextBlob: https://github.com/sloria/TextBlob
================================================
FILE: LICENSE
================================================
Copyright Steven Loria and contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
================================================
FILE: NOTICE
================================================
TextBlob includes some vendorized python libraries, including parts of pattern.
pattern License
===============
Copyright (c) 2011-2013 University of Antwerp, Belgium
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
* Neither the name of Pattern nor the names of its
contributors may be used to endorse or promote products
derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: README.rst
================================================
TextBlob: Simplified Text Processing
====================================
.. image:: https://badgen.net/pypi/v/TextBlob
:target: https://pypi.org/project/textblob/
:alt: Latest version
.. image:: https://github.com/sloria/TextBlob/actions/workflows/build-release.yml/badge.svg
:target: https://github.com/sloria/TextBlob/actions/workflows/build-release.yml
:alt: Build status
Homepage: `https://textblob.readthedocs.io/ <https://textblob.readthedocs.io/>`_
`TextBlob` is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more.
.. code-block:: python
from textblob import TextBlob
text = """
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
"""
blob = TextBlob(text)
blob.tags # [('The', 'DT'), ('titular', 'JJ'),
# ('threat', 'NN'), ('of', 'IN'), ...]
blob.noun_phrases # WordList(['titular threat', 'blob',
# 'ultimate movie monster',
# 'amoeba-like mass', ...])
for sentence in blob.sentences:
print(sentence.sentiment.polarity)
# 0.060
# -0.341
TextBlob stands on the giant shoulders of `NLTK`_ and `pattern`_, and plays nicely with both.
Features
--------
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- `n`-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration
Get it now
----------
::
$ pip install -U textblob
$ python -m textblob.download_corpora
Examples
--------
See more examples at the `Quickstart guide`_.
.. _`Quickstart guide`: https://textblob.readthedocs.io/en/latest/quickstart.html#quickstart
Documentation
-------------
Full documentation is available at https://textblob.readthedocs.io/.
Project Links
-------------
- Docs: https://textblob.readthedocs.io/
- Changelog: https://textblob.readthedocs.io/en/latest/changelog.html
- PyPI: https://pypi.python.org/pypi/TextBlob
- Issues: https://github.com/sloria/TextBlob/issues
License
-------
MIT licensed. See the bundled `LICENSE <https://github.com/sloria/TextBlob/blob/master/LICENSE>`_ file for more details.
.. _pattern: https://github.com/clips/pattern/
.. _NLTK: http://nltk.org/
================================================
FILE: RELEASING.md
================================================
# Releasing
1. Bump version in `pyproject.toml` and update the changelog
with today's date.
2. Commit: `git commit -m "Bump version and update changelog"`
3. Tag the commit: `git tag x.y.z`
4. Push: `git push --tags origin dev`. CI will take care of the
PyPI release.
================================================
FILE: SECURITY.md
================================================
# Security Contact Information
To report a security vulnerability, please use the
[Tidelift security contact](https://tidelift.com/security).
Tidelift will coordinate the fix and disclosure.
================================================
FILE: docs/Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/textblob.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/textblob.qhc"
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/textblob"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/textblob"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: docs/_templates/side-primary.html
================================================
<p class="logo">
<a href="{{ pathto(master_doc) }}"
><img
class="logo"
src="{{ pathto('_static/textblob-logo.png', 1) }}"
height="200"
width="230"
alt="Logo"
/></a>
</p>
<p>
<iframe
src="https://ghbtns.com/github-btn.html?user=sloria&repo=TextBlob&type=watch&count=true&size=large"
allowtransparency="true"
frameborder="0"
scrolling="0"
width="200px"
height="35px"
></iframe>
</p>
<p>
TextBlob is a Python library for processing textual data. It
provides a consistent API for diving into common natural language processing
(NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment
analysis, and more.
</p>
<h3>Useful Links</h3>
<ul>
<li><a href="https://pypi.python.org/pypi/textblob">TextBlob @ PyPI</a></li>
<li><a href="https://github.com/sloria/textblob">TextBlob @ GitHub</a></li>
<li><a href="https://github.com/sloria/textblob/issues">Issue Tracker</a></li>
</ul>
<h3>Stay Informed</h3>
<p>
<iframe
src="https://ghbtns.com/github-btn.html?user=sloria&type=follow"
allowtransparency="true"
frameborder="0"
scrolling="0"
width="165"
height="20"
></iframe>
</p>
================================================
FILE: docs/_templates/side-secondary.html
================================================
<p class="logo">
<a href="{{ pathto(master_doc) }}"
><img
class="logo"
src="{{ pathto('_static/textblob-logo.png', 1) }}"
height="200"
width="230"
alt="Logo"
/></a>
</p>
<p>
<iframe
src="https://ghbtns.com/github-btn.html?user=sloria&repo=TextBlob&type=watch&count=true&size=large"
allowtransparency="true"
frameborder="0"
scrolling="0"
width="200px"
height="35px"
></iframe>
</p>
<p>
TextBlob is a Python library for processing textual data. It provides a
consistent API for diving into common natural language processing (NLP) tasks
such as part-of-speech tagging, noun phrase extraction, sentiment analysis,
and more.
</p>
<h3>Useful Links</h3>
<ul>
<li><a href="https://pypi.python.org/pypi/textblob">TextBlob @ PyPI</a></li>
<li><a href="https://github.com/sloria/textblob">TextBlob @ GitHub</a></li>
<li><a href="https://github.com/sloria/textblob/issues">Issue Tracker</a></li>
</ul>
================================================
FILE: docs/_themes/.gitignore
================================================
*.pyc
*.pyo
.DS_Store
================================================
FILE: docs/_themes/LICENSE
================================================
Modifications:
Copyright (c) 2010 Kenneth Reitz.
Original Project:
Copyright (c) 2010 by Armin Ronacher.
Some rights reserved.
Redistribution and use in source and binary forms of the theme, with or
without modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* The names of the contributors may not be used to endorse or
promote products derived from this software without specific
prior written permission.
We kindly ask you to only use these themes in an unmodified manner just
for Flask and Flask-related products, not for unrelated projects. If you
like the visual style and want to use it for your own projects, please
consider making some larger changes to the themes (such as changing
font faces, sizes, colors or margins).
THIS THEME IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS THEME, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: docs/_themes/flask_theme_support.py
================================================
# flasky extensions. flasky pygments style based on tango style
from pygments.style import Style
from pygments.token import (
Comment,
Error,
Generic,
Keyword,
Literal,
Name,
Number,
Operator,
Other,
Punctuation,
String,
Whitespace,
)
class FlaskyStyle(Style):
background_color = "#f8f8f8"
default_style = ""
styles = {
# No corresponding class for the following:
# Text: "", # class: ''
Whitespace: "underline #f8f8f8", # class: 'w'
Error: "#a40000 border:#ef2929", # class: 'err'
Other: "#000000", # class 'x'
Comment: "italic #8f5902", # class: 'c'
Comment.Preproc: "noitalic", # class: 'cp'
Keyword: "bold #004461", # class: 'k'
Keyword.Constant: "bold #004461", # class: 'kc'
Keyword.Declaration: "bold #004461", # class: 'kd'
Keyword.Namespace: "bold #004461", # class: 'kn'
Keyword.Pseudo: "bold #004461", # class: 'kp'
Keyword.Reserved: "bold #004461", # class: 'kr'
Keyword.Type: "bold #004461", # class: 'kt'
Operator: "#582800", # class: 'o'
Operator.Word: "bold #004461", # class: 'ow' - like keywords
Punctuation: "bold #000000", # class: 'p'
# because special names such as Name.Class, Name.Function, etc.
# are not recognized as such later in the parsing, we choose them
# to look the same as ordinary variables.
Name: "#000000", # class: 'n'
Name.Attribute: "#c4a000", # class: 'na' - to be revised
Name.Builtin: "#004461", # class: 'nb'
Name.Builtin.Pseudo: "#3465a4", # class: 'bp'
Name.Class: "#000000", # class: 'nc' - to be revised
Name.Constant: "#000000", # class: 'no' - to be revised
Name.Decorator: "#888", # class: 'nd' - to be revised
Name.Entity: "#ce5c00", # class: 'ni'
Name.Exception: "bold #cc0000", # class: 'ne'
Name.Function: "#000000", # class: 'nf'
Name.Property: "#000000", # class: 'py'
Name.Label: "#f57900", # class: 'nl'
Name.Namespace: "#000000", # class: 'nn' - to be revised
Name.Other: "#000000", # class: 'nx'
Name.Tag: "bold #004461", # class: 'nt' - like a keyword
Name.Variable: "#000000", # class: 'nv' - to be revised
Name.Variable.Class: "#000000", # class: 'vc' - to be revised
Name.Variable.Global: "#000000", # class: 'vg' - to be revised
Name.Variable.Instance: "#000000", # class: 'vi' - to be revised
Number: "#990000", # class: 'm'
Literal: "#000000", # class: 'l'
Literal.Date: "#000000", # class: 'ld'
String: "#4e9a06", # class: 's'
String.Backtick: "#4e9a06", # class: 'sb'
String.Char: "#4e9a06", # class: 'sc'
String.Doc: "italic #8f5902", # class: 'sd' - like a comment
String.Double: "#4e9a06", # class: 's2'
String.Escape: "#4e9a06", # class: 'se'
String.Heredoc: "#4e9a06", # class: 'sh'
String.Interpol: "#4e9a06", # class: 'si'
String.Other: "#4e9a06", # class: 'sx'
String.Regex: "#4e9a06", # class: 'sr'
String.Single: "#4e9a06", # class: 's1'
String.Symbol: "#4e9a06", # class: 'ss'
Generic: "#000000", # class: 'g'
Generic.Deleted: "#a40000", # class: 'gd'
Generic.Emph: "italic #000000", # class: 'ge'
Generic.Error: "#ef2929", # class: 'gr'
Generic.Heading: "bold #000080", # class: 'gh'
Generic.Inserted: "#00A000", # class: 'gi'
Generic.Output: "#888", # class: 'go'
Generic.Prompt: "#745334", # class: 'gp'
Generic.Strong: "bold #000000", # class: 'gs'
Generic.Subheading: "bold #800080", # class: 'gu'
Generic.Traceback: "bold #a40000", # class: 'gt'
}
================================================
FILE: docs/_themes/kr/layout.html
================================================
{%- extends "basic/layout.html" %} {%- block extrahead %} {{ super() }} {% if
theme_touch_icon %}
<link
rel="apple-touch-icon"
href="{{ pathto('_static/' ~ theme_touch_icon, 1) }}"
/>
{% endif %}
<meta
name="viewport"
content="width=device-width, initial-scale=0.9, maximum-scale=0.9"
/>
{% endblock %} {%- block relbar2 %}{% endblock %} {%- block footer %}
<div class="footer">© Copyright {{ copyright }}.</div>
<a href="https://github.com/sloria/textblob" class="github">
<img
style="position: absolute; top: 0; right: 0; border: 0"
src="https://github.blog/wp-content/uploads/2008/12/forkme_right_darkblue_121621.png"
alt="Fork me on GitHub"
class="github"
/>
</a>
{%- endblock %}
================================================
FILE: docs/_themes/kr/relations.html
================================================
<h3>Related Topics</h3>
<ul>
<li><a href="{{ pathto(master_doc) }}">Documentation overview</a><ul>
{%- for parent in parents %}
<li><a href="{{ parent.link|e }}">{{ parent.title }}</a><ul>
{%- endfor %}
{%- if prev %}
<li>Previous: <a href="{{ prev.link|e }}" title="{{ _('previous chapter')
}}">{{ prev.title }}</a></li>
{%- endif %}
{%- if next %}
<li>Next: <a href="{{ next.link|e }}" title="{{ _('next chapter')
}}">{{ next.title }}</a></li>
{%- endif %}
{%- for parent in parents %}
</ul></li>
{%- endfor %}
</ul></li>
</ul>
================================================
FILE: docs/_themes/kr/static/flasky.css_t
================================================
/*
* flasky.css_t
* ~~~~~~~~~~~~
*
* :copyright: Copyright 2010 by Armin Ronacher. Modifications by Kenneth Reitz.
* :license: Flask Design License, see LICENSE for details.
*/
{% set page_width = '940px' %}
{% set sidebar_width = '220px' %}
@import url("basic.css");
/* -- page layout ----------------------------------------------------------- */
body {
font-family: 'goudy old style', 'minion pro', 'bell mt', Georgia, 'Hiragino Mincho Pro';
font-size: 17px;
background-color: white;
color: #000;
margin: 0;
padding: 0;
}
div.document {
width: {{ page_width }};
margin: 30px auto 0 auto;
}
div.documentwrapper {
float: left;
width: 100%;
}
div.bodywrapper {
margin: 0 0 0 {{ sidebar_width }};
}
div.sphinxsidebar {
width: {{ sidebar_width }};
}
hr {
border: 1px solid #B1B4B6;
}
div.body {
background-color: #ffffff;
color: #3E4349;
padding: 0 30px 0 30px;
}
img.floatingflask {
padding: 0 0 10px 10px;
float: right;
}
div.footer {
width: {{ page_width }};
margin: 20px auto 30px auto;
font-size: 14px;
color: #888;
text-align: right;
}
div.footer a {
color: #888;
}
div.related {
display: none;
}
div.sphinxsidebar a {
color: #444;
text-decoration: none;
border-bottom: 1px dotted #999;
}
div.sphinxsidebar a:hover {
border-bottom: 1px solid #999;
}
div.sphinxsidebar {
font-size: 14px;
line-height: 1.5;
}
div.sphinxsidebarwrapper {
padding: 18px 10px;
}
div.sphinxsidebarwrapper p.logo {
padding: 0;
margin: -10px 0 0 -20px;
text-align: center;
}
div.sphinxsidebar h3,
div.sphinxsidebar h4 {
font-family: 'Garamond', 'Georgia', serif;
color: #444;
font-size: 24px;
font-weight: normal;
margin: 0 0 5px 0;
padding: 0;
}
div.sphinxsidebar h4 {
font-size: 20px;
}
div.sphinxsidebar h3 a {
color: #444;
}
div.sphinxsidebar p.logo a,
div.sphinxsidebar h3 a,
div.sphinxsidebar p.logo a:hover,
div.sphinxsidebar h3 a:hover {
border: none;
}
div.sphinxsidebar p {
color: #555;
margin: 10px 0;
}
div.sphinxsidebar ul {
margin: 10px 0;
padding: 0;
color: #000;
}
div.sphinxsidebar input {
border: 1px solid #ccc;
font-family: 'Georgia', serif;
font-size: 1em;
}
/* -- body styles ----------------------------------------------------------- */
a {
color: #004B6B;
text-decoration: underline;
}
a:hover {
color: #6D4100;
text-decoration: underline;
}
div.body h1,
div.body h2,
div.body h3,
div.body h4,
div.body h5,
div.body h6 {
font-family: 'Garamond', 'Georgia', serif;
font-weight: normal;
margin: 30px 0px 10px 0px;
padding: 0;
}
div.body h1 { margin-top: 0; padding-top: 0; font-size: 240%; }
div.body h2 { font-size: 180%; }
div.body h3 { font-size: 150%; }
div.body h4 { font-size: 130%; }
div.body h5 { font-size: 100%; }
div.body h6 { font-size: 100%; }
a.headerlink {
color: #ddd;
padding: 0 4px;
text-decoration: none;
}
a.headerlink:hover {
color: #444;
background: #eaeaea;
}
div.body p, div.body dd, div.body li {
line-height: 1.4em;
}
div.admonition {
background: #fafafa;
margin: 20px -30px;
padding: 10px 30px;
border-top: 1px solid #ccc;
border-bottom: 1px solid #ccc;
}
div.admonition tt.xref, div.admonition a tt {
border-bottom: 1px solid #fafafa;
}
dd div.admonition {
margin-left: -60px;
padding-left: 60px;
}
div.admonition p.admonition-title {
font-family: 'Garamond', 'Georgia', serif;
font-weight: normal;
font-size: 24px;
margin: 0 0 10px 0;
padding: 0;
line-height: 1;
}
div.admonition p.last {
margin-bottom: 0;
}
div.highlight {
background-color: white;
}
dt:target, .highlight {
background: #FAF3E8;
}
div.note {
background-color: #eee;
border: 1px solid #ccc;
}
div.seealso {
background-color: #ffc;
border: 1px solid #ff6;
}
div.topic {
background-color: #eee;
}
p.admonition-title {
display: inline;
}
p.admonition-title:after {
content: ":";
}
pre, tt {
font-family: 'Consolas', 'Menlo', 'Deja Vu Sans Mono', 'Bitstream Vera Sans Mono', monospace;
font-size: 0.9em;
}
img.screenshot {
}
tt.descname, tt.descclassname {
font-size: 0.95em;
}
tt.descname {
padding-right: 0.08em;
}
img.screenshot {
-moz-box-shadow: 2px 2px 4px #eee;
-webkit-box-shadow: 2px 2px 4px #eee;
box-shadow: 2px 2px 4px #eee;
}
table.docutils {
border: 1px solid #888;
-moz-box-shadow: 2px 2px 4px #eee;
-webkit-box-shadow: 2px 2px 4px #eee;
box-shadow: 2px 2px 4px #eee;
}
table.docutils td, table.docutils th {
border: 1px solid #888;
padding: 0.25em 0.7em;
}
table.field-list, table.footnote {
border: none;
-moz-box-shadow: none;
-webkit-box-shadow: none;
box-shadow: none;
}
table.footnote {
margin: 15px 0;
width: 100%;
border: 1px solid #eee;
background: #fdfdfd;
font-size: 0.9em;
}
table.footnote + table.footnote {
margin-top: -15px;
border-top: none;
}
table.field-list th {
padding: 0 0.8em 0 0;
}
table.field-list td {
padding: 0;
}
table.footnote td.label {
width: 0px;
padding: 0.3em 0 0.3em 0.5em;
}
table.footnote td {
padding: 0.3em 0.5em;
}
dl {
margin: 0;
padding: 0;
}
dl dd {
margin-left: 30px;
}
blockquote {
margin: 0 0 0 30px;
padding: 0;
}
ul, ol {
margin: 10px 0 10px 30px;
padding: 0;
}
pre {
background: #eee;
padding: 7px 30px;
margin: 15px -30px;
line-height: 1.3em;
}
dl pre, blockquote pre, li pre {
margin-left: -60px;
padding-left: 60px;
}
dl dl pre {
margin-left: -90px;
padding-left: 90px;
}
tt {
background-color: #ecf0f3;
color: #222;
/* padding: 1px 2px; */
}
tt.xref, a tt {
background-color: #FBFBFB;
border-bottom: 1px solid white;
}
a.reference {
text-decoration: none;
border-bottom: 1px dotted #004B6B;
}
a.reference:hover {
border-bottom: 1px solid #6D4100;
}
a.footnote-reference {
text-decoration: none;
font-size: 0.7em;
vertical-align: top;
border-bottom: 1px dotted #004B6B;
}
a.footnote-reference:hover {
border-bottom: 1px solid #6D4100;
}
a:hover tt {
background: #EEE;
}
@media screen and (max-width: 870px) {
div.sphinxsidebar {
display: none;
}
div.document {
width: 100%;
}
div.documentwrapper {
margin-left: 0;
margin-top: 0;
margin-right: 0;
margin-bottom: 0;
}
div.bodywrapper {
margin-top: 0;
margin-right: 0;
margin-bottom: 0;
margin-left: 0;
}
ul {
margin-left: 0;
}
.document {
width: auto;
}
.footer {
width: auto;
}
.bodywrapper {
margin: 0;
}
.footer {
width: auto;
}
.github {
display: none;
}
}
@media screen and (max-width: 875px) {
body {
margin: 0;
padding: 20px 30px;
}
div.documentwrapper {
float: none;
background: white;
}
div.sphinxsidebar {
display: block;
float: none;
width: 102.5%;
margin: 50px -30px -20px -30px;
padding: 10px 20px;
background: #333;
color: white;
}
div.sphinxsidebar h3, div.sphinxsidebar h4, div.sphinxsidebar p,
div.sphinxsidebar h3 a {
color: white;
}
div.sphinxsidebar a {
color: #aaa;
}
div.sphinxsidebar p.logo {
display: none;
}
div.document {
width: 100%;
margin: 0;
}
div.related {
display: block;
margin: 0;
padding: 10px 0 20px 0;
}
div.related ul,
div.related ul li {
margin: 0;
padding: 0;
}
div.footer {
display: none;
}
div.bodywrapper {
margin: 0;
}
div.body {
min-height: 0;
padding: 0;
}
.rtd_doc_footer {
display: none;
}
.document {
width: auto;
}
.footer {
width: auto;
}
.footer {
width: auto;
}
.github {
display: none;
}
}
/* misc. */
.revsys-inline {
display: none!important;
}
div.sphinxsidebar a.flattr-button {
text-decoration: none;
border-bottom: none;
}
================================================
FILE: docs/_themes/kr/static/small_flask.css
================================================
/*
* small_flask.css_t
* ~~~~~~~~~~~~~~~~~
*
* :copyright: Copyright 2010 by Armin Ronacher.
* :license: Flask Design License, see LICENSE for details.
*/
body {
margin: 0;
padding: 20px 30px;
}
div.documentwrapper {
float: none;
background: white;
}
div.sphinxsidebar {
display: block;
float: none;
width: 102.5%;
margin: 50px -30px -20px -30px;
padding: 10px 20px;
background: #333;
color: white;
}
div.sphinxsidebar h3, div.sphinxsidebar h4, div.sphinxsidebar p,
div.sphinxsidebar h3 a {
color: white;
}
div.sphinxsidebar a {
color: #aaa;
}
div.sphinxsidebar p.logo {
display: none;
}
div.document {
width: 100%;
margin: 0;
}
div.related {
display: block;
margin: 0;
padding: 10px 0 20px 0;
}
div.related ul,
div.related ul li {
margin: 0;
padding: 0;
}
div.footer {
display: none;
}
div.bodywrapper {
margin: 0;
}
div.body {
min-height: 0;
padding: 0;
}
.rtd_doc_footer {
display: none;
}
.document {
width: auto;
}
.footer {
width: auto;
}
.footer {
width: auto;
}
.github {
display: none;
}
img {
border: 0px 0px;
}
================================================
FILE: docs/_themes/kr/theme.conf
================================================
[theme]
inherit = basic
stylesheet = flasky.css
pygments_style = flask_theme_support.FlaskyStyle
[options]
touch_icon =
================================================
FILE: docs/_themes/kr_small/layout.html
================================================
{% extends "basic/layout.html" %} {% block header %} {{ super() }} {% if
pagename == 'index' %}
<div class="indexwrapper">
{% endif %} {% endblock %} {% block footer %} {% if pagename == 'index' %}
</div>
{% endif %} {% endblock %} {# do not display relbars #} {% block relbar1 %}{%
endblock %} {% block relbar2 %} {% if theme_github_fork %}
<a href="http://github.com/{{ theme_github_fork }}"
><img
style="position: fixed; top: 0; right: 0; border: 0"
src="https://github.blog/wp-content/uploads/2008/12/forkme_right_darkblue_121621.png"
alt="Fork me on GitHub"
/></a>
{% endif %} {% endblock %} {% block sidebar1 %}{% endblock %} {% block sidebar2
%}{% endblock %}
================================================
FILE: docs/_themes/kr_small/static/flasky.css_t
================================================
/*
* flasky.css_t
* ~~~~~~~~~~~~
*
* Sphinx stylesheet -- flasky theme based on nature theme.
*
* :copyright: Copyright 2007-2010 by the Sphinx team, see AUTHORS.
* :license: BSD, see LICENSE for details.
*
*/
@import url("basic.css");
/* -- page layout ----------------------------------------------------------- */
body {
font-family: 'Georgia', serif;
font-size: 17px;
color: #000;
background: white;
margin: 0;
padding: 0;
}
div.documentwrapper {
float: left;
width: 100%;
}
div.bodywrapper {
margin: 40px auto 0 auto;
width: 700px;
}
hr {
border: 1px solid #B1B4B6;
}
div.body {
background-color: #ffffff;
color: #3E4349;
padding: 0 30px 30px 30px;
}
img.floatingflask {
padding: 0 0 10px 10px;
float: right;
}
div.footer {
text-align: right;
color: #888;
padding: 10px;
font-size: 14px;
width: 650px;
margin: 0 auto 40px auto;
}
div.footer a {
color: #888;
text-decoration: underline;
}
div.related {
line-height: 32px;
color: #888;
}
div.related ul {
padding: 0 0 0 10px;
}
div.related a {
color: #444;
}
/* -- body styles ----------------------------------------------------------- */
a {
color: #004B6B;
text-decoration: underline;
}
a:hover {
color: #6D4100;
text-decoration: underline;
}
div.body {
padding-bottom: 40px; /* saved for footer */
}
div.body h1,
div.body h2,
div.body h3,
div.body h4,
div.body h5,
div.body h6 {
font-family: 'Garamond', 'Georgia', serif;
font-weight: normal;
margin: 30px 0px 10px 0px;
padding: 0;
}
{% if theme_index_logo %}
div.indexwrapper h1 {
text-indent: -999999px;
background: url({{ theme_index_logo }}) no-repeat center center;
height: {{ theme_index_logo_height }};
}
{% endif %}
div.body h2 { font-size: 180%; }
div.body h3 { font-size: 150%; }
div.body h4 { font-size: 130%; }
div.body h5 { font-size: 100%; }
div.body h6 { font-size: 100%; }
a.headerlink {
color: white;
padding: 0 4px;
text-decoration: none;
}
a.headerlink:hover {
color: #444;
background: #eaeaea;
}
div.body p, div.body dd, div.body li {
line-height: 1.4em;
}
div.admonition {
background: #fafafa;
margin: 20px -30px;
padding: 10px 30px;
border-top: 1px solid #ccc;
border-bottom: 1px solid #ccc;
}
div.admonition p.admonition-title {
font-family: 'Garamond', 'Georgia', serif;
font-weight: normal;
font-size: 24px;
margin: 0 0 10px 0;
padding: 0;
line-height: 1;
}
div.admonition p.last {
margin-bottom: 0;
}
div.highlight{
background-color: white;
}
dt:target, .highlight {
background: #FAF3E8;
}
div.note {
background-color: #eee;
border: 1px solid #ccc;
}
div.seealso {
background-color: #ffc;
border: 1px solid #ff6;
}
div.topic {
background-color: #eee;
}
div.warning {
background-color: #ffe4e4;
border: 1px solid #f66;
}
p.admonition-title {
display: inline;
}
p.admonition-title:after {
content: ":";
}
pre, tt {
font-family: 'Consolas', 'Menlo', 'Deja Vu Sans Mono', 'Bitstream Vera Sans Mono', monospace;
font-size: 0.85em;
}
img.screenshot {
}
tt.descname, tt.descclassname {
font-size: 0.95em;
}
tt.descname {
padding-right: 0.08em;
}
img.screenshot {
-moz-box-shadow: 2px 2px 4px #eee;
-webkit-box-shadow: 2px 2px 4px #eee;
box-shadow: 2px 2px 4px #eee;
}
table.docutils {
border: 1px solid #888;
-moz-box-shadow: 2px 2px 4px #eee;
-webkit-box-shadow: 2px 2px 4px #eee;
box-shadow: 2px 2px 4px #eee;
}
table.docutils td, table.docutils th {
border: 1px solid #888;
padding: 0.25em 0.7em;
}
table.field-list, table.footnote {
border: none;
-moz-box-shadow: none;
-webkit-box-shadow: none;
box-shadow: none;
}
table.footnote {
margin: 15px 0;
width: 100%;
border: 1px solid #eee;
}
table.field-list th {
padding: 0 0.8em 0 0;
}
table.field-list td {
padding: 0;
}
table.footnote td {
padding: 0.5em;
}
dl {
margin: 0;
padding: 0;
}
dl dd {
margin-left: 30px;
}
pre {
padding: 0;
margin: 15px -30px;
padding: 8px;
line-height: 1.3em;
padding: 7px 30px;
background: #eee;
border-radius: 2px;
-moz-border-radius: 2px;
-webkit-border-radius: 2px;
}
dl pre {
margin-left: -60px;
padding-left: 60px;
}
tt {
background-color: #ecf0f3;
color: #222;
/* padding: 1px 2px; */
}
tt.xref, a tt {
background-color: #FBFBFB;
}
a:hover tt {
background: #EEE;
}
================================================
FILE: docs/_themes/kr_small/theme.conf
================================================
[theme]
inherit = basic
stylesheet = flasky.css
nosidebar = true
pygments_style = flask_theme_support.FlaskyStyle
[options]
index_logo = ''
index_logo_height = 120px
github_fork = ''
================================================
FILE: docs/advanced_usage.rst
================================================
.. _advanced:
Advanced Usage: Overriding Models and the Blobber Class
=======================================================
TextBlob allows you to specify which algorithms you want to use under the hood of its simple API.
Sentiment Analyzers
-------------------
New in version `0.5.0`.
The ``textblob.sentiments`` module contains two sentiment analysis implementations, ``PatternAnalyzer`` (based on the pattern_ library) and ``NaiveBayesAnalyzer`` (an NLTK_ classifier trained on a movie reviews corpus).
The default implementation is ``PatternAnalyzer``, but you can override the analyzer by passing another implementation into a TextBlob's constructor.
For instance, the ``NaiveBayesAnalyzer`` returns its result as a namedtuple of the form: ``Sentiment(classification, p_pos, p_neg)``.
::
>>> from textblob import TextBlob
>>> from textblob.sentiments import NaiveBayesAnalyzer
>>> blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
>>> blob.sentiment
Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)
Tokenizers
----------
New in version `0.4.0`.
The ``words`` and ``sentences`` properties are helpers that use the ``textblob.tokenizers.WordTokenizer`` and ``textblob.tokenizers.SentenceTokenizer`` classes, respectively.
You can use other tokenizers, such as those provided by NLTK, by passing them into the ``TextBlob`` constructor then accessing the ``tokens`` property.
.. doctest::
>>> from textblob import TextBlob
>>> from nltk.tokenize import TabTokenizer
>>> tokenizer = TabTokenizer()
>>> blob = TextBlob("This is\ta rather tabby\tblob.", tokenizer=tokenizer)
>>> blob.tokens
WordList(['This is', 'a rather tabby', 'blob.'])
You can also use the ``tokenize([tokenizer])`` method.
.. doctest::
>>> from textblob import TextBlob
>>> from nltk.tokenize import BlanklineTokenizer
>>> tokenizer = BlanklineTokenizer()
>>> blob = TextBlob("A token\n\nof appreciation")
>>> blob.tokenize(tokenizer)
WordList(['A token', 'of appreciation'])
Noun Phrase Chunkers
--------------------
TextBlob currently has two noun phrases chunker implementations,
``textblob.np_extractors.FastNPExtractor`` (default, based on Shlomi Babluki's implementation from
`this blog post <http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/>`_)
and ``textblob.np_extractors.ConllExtractor``, which uses the CoNLL 2000 corpus to train a tagger.
You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase extractor to a TextBlob's constructor.
.. doctest::
>>> from textblob import TextBlob
>>> from textblob.np_extractors import ConllExtractor
>>> extractor = ConllExtractor()
>>> blob = TextBlob("Python is a high-level programming language.", np_extractor=extractor)
>>> blob.noun_phrases
WordList(['python', 'high-level programming language'])
POS Taggers
-----------
TextBlob currently has two POS tagger implementations, located in ``textblob.taggers``. The default is the ``PatternTagger`` which uses the same implementation as the pattern_ library.
The second implementation is ``NLTKTagger`` which uses NLTK_'s TreeBank tagger. *Numpy is required to use the NLTKTagger*.
Similar to the tokenizers and noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a tagger instance to the constructor.
::
>>> from textblob import TextBlob
>>> from textblob.taggers import NLTKTagger
>>> nltk_tagger = NLTKTagger()
>>> blob = TextBlob("Tag! You're It!", pos_tagger=nltk_tagger)
>>> blob.pos_tags
[(Word('Tag'), u'NN'), (Word('You'), u'PRP'), (Word('''), u'VBZ'), (Word('re'), u'NN'), (Word('It')
, u'PRP')]
.. _pattern: http://www.clips.ua.ac.be/pattern
.. _NLTK: http://nltk.org/
Parsers
-------
New in version `0.6.0`.
Parser implementations can also be passed to the TextBlob constructor.
::
>>> from textblob import TextBlob
>>> from textblob.parsers import PatternParser
>>> blob = TextBlob("Parsing is fun.", parser=PatternParser())
>>> blob.parse()
'Parsing/VBG/B-VP/O is/VBZ/I-VP/O fun/VBG/I-VP/O ././O/O'
Blobber: A TextBlob Factory
---------------------------
New in `0.4.0`.
It can be tedious to repeatedly pass taggers, NP extractors, sentiment analyzers, classifiers, and tokenizers to multiple TextBlobs. To keep your code `DRY <https://en.wikipedia.org/wiki/DRY_principle>`_, you can use the ``Blobber`` class to create TextBlobs that share the same models.
First, instantiate a ``Blobber`` with the tagger, NP extractor, sentiment analyzer, classifier, and/or tokenizer of your choice.
.. doctest::
>>> from textblob import Blobber
>>> from textblob.taggers import NLTKTagger
>>> tb = Blobber(pos_tagger=NLTKTagger())
You can now create new TextBlobs like so:
.. doctest::
>>> blob1 = tb("This is a blob.")
>>> blob2 = tb("This is another blob.")
>>> blob1.pos_tagger is blob2.pos_tagger
True
================================================
FILE: docs/api_reference.rst
================================================
.. _api:
API Reference
=============
Blob Classes
------------
.. automodule:: textblob.blob
:members:
:inherited-members:
.. _api_base_classes:
Base Classes
------------
.. automodule:: textblob.base
:members:
Tokenizers
----------
.. automodule:: textblob.tokenizers
:members:
:inherited-members:
POS Taggers
-----------
.. automodule:: textblob.en.taggers
:members:
:inherited-members:
Noun Phrase Extractors
----------------------
.. automodule:: textblob.en.np_extractors
:members: BaseNPExtractor, ConllExtractor, FastNPExtractor
:inherited-members:
Sentiment Analyzers
-------------------
.. automodule:: textblob.en.sentiments
:members:
:inherited-members:
Parsers
-------
.. automodule:: textblob.en.parsers
:members:
:inherited-members:
.. _api_classifiers:
Classifiers
-----------
.. automodule:: textblob.classifiers
:members:
:inherited-members:
Blobber
-------
.. autoclass:: textblob.blob.Blobber
:members:
:special-members:
:exclude-members: __weakref__
File Formats
------------
.. automodule:: textblob.formats
:members:
:inherited-members:
Wordnet
-------
.. automodule:: textblob.wordnet
:members:
Exceptions
----------
.. module:: textblob.exceptions
.. autoexception:: textblob.exceptions.TextBlobError
.. autoexception:: textblob.exceptions.MissingCorpusError
.. autoexception:: textblob.exceptions.DeprecationError
.. autoexception:: textblob.exceptions.TranslatorError
.. autoexception:: textblob.exceptions.NotTranslated
.. autoexception:: textblob.exceptions.FormatError
================================================
FILE: docs/authors.rst
================================================
.. include:: ../AUTHORS.rst
================================================
FILE: docs/changelog.rst
================================================
.. _changelog:
.. include:: ../CHANGELOG.rst
================================================
FILE: docs/classifiers.rst
================================================
.. _classifiers:
Tutorial: Building a Text Classification System
***********************************************
The ``textblob.classifiers`` module makes it simple to create custom classifiers.
As an example, let's create a custom sentiment analyzer.
Loading Data and Creating a Classifier
======================================
First we'll create some training and test data.
.. doctest::
>>> train = [
... ("I love this sandwich.", "pos"),
... ("this is an amazing place!", "pos"),
... ("I feel very good about these beers.", "pos"),
... ("this is my best work.", "pos"),
... ("what an awesome view", "pos"),
... ("I do not like this restaurant", "neg"),
... ("I am tired of this stuff.", "neg"),
... ("I can't deal with this", "neg"),
... ("he is my sworn enemy!", "neg"),
... ("my boss is horrible.", "neg"),
... ]
>>> test = [
... ("the beer was good.", "pos"),
... ("I do not enjoy my job", "neg"),
... ("I ain't feeling dandy today.", "neg"),
... ("I feel amazing!", "pos"),
... ("Gary is a friend of mine.", "pos"),
... ("I can't believe I'm doing this.", "neg"),
... ]
Now we'll create a Naive Bayes classifier, passing the training data into the constructor.
.. doctest::
>>> from textblob.classifiers import NaiveBayesClassifier
>>> cl = NaiveBayesClassifier(train)
.. _data_files:
Loading Data from Files
-----------------------
You can also load data from common file formats including CSV, JSON, and TSV.
CSV files should be formatted like so:
::
I love this sandwich.,pos
This is an amazing place!,pos
I do not like this restaurant,neg
JSON files should be formatted like so:
::
[
{"text": "I love this sandwich.", "label": "pos"},
{"text": "This is an amazing place!", "label": "pos"},
{"text": "I do not like this restaurant", "label": "neg"}
]
You can then pass the opened file into the constructor.
::
>>> with open('train.json', 'r') as fp:
... cl = NaiveBayesClassifier(fp, format="json")
Classifying Text
================
Call the ``classify(text)`` method to use the classifier.
.. doctest::
>>> cl.classify("This is an amazing library!")
'pos'
You can get the label probability distribution with the ``prob_classify(text)`` method.
.. doctest::
>>> prob_dist = cl.prob_classify("This one's a doozy.")
>>> prob_dist.max()
'pos'
>>> round(prob_dist.prob("pos"), 2)
0.63
>>> round(prob_dist.prob("neg"), 2)
0.37
Classifying TextBlobs
=====================
Another way to classify text is to pass a classifier into the constructor of ``TextBlob`` and call its ``classify()`` method.
.. doctest::
>>> from textblob import TextBlob
>>> blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
>>> blob.classify()
'pos'
The advantage of this approach is that you can classify sentences within a ``TextBlob``.
.. doctest::
>>> for s in blob.sentences:
... print(s)
... print(s.classify())
...
The beer is good.
pos
But the hangover is horrible.
neg
Evaluating Classifiers
======================
To compute the accuracy on our test set, use the ``accuracy(test_data)`` method.
.. doctest::
>>> cl.accuracy(test)
0.8333333333333334
.. note::
You can also pass in a file object into the ``accuracy`` method. The file can be in any of the formats listed in the :ref:`Loading Data <data_files>` section.
Use the ``show_informative_features()`` method to display a listing of the most informative features.
.. doctest::
>>> cl.show_informative_features(5) # doctest: +SKIP
Most Informative Features
contains(my) = True neg : pos = 1.7 : 1.0
contains(an) = False neg : pos = 1.6 : 1.0
contains(I) = True neg : pos = 1.4 : 1.0
contains(I) = False pos : neg = 1.4 : 1.0
contains(my) = False pos : neg = 1.3 : 1.0
Updating Classifiers with New Data
==================================
Use the ``update(new_data)`` method to update a classifier with new training data.
.. doctest::
>>> new_data = [
... ("She is my best friend.", "pos"),
... ("I'm happy to have a new friend.", "pos"),
... ("Stay thirsty, my friend.", "pos"),
... ("He ain't from around here.", "neg"),
... ]
>>> cl.update(new_data)
True
>>> cl.accuracy(test)
1.0
Feature Extractors
==================
By default, the ``NaiveBayesClassifier`` uses a simple feature extractor that indicates which words in the training set are contained in a document.
For example, the sentence *"I feel happy"* might have the features ``contains(happy): True`` or ``contains(angry): False``.
You can override this feature extractor by writing your own. A feature extractor is simply a function with ``document`` (the text to extract features from) as the first argument. The function may include a second argument, ``train_set`` (the training dataset), if necessary.
The function should return a dictionary of features for ``document``.
For example, let's create a feature extractor that just uses the first and last words of a document as its features.
.. doctest::
>>> def end_word_extractor(document):
... tokens = document.split()
... first_word, last_word = tokens[0], tokens[-1]
... feats = {}
... feats["first({0})".format(first_word)] = True
... feats["last({0})".format(last_word)] = False
... return feats
...
>>> features = end_word_extractor("I feel happy")
>>> assert features == {"last(happy)": False, "first(I)": True}
We can then use the feature extractor in a classifier by passing it as the second argument of the constructor.
.. doctest::
>>> cl2 = NaiveBayesClassifier(test, feature_extractor=end_word_extractor)
>>> blob = TextBlob("I'm excited to try my new classifier.", classifier=cl2)
>>> blob.classify()
'pos'
Next Steps
==========
Be sure to check out the :ref:`API Reference <api_classifiers>` for the :ref:`classifiers module <api_classifiers>`.
Want to try different POS taggers or noun phrase chunkers with TextBlobs? Check out the :ref:`Advanced Usage <advanced>` guide.
================================================
FILE: docs/conf.py
================================================
import importlib.metadata
import os
import sys
sys.path.append(os.path.abspath("_themes"))
# -- General configuration -----------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.doctest",
"sphinx.ext.viewcode",
"sphinx_issues",
]
primary_domain = "py"
default_role = "py:obj"
issues_github_path = "sloria/TextBlob"
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The suffix of source filenames.
source_suffix = ".rst"
# The master toctree document.
master_doc = "index"
# General information about the project.
project = "TextBlob"
copyright = '<a href="http://stevenloria.com/">Steven Loria</a> and contributors'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = release = importlib.metadata.version("textblob")
exclude_patterns = ["_build"]
pygments_style = "flask_theme_support.FlaskyStyle"
html_theme = "kr"
html_theme_path = ["_themes"]
html_static_path = ["_static"]
# Custom sidebar templates, maps document names to template names.
html_sidebars = {
"index": ["side-primary.html", "searchbox.html"],
"**": ["side-secondary.html", "localtoc.html", "relations.html", "searchbox.html"],
}
# Output file base name for HTML help builder.
htmlhelp_basename = "textblobdoc"
# -- Options for LaTeX output --------------------------------------------------
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass [howto/manual]).
latex_documents = [
("index", "TextBlob.tex", "textblob Documentation", "Steven Loria", "manual"),
]
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [("index", "textblob", "textblob Documentation", ["Steven Loria"], 1)]
# -- Options for Texinfo output ------------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(
"index",
"textblob",
"TextBlob Documentation",
"Steven Loria",
"textblob",
"Simplified Python text-processing.",
"Natural Language Processing",
),
]
================================================
FILE: docs/contributing.rst
================================================
.. include:: ../CONTRIBUTING.rst
================================================
FILE: docs/extensions.rst
================================================
.. _extensions:
**********
Extensions
**********
TextBlob supports adding custom models and new languages through "extensions".
Extensions can be installed from the PyPI. ::
$ pip install textblob-name
where "name" is the name of the package.
Available extensions
====================
Languages
---------
* `textblob-fr <https://github.com/sloria/textblob-fr>`_: French
* `textblob-de <https://github.com/markuskiller/textblob-de>`_: German
Part-of-speech Taggers
----------------------
* `textblob-aptagger <https://github.com/sloria/textblob-aptagger>`_: A fast and accurate tagger based on the Averaged Perceptron.
.. admonition:: Interested in creating an extension?
See the :ref:`Contributing guide <extension-development>`.
================================================
FILE: docs/index.rst
================================================
.. textblob documentation master file, created by
sphinx-quickstart on Mon Aug 5 01:41:33 2013.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
TextBlob: Simplified Text Processing
====================================
Release v\ |version|. (:ref:`Changelog`)
*TextBlob* is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more.
.. code-block:: python
from textblob import TextBlob
text = """
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
"""
blob = TextBlob(text)
blob.tags # [('The', 'DT'), ('titular', 'JJ'),
# ('threat', 'NN'), ('of', 'IN'), ...]
blob.noun_phrases # WordList(['titular threat', 'blob',
# 'ultimate movie monster',
# 'amoeba-like mass', ...])
for sentence in blob.sentences:
print(sentence.sentiment.polarity)
# 0.060
# -0.341
TextBlob stands on the giant shoulders of `NLTK`_ and `pattern`_, and plays nicely with both.
Features
--------
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- `n`-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration
Get it now
----------
::
$ pip install -U textblob
$ python -m textblob.download_corpora
Ready to dive in? Go on to the :ref:`Quickstart guide <quickstart>`.
Guide
=====
.. toctree::
:maxdepth: 2
license
install
quickstart
classifiers
advanced_usage
extensions
api_reference
Project info
============
.. toctree::
:maxdepth: 1
changelog
authors
contributing
.. _NLTK: http://www.nltk.org
.. _pattern: https://github.com/clips/pattern
================================================
FILE: docs/install.rst
================================================
.. _install:
Installation
============
Installing/Upgrading From the PyPI
----------------------------------
::
$ pip install -U textblob
$ python -m textblob.download_corpora
This will install TextBlob and download the necessary NLTK corpora. If you need to change the default download directory set the ``NLTK_DATA`` environment variable.
.. admonition:: Downloading the minimum corpora
If you only intend to use TextBlob's default models (no model overrides), you can pass the ``lite`` argument. This downloads only those corpora needed for basic functionality.
::
$ python -m textblob.download_corpora lite
With conda
----------
TextBlob is also available as a `conda <http://conda.pydata.org/>`_ package. To install with ``conda``, run ::
$ conda install -c conda-forge textblob
$ python -m textblob.download_corpora
From Source
-----------
TextBlob is actively developed on Github_.
You can clone the public repo: ::
$ git clone https://github.com/sloria/TextBlob.git
Or download one of the following:
* tarball_
* zipball_
Once you have the source, you can install it into your site-packages with ::
$ python setup.py install
.. _Github: https://github.com/sloria/TextBlob
.. _tarball: https://github.com/sloria/TextBlob/tarball/master
.. _zipball: https://github.com/sloria/TextBlob/zipball/master
Get the bleeding edge version
-----------------------------
To get the latest development version of TextBlob, run
::
$ pip install -U git+https://github.com/sloria/TextBlob.git@dev
Migrating from older versions (<=0.7.1)
---------------------------------------
As of TextBlob 0.8.0, TextBlob's core package was renamed to ``textblob``, whereas earlier versions used a package called ``text``. Therefore, migrating to newer versions should be as simple as rewriting your imports, like so:
New:
::
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger
Old:
::
from text.blob import TextBlob, Word, Blobber
from text.classifiers import NaiveBayesClassifier
from text.taggers import NLTKTagger
Dependencies
++++++++++++
TextBlob depends on NLTK 3. NLTK will be installed automatically when you run ``pip install textblob``.
Some features, such as the maximum entropy classifier, require `numpy`_, but it is not required for basic usage.
.. _numpy: http://www.numpy.org/
.. _NLTK: http://nltk.org/
================================================
FILE: docs/license.rst
================================================
License
=======
.. literalinclude:: ../LICENSE
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set BUILDDIR=_build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% .
set I18NSPHINXOPTS=%SPHINXOPTS% .
if NOT "%PAPER%" == "" (
set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)
if "%1" == "" goto help
if "%1" == "help" (
:help
echo.Please use `make ^<target^>` where ^<target^> is one of
echo. html to make standalone HTML files
echo. dirhtml to make HTML files named index.html in directories
echo. singlehtml to make a single large HTML file
echo. pickle to make pickle files
echo. json to make JSON files
echo. htmlhelp to make HTML files and a HTML help project
echo. qthelp to make HTML files and a qthelp project
echo. devhelp to make HTML files and a Devhelp project
echo. epub to make an epub
echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
echo. text to make text files
echo. man to make manual pages
echo. texinfo to make Texinfo files
echo. gettext to make PO message catalogs
echo. changes to make an overview over all changed/added/deprecated items
echo. xml to make Docutils-native XML files
echo. pseudoxml to make pseudoxml-XML files for display purposes
echo. linkcheck to check all external links for integrity
echo. doctest to run all doctests embedded in the documentation if enabled
goto end
)
if "%1" == "clean" (
for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
del /q /s %BUILDDIR%\*
goto end
)
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
if "%1" == "html" (
%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/html.
goto end
)
if "%1" == "dirhtml" (
%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
goto end
)
if "%1" == "singlehtml" (
%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
goto end
)
if "%1" == "pickle" (
%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the pickle files.
goto end
)
if "%1" == "json" (
%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the JSON files.
goto end
)
if "%1" == "htmlhelp" (
%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
goto end
)
if "%1" == "qthelp" (
%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
echo.^> qcollectiongenerator %BUILDDIR%\qthelp\textblob.qhcp
echo.To view the help file:
echo.^> assistant -collectionFile %BUILDDIR%\qthelp\textblob.ghc
goto end
)
if "%1" == "devhelp" (
%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished.
goto end
)
if "%1" == "epub" (
%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The epub file is in %BUILDDIR%/epub.
goto end
)
if "%1" == "latex" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
if errorlevel 1 exit /b 1
echo.
echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdf" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf
cd %BUILDDIR%/..
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdfja" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf-ja
cd %BUILDDIR%/..
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "text" (
%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The text files are in %BUILDDIR%/text.
goto end
)
if "%1" == "man" (
%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The manual pages are in %BUILDDIR%/man.
goto end
)
if "%1" == "texinfo" (
%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
goto end
)
if "%1" == "gettext" (
%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
goto end
)
if "%1" == "changes" (
%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
if errorlevel 1 exit /b 1
echo.
echo.The overview file is in %BUILDDIR%/changes.
goto end
)
if "%1" == "linkcheck" (
%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
if errorlevel 1 exit /b 1
echo.
echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
goto end
)
if "%1" == "doctest" (
%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
if errorlevel 1 exit /b 1
echo.
echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
goto end
)
if "%1" == "xml" (
%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The XML files are in %BUILDDIR%/xml.
goto end
)
if "%1" == "pseudoxml" (
%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
goto end
)
:end
================================================
FILE: docs/quickstart.rst
================================================
.. _quickstart:
Tutorial: Quickstart
====================
.. module:: textblob.blob
TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat :class:`TextBlob <TextBlob>` objects as if they were Python strings that learned how to do Natural Language Processing.
Create a TextBlob
-----------------
First, the import.
.. doctest::
>>> from textblob import TextBlob
Let's create our first :class:`TextBlob <TextBlob>`.
.. doctest::
>>> wiki = TextBlob("Python is a high-level, general-purpose programming language.")
Part-of-speech Tagging
----------------------
Part-of-speech tags can be accessed through the :meth:`tags <TextBlob.tags>` property.
.. doctest::
>>> wiki.tags
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]
Noun Phrase Extraction
----------------------
Similarly, noun phrases are accessed through the :meth:`noun_phrases <TextBlob.noun_phrases>` property.
.. doctest::
>>> wiki.noun_phrases
WordList(['python'])
Sentiment Analysis
------------------
The :meth:`sentiment <TextBlob.sentiment>` property returns a namedtuple of the form ``Sentiment(polarity, subjectivity)``. The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
.. doctest::
>>> testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
>>> testimonial.sentiment
Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
>>> testimonial.sentiment.polarity
0.39166666666666666
Tokenization
------------
You can break TextBlobs into words or sentences.
.. doctest::
>>> zen = TextBlob(
... "Beautiful is better than ugly. "
... "Explicit is better than implicit. "
... "Simple is better than complex."
... )
>>> zen.words
WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])
>>> zen.sentences
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]
:class:`Sentence <Sentence>` objects have the same properties and methods as TextBlobs.
::
>>> for sentence in zen.sentences:
... print(sentence.sentiment)
For more advanced tokenization, see the :ref:`Advanced Usage <advanced>` guide.
Words Inflection and Lemmatization
----------------------------------
Each word in :meth:`TextBlob.words <TextBlob.words>` or :meth:`Sentence.words <Sentence.words>` is a :class:`Word <Word>`
object (a subclass of ``unicode``) with useful methods, e.g. for word inflection.
.. doctest::
>>> sentence = TextBlob("Use 4 spaces per indentation level.")
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'
Words can be lemmatized by calling the :meth:`lemmatize <Word.lemmatize>` method.
.. doctest::
>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize()
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v") # Pass in WordNet part of speech (verb)
'go'
WordNet Integration
-------------------
You can access the synsets for a :class:`Word <Word>` via the :meth:`synsets <Word.synsets>` property or the :meth:`get_synsets <Word.get_synsets>` method, optionally passing in a part of speech.
.. doctest::
>>> from textblob import Word
>>> from textblob.wordnet import VERB
>>> word = Word("octopus")
>>> word.synsets
[Synset('octopus.n.01'), Synset('octopus.n.02')]
>>> Word("hack").get_synsets(pos=VERB)
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]
You can access the definitions for each synset via the :meth:`definitions <Word.definitions>` property or the :meth:`define() <Word.define>` method, which can also take an optional part-of-speech argument.
.. doctest::
>>> Word("octopus").definitions
['tentacles of octopus prepared as food', 'bottom-living cephalopod having a soft oval body with eight long tentacles']
You can also create synsets directly.
.. doctest::
>>> from textblob.wordnet import Synset
>>> octopus = Synset("octopus.n.02")
>>> shrimp = Synset("shrimp.n.03")
>>> octopus.path_similarity(shrimp)
0.1111111111111111
For more information on the WordNet API, see the NLTK documentation on the `Wordnet Interface <http://www.nltk.org/howto/wordnet.html>`_.
WordLists
---------
A :class:`WordList <textblob.WordList>` is just a Python list with additional methods.
.. doctest::
>>> animals = TextBlob("cat dog octopus")
>>> animals.words
WordList(['cat', 'dog', 'octopus'])
>>> animals.words.pluralize()
WordList(['cats', 'dogs', 'octopodes'])
Spelling Correction
-------------------
Use the :meth:`correct() <TextBlob.correct>` method to attempt spelling correction.
.. doctest::
>>> b = TextBlob("I havv goood speling!")
>>> print(b.correct())
I have good spelling!
:class:`Word <Word>` objects have a :meth:`spellcheck() Word.spellcheck` method that returns a list of ``(word, confidence)`` tuples with spelling suggestions.
.. doctest::
>>> from textblob import Word
>>> w = Word("falibility")
>>> w.spellcheck()
[('fallibility', 1.0)]
Spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector"[#]_ as implemented in the pattern library. It is about 70% accurate [#]_.
Get Word and Noun Phrase Frequencies
------------------------------------
There are two ways to get the frequency of a word or noun phrase in a :class:`TextBlob <TextBlob>`.
The first is through the ``word_counts`` dictionary. ::
>>> monty = TextBlob("We are no longer the Knights who say Ni. "
... "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.word_counts['ekki']
3
If you access the frequencies this way, the search will *not* be case sensitive, and words that are not found will have a frequency of 0.
The second way is to use the ``count()`` method. ::
>>> monty.words.count('ekki')
3
You can specify whether or not the search should be case-sensitive (default is ``False``). ::
>>> monty.words.count('ekki', case_sensitive=True)
2
Each of these methods can also be used with noun phrases. ::
>>> wiki.noun_phrases.count('python')
1
Parsing
-------
Use the :meth:`parse() <TextBlob.parse>` method to parse the text.
.. doctest::
>>> b = TextBlob("And now for something completely different.")
>>> print(b.parse())
And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O
By default, TextBlob uses pattern's parser [#]_.
TextBlobs Are Like Python Strings!
----------------------------------
You can use Python's substring syntax.
.. doctest::
>>> zen[0:19]
TextBlob("Beautiful is better")
You can use common string methods.
.. doctest::
>>> zen.upper()
TextBlob("BEAUTIFUL IS BETTER THAN UGLY. EXPLICIT IS BETTER THAN IMPLICIT. SIMPLE IS BETTER THAN COMPLEX.")
>>> zen.find("Simple")
65
You can make comparisons between TextBlobs and strings.
.. doctest::
>>> apple_blob = TextBlob("apples")
>>> banana_blob = TextBlob("bananas")
>>> apple_blob < banana_blob
True
>>> apple_blob == "apples"
True
You can concatenate and interpolate TextBlobs and strings.
.. doctest::
>>> apple_blob + " and " + banana_blob
TextBlob("apples and bananas")
>>> "{0} and {1}".format(apple_blob, banana_blob)
'apples and bananas'
`n`-grams
---------
The :class:`TextBlob.ngrams() <TextBlob.ngrams>` method returns a list of tuples of `n` successive words.
.. doctest::
>>> blob = TextBlob("Now is better than never.")
>>> blob.ngrams(n=3)
[WordList(['Now', 'is', 'better']), WordList(['is', 'better', 'than']), WordList(['better', 'than', 'never'])]
Get Start and End Indices of Sentences
--------------------------------------
Use ``sentence.start`` and ``sentence.end`` to get the indices where a sentence starts and ends within a :class:`TextBlob <TextBlob>`.
.. doctest::
>>> for s in zen.sentences:
... print(s)
... print("---- Starts at index {}, Ends at index {}".format(s.start, s.end))
...
Beautiful is better than ugly.
---- Starts at index 0, Ends at index 30
Explicit is better than implicit.
---- Starts at index 31, Ends at index 64
Simple is better than complex.
---- Starts at index 65, Ends at index 95
Next Steps
++++++++++
Want to build your own text classification system? Check out the :ref:`Classifiers Tutorial <classifiers>`.
Want to use a different POS tagger or noun phrase chunker implementation? Check out the :ref:`Advanced Usage <advanced>` guide.
.. [#] http://norvig.com/spell-correct.html
.. [#] http://www.clips.ua.ac.be/pages/pattern-en#spelling
.. [#] http://www.clips.ua.ac.be/pages/pattern-en#parser
================================================
FILE: pyproject.toml
================================================
[project]
name = "textblob"
version = "0.19.0"
description = "Simple, Pythonic text processing. Sentiment analysis, part-of-speech tagging, noun phrase parsing, and more."
readme = "README.rst"
license = { file = "LICENSE" }
authors = [{ name = "Steven Loria", email = "sloria1@gmail.com" }]
classifiers = [
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Topic :: Text Processing :: Linguistic",
]
keywords = ["textblob", "nlp", 'linguistics', 'nltk', 'pattern']
requires-python = ">=3.9"
dependencies = ["nltk>=3.9"]
[project.urls]
Changelog = "https://textblob.readthedocs.io/en/latest/changelog.html"
Issues = "https://github.com/sloria/TextBlob/issues"
Source = "https://github.com/sloria/TextBlob"
[project.optional-dependencies]
docs = ["sphinx==9.1.0", "sphinx-issues==5.0.1", "PyYAML==6.0.3"]
tests = ["pytest", "numpy"]
dev = ["textblob[tests]", "tox", "pre-commit>=3.5,<5.0", "pyright", "ruff"]
[build-system]
requires = ["flit_core<4"]
build-backend = "flit_core.buildapi"
[tool.flit.sdist]
include = ["tests/", "CHANGELOG.rst", "CONTRIBUTING.rst", "tox.ini", "NOTICE"]
[tool.ruff]
src = ["src"]
fix = true
show-fixes = true
unsafe-fixes = true
exclude = [
# Default excludes from ruff
".bzr",
".direnv",
".eggs",
".git",
".git-rewrite",
".hg",
".ipynb_checkpoints",
".mypy_cache",
".nox",
".pants.d",
".pyenv",
".pytest_cache",
".pytype",
".ruff_cache",
".svn",
".tox",
".venv",
".vscode",
"__pypackages__",
"_build",
"buck-out",
"build",
"dist",
"node_modules",
"site-packages",
"venv",
# Vendorized code
"src/textblob/en",
"src/textblob/_text.py",
]
[tool.ruff.format]
docstring-code-format = true
[tool.ruff.lint]
select = [
"B", # flake8-bugbear
"E", # pycodestyle error
"F", # pyflakes
"I", # isort
"UP", # pyupgrade
"W", # pycodestyle warning
"TC", # flake8-typechecking
]
[tool.ruff.lint.per-file-ignores]
"tests/*" = ["E721"]
[tool.pytest.ini_options]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"numpy: marks tests that require numpy",
]
[tool.pyright]
include = ["src/**", "tests/**"]
================================================
FILE: src/textblob/__init__.py
================================================
from .blob import Blobber, Sentence, TextBlob, Word, WordList
__all__ = [
"TextBlob",
"Word",
"Sentence",
"Blobber",
"WordList",
]
================================================
FILE: src/textblob/_text.py
================================================
"""This file is adapted from the pattern library.
URL: http://www.clips.ua.ac.be/pages/pattern-web
Licence: BSD
"""
import codecs
import os
import re
import string
import types
from itertools import chain
from xml.etree import ElementTree
basestring = (str, bytes)
try:
MODULE = os.path.dirname(os.path.abspath(__file__))
except:
MODULE = ""
SLASH, WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA = (
"&slash;",
"word",
"part-of-speech",
"chunk",
"preposition",
"relation",
"anchor",
"lemma",
)
# String functions
def decode_string(v, encoding="utf-8"):
"""Returns the given value as a Unicode string (if possible)."""
if isinstance(encoding, basestring):
encoding = ((encoding,),) + (("windows-1252",), ("utf-8", "ignore"))
if isinstance(v, bytes):
for e in encoding:
try:
return v.decode(*e)
except:
pass
return v
return str(v)
def encode_string(v, encoding="utf-8"):
"""Returns the given value as a Python byte string (if possible)."""
if isinstance(encoding, basestring):
encoding = ((encoding,),) + (("windows-1252",), ("utf-8", "ignore"))
if isinstance(v, str):
for e in encoding:
try:
return v.encode(*e)
except:
pass
return v
return str(v)
decode_utf8 = decode_string
encode_utf8 = encode_string
def isnumeric(strg):
try:
float(strg)
except ValueError:
return False
return True
# --- LAZY DICTIONARY -------------------------------------------------------------------------------
# A lazy dictionary is empty until one of its methods is called.
# This way many instances (e.g., lexicons) can be created without using memory until used.
class lazydict(dict):
def load(self):
# Must be overridden in a subclass.
# Must load data with dict.__setitem__(self, k, v) instead of lazydict[k] = v.
pass
def _lazy(self, method, *args):
"""If the dictionary is empty, calls lazydict.load().
Replaces lazydict.method() with dict.method() and calls it.
"""
if dict.__len__(self) == 0:
self.load()
setattr(self, method, types.MethodType(getattr(dict, method), self))
return getattr(dict, method)(self, *args)
def __repr__(self):
return self._lazy("__repr__")
def __len__(self):
return self._lazy("__len__")
def __iter__(self):
return self._lazy("__iter__")
def __contains__(self, *args):
return self._lazy("__contains__", *args)
def __getitem__(self, *args):
return self._lazy("__getitem__", *args)
def __setitem__(self, *args):
return self._lazy("__setitem__", *args)
def setdefault(self, *args):
return self._lazy("setdefault", *args)
def get(self, *args, **kwargs):
return self._lazy("get", *args)
def items(self):
return self._lazy("items")
def keys(self):
return self._lazy("keys")
def values(self):
return self._lazy("values")
def update(self, *args, **kwargs):
return self._lazy("update", *args)
def pop(self, *args):
return self._lazy("pop", *args)
def popitem(self, *args):
return self._lazy("popitem", *args)
class lazylist(list):
def load(self):
# Must be overridden in a subclass.
# Must load data with list.append(self, v) instead of lazylist.append(v).
pass
def _lazy(self, method, *args):
"""If the list is empty, calls lazylist.load().
Replaces lazylist.method() with list.method() and calls it.
"""
if list.__len__(self) == 0:
self.load()
setattr(self, method, types.MethodType(getattr(list, method), self))
return getattr(list, method)(self, *args)
def __repr__(self):
return self._lazy("__repr__")
def __len__(self):
return self._lazy("__len__")
def __iter__(self):
return self._lazy("__iter__")
def __contains__(self, *args):
return self._lazy("__contains__", *args)
def insert(self, *args):
return self._lazy("insert", *args)
def append(self, *args):
return self._lazy("append", *args)
def extend(self, *args):
return self._lazy("extend", *args)
def remove(self, *args):
return self._lazy("remove", *args)
def pop(self, *args):
return self._lazy("pop", *args)
# --- UNIVERSAL TAGSET ------------------------------------------------------------------------------
# The default part-of-speech tagset used in Pattern is Penn Treebank II.
# However, not all languages are well-suited to Penn Treebank (which was developed for English).
# As more languages are implemented, this is becoming more problematic.
#
# A universal tagset is proposed by Slav Petrov (2012):
# http://www.petrovi.de/data/lrec.pdf
#
# Subclasses of Parser should start implementing
# Parser.parse(tagset=UNIVERSAL) with a simplified tagset.
# The names of the constants correspond to Petrov's naming scheme, while
# the value of the constants correspond to Penn Treebank.
UNIVERSAL = "universal"
NOUN, VERB, ADJ, ADV, PRON, DET, PREP, ADP, NUM, CONJ, INTJ, PRT, PUNC, X = (
"NN",
"VB",
"JJ",
"RB",
"PR",
"DT",
"PP",
"PP",
"NO",
"CJ",
"UH",
"PT",
".",
"X",
)
def penntreebank2universal(token, tag):
"""Returns a (token, tag)-tuple with a simplified universal part-of-speech tag."""
if tag.startswith(("NNP-", "NNPS-")):
return (token, "{}-{}".format(NOUN, tag.split("-")[-1]))
if tag in ("NN", "NNS", "NNP", "NNPS", "NP"):
return (token, NOUN)
if tag in ("MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
return (token, VERB)
if tag in ("JJ", "JJR", "JJS"):
return (token, ADJ)
if tag in ("RB", "RBR", "RBS", "WRB"):
return (token, ADV)
if tag in ("PRP", "PRP$", "WP", "WP$"):
return (token, PRON)
if tag in ("DT", "PDT", "WDT", "EX"):
return (token, DET)
if tag in ("IN",):
return (token, PREP)
if tag in ("CD",):
return (token, NUM)
if tag in ("CC",):
return (token, CONJ)
if tag in ("UH",):
return (token, INTJ)
if tag in ("POS", "RP", "TO"):
return (token, PRT)
if tag in ("SYM", "LS", ".", "!", "?", ",", ":", "(", ")", '"', "#", "$"):
return (token, PUNC)
return (token, X)
# --- TOKENIZER -------------------------------------------------------------------------------------
TOKEN = re.compile(r"(\S+)\s")
# Handle common punctuation marks.
PUNCTUATION = punctuation = ".,;:!?()[]{}`''\"@#$^&*+-|=~_"
# Handle common abbreviations.
ABBREVIATIONS = abbreviations = set(
(
"a.",
"adj.",
"adv.",
"al.",
"a.m.",
"c.",
"cf.",
"comp.",
"conf.",
"def.",
"ed.",
"e.g.",
"esp.",
"etc.",
"ex.",
"f.",
"fig.",
"gen.",
"id.",
"i.e.",
"int.",
"l.",
"m.",
"Med.",
"Mil.",
"Mr.",
"n.",
"n.q.",
"orig.",
"pl.",
"pred.",
"pres.",
"p.m.",
"ref.",
"v.",
"vs.",
"w/",
)
)
RE_ABBR1 = re.compile(r"^[A-Za-z]\.$") # single letter, "T. De Smedt"
RE_ABBR2 = re.compile(r"^([A-Za-z]\.)+$") # alternating letters, "U.S."
RE_ABBR3 = re.compile(
"^[A-Z]["
+ "|".join( # capital followed by consonants, "Mr."
"bcdfghjklmnpqrstvwxz"
)
+ "]+.$"
)
# Handle emoticons.
EMOTICONS = { # (facial expression, sentiment)-keys
("love", +1.00): set(("<3", "♥")),
("grin", +1.00): set(
(">:D", ":-D", ":D", "=-D", "=D", "X-D", "x-D", "XD", "xD", "8-D")
),
("taunt", +0.75): set(
(">:P", ":-P", ":P", ":-p", ":p", ":-b", ":b", ":c)", ":o)", ":^)")
),
("smile", +0.50): set(
(">:)", ":-)", ":)", "=)", "=]", ":]", ":}", ":>", ":3", "8)", "8-)")
),
("wink", +0.25): set((">;]", ";-)", ";)", ";-]", ";]", ";D", ";^)", "*-)", "*)")),
("gasp", +0.05): set((">:o", ":-O", ":O", ":o", ":-o", "o_O", "o.O", "°O°", "°o°")),
("worry", -0.25): set(
(">:/", ":-/", ":/", ":\\", ">:\\", ":-.", ":-s", ":s", ":S", ":-S", ">.>")
),
("frown", -0.75): set(
(">:[", ":-(", ":(", "=(", ":-[", ":[", ":{", ":-<", ":c", ":-c", "=/")
),
("cry", -1.00): set((":'(", ":'''(", ";'(")),
}
TEMP_RE_EMOTICONS = [
r" ?".join([re.escape(each) for each in e]) for v in EMOTICONS.values() for e in v
]
RE_EMOTICONS = re.compile(r"(%s)($|\s)" % "|".join(TEMP_RE_EMOTICONS))
# Handle sarcasm punctuation (!).
RE_SARCASM = re.compile(r"\( ?\! ?\)")
# Handle common contractions.
replacements = {
"'d": " 'd",
"'m": " 'm",
"'s": " 's",
"'ll": " 'll",
"'re": " 're",
"'ve": " 've",
"n't": " n't",
}
# Handle paragraph line breaks (\n\n marks end of sentence).
EOS = "END-OF-SENTENCE"
def find_tokens(
string,
punctuation=PUNCTUATION,
abbreviations=ABBREVIATIONS,
replace=replacements,
linebreak=r"\n{2,}",
):
"""Returns a list of sentences. Each sentence is a space-separated string of tokens (words).
Handles common cases of abbreviations (e.g., etc., ...).
Punctuation marks are split from other words. Periods (or ?!) mark the end of a sentence.
Headings without an ending period are inferred by line breaks.
"""
# Handle periods separately.
punctuation = tuple(punctuation.replace(".", ""))
# Handle replacements (contractions).
for a, b in list(replace.items()):
string = re.sub(a, b, string)
# Handle Unicode quotes.
if isinstance(string, str):
string = (
str(string)
.replace("“", " “ ")
.replace("”", " ” ")
.replace("‘", " ‘ ")
.replace("’", " ’ ")
.replace("'", " ' ")
.replace('"', ' " ')
)
# Collapse whitespace.
string = re.sub("\r\n", "\n", string)
string = re.sub(linebreak, " %s " % EOS, string)
string = re.sub(r"\s+", " ", string)
tokens = []
for t in TOKEN.findall(string + " "):
if len(t) > 0:
tail = []
while t.startswith(punctuation) and t not in replace:
# Split leading punctuation.
if t.startswith(punctuation):
tokens.append(t[0])
t = t[1:]
while t.endswith(punctuation + (".",)) and t not in replace:
# Split trailing punctuation.
if t.endswith(punctuation):
tail.append(t[-1])
t = t[:-1]
# Split ellipsis (...) before splitting period.
if t.endswith("..."):
tail.append("...")
t = t[:-3].rstrip(".")
# Split period (if not an abbreviation).
if t.endswith("."):
if (
t in abbreviations
or RE_ABBR1.match(t) is not None
or RE_ABBR2.match(t) is not None
or RE_ABBR3.match(t) is not None
):
break
else:
tail.append(t[-1])
t = t[:-1]
if t != "":
tokens.append(t)
tokens.extend(reversed(tail))
sentences, i, j = [[]], 0, 0
while j < len(tokens):
if tokens[j] in ("...", ".", "!", "?", EOS):
# Handle citations, trailing parenthesis, repeated punctuation (!?).
while j < len(tokens) and tokens[j] in (
"'",
'"',
"”",
"’",
"...",
".",
"!",
"?",
")",
EOS,
):
if tokens[j] in ("'", '"') and sentences[-1].count(tokens[j]) % 2 == 0:
break # Balanced quotes.
j += 1
sentences[-1].extend(t for t in tokens[i:j] if t != EOS)
sentences.append([])
i = j
j += 1
sentences[-1].extend(tokens[i:j])
sentences = (" ".join(s) for s in sentences if len(s) > 0)
sentences = (RE_SARCASM.sub("(!)", s) for s in sentences)
sentences = [
RE_EMOTICONS.sub(lambda m: m.group(1).replace(" ", "") + m.group(2), s)
for s in sentences
]
return sentences
#### LEXICON #######################################################################################
# --- LEXICON ---------------------------------------------------------------------------------------
# Pattern's text parsers are based on Brill's algorithm.
# Brill's algorithm automatically acquires a lexicon of known words,
# and a set of rules for tagging unknown words from a training corpus.
# Lexical rules are used to tag unknown words, based on the word morphology (prefix, suffix, ...).
# Contextual rules are used to tag all words, based on the word's role in the sentence.
# Named entity rules are used to discover proper nouns (NNP's).
def _read(path, encoding="utf-8", comment=";;;"):
"""Returns an iterator over the lines in the file at the given path,
stripping comments and decoding each line to Unicode.
"""
if path:
if isinstance(path, basestring) and os.path.exists(path):
# From file path.
f = open(path, encoding="utf-8")
elif isinstance(path, basestring):
# From string.
f = path.splitlines()
elif hasattr(path, "read"):
# From string buffer.
f = path.read().splitlines()
else:
f = path
for i, line in enumerate(f):
line = (
line.strip(codecs.BOM_UTF8)
if i == 0 and isinstance(line, bytes)
else line
)
line = line.strip()
line = decode_utf8(line)
if not line or (comment and line.startswith(comment)):
continue
yield line
return
class Lexicon(lazydict):
def __init__(
self,
path="",
morphology="",
context="",
entities="",
NNP="NNP",
language=None,
):
"""A dictionary of words and their part-of-speech tags.
For unknown words, rules for word morphology, context and named entities can be used.
"""
self._path = path
self._language = language
self.morphology = Morphology(self, path=morphology)
self.context = Context(self, path=context)
self.entities = Entities(self, path=entities, tag=NNP)
def load(self):
# Arnold NNP x
dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
@property
def path(self):
return self._path
@property
def language(self):
return self._language
# --- MORPHOLOGICAL RULES ---------------------------------------------------------------------------
# Brill's algorithm generates lexical (i.e., morphological) rules in the following format:
# NN s fhassuf 1 NNS x => unknown words ending in -s and tagged NN change to NNS.
# ly hassuf 2 RB x => unknown words ending in -ly change to RB.
class Rules:
def __init__(self, lexicon=None, cmd=None):
if cmd is None:
cmd = {}
if lexicon is None:
lexicon = {}
self.lexicon, self.cmd = lexicon, cmd
def apply(self, x):
"""Applies the rule to the given token or list of tokens."""
return x
class Morphology(lazylist, Rules):
def __init__(self, lexicon=None, path=""):
"""A list of rules based on word morphology (prefix, suffix)."""
if lexicon is None:
lexicon = {}
cmd = (
"char", # Word contains x.
"haspref", # Word starts with x.
"hassuf", # Word end with x.
"addpref", # x + word is in lexicon.
"addsuf", # Word + x is in lexicon.
"deletepref", # Word without x at the start is in lexicon.
"deletesuf", # Word without x at the end is in lexicon.
"goodleft", # Word preceded by word x.
"goodright", # Word followed by word x.
)
cmd = dict.fromkeys(cmd, True)
cmd.update(("f" + k, v) for k, v in list(cmd.items()))
Rules.__init__(self, lexicon, cmd)
self._path = path
@property
def path(self):
return self._path
def load(self):
# ["NN", "s", "fhassuf", "1", "NNS", "x"]
list.extend(self, (x.split() for x in _read(self._path)))
def apply(self, token, previous=(None, None), next=(None, None)):
"""Applies lexical rules to the given token, which is a [word, tag] list."""
w = token[0]
for r in self:
if r[1] in self.cmd: # Rule = ly hassuf 2 RB x
f, x, pos, cmd = bool(0), r[0], r[-2], r[1].lower()
if r[2] in self.cmd: # Rule = NN s fhassuf 1 NNS x
f, x, pos, cmd = bool(1), r[1], r[-2], r[2].lower().lstrip("f")
if f and token[1] != r[0]:
continue
if (
(cmd == "char" and x in w)
or (cmd == "haspref" and w.startswith(x))
or (cmd == "hassuf" and w.endswith(x))
or (cmd == "addpref" and x + w in self.lexicon)
or (cmd == "addsuf" and w + x in self.lexicon)
or (
cmd == "deletepref"
and w.startswith(x)
and w[len(x) :] in self.lexicon
)
or (
cmd == "deletesuf"
and w.endswith(x)
and w[: -len(x)] in self.lexicon
)
or (cmd == "goodleft" and x == next[0])
or (cmd == "goodright" and x == previous[0])
):
token[1] = pos
return token
def insert(self, i, tag, affix, cmd="hassuf", tagged=None):
"""Inserts a new rule that assigns the given tag to words with the given affix,
e.g., Morphology.append("RB", "-ly").
"""
if affix.startswith("-") and affix.endswith("-"):
affix, cmd = affix[+1:-1], "char"
if affix.startswith("-"):
affix, cmd = affix[+1:-0], "hassuf"
if affix.endswith("-"):
affix, cmd = affix[+0:-1], "haspref"
if tagged:
r = [tagged, affix, "f" + cmd.lstrip("f"), tag, "x"]
else:
r = [affix, cmd.lstrip("f"), tag, "x"]
lazylist.insert(self, i, r)
def append(self, *args, **kwargs):
self.insert(len(self) - 1, *args, **kwargs)
def extend(self, rules=None):
if rules is None:
rules = []
for r in rules:
self.append(*r)
# --- CONTEXT RULES ---------------------------------------------------------------------------------
# Brill's algorithm generates contextual rules in the following format:
# VBD VB PREVTAG TO => unknown word tagged VBD changes to VB if preceded by a word tagged TO.
class Context(lazylist, Rules):
def __init__(self, lexicon=None, path=""):
"""A list of rules based on context (preceding and following words)."""
if lexicon is None:
lexicon = {}
cmd = (
"prevtag", # Preceding word is tagged x.
"nexttag", # Following word is tagged x.
"prev2tag", # Word 2 before is tagged x.
"next2tag", # Word 2 after is tagged x.
"prev1or2tag", # One of 2 preceding words is tagged x.
"next1or2tag", # One of 2 following words is tagged x.
"prev1or2or3tag", # One of 3 preceding words is tagged x.
"next1or2or3tag", # One of 3 following words is tagged x.
"surroundtag", # Preceding word is tagged x and following word is tagged y.
"curwd", # Current word is x.
"prevwd", # Preceding word is x.
"nextwd", # Following word is x.
"prev1or2wd", # One of 2 preceding words is x.
"next1or2wd", # One of 2 following words is x.
"next1or2or3wd", # One of 3 preceding words is x.
"prev1or2or3wd", # One of 3 following words is x.
"prevwdtag", # Preceding word is x and tagged y.
"nextwdtag", # Following word is x and tagged y.
"wdprevtag", # Current word is y and preceding word is tagged x.
"wdnexttag", # Current word is x and following word is tagged y.
"wdand2aft", # Current word is x and word 2 after is y.
"wdand2tagbfr", # Current word is y and word 2 before is tagged x.
"wdand2tagaft", # Current word is x and word 2 after is tagged y.
"lbigram", # Current word is y and word before is x.
"rbigram", # Current word is x and word after is y.
"prevbigram", # Preceding word is tagged x and word before is tagged y.
"nextbigram", # Following word is tagged x and word after is tagged y.
)
Rules.__init__(self, lexicon, dict.fromkeys(cmd, True))
self._path = path
@property
def path(self):
return self._path
def load(self):
# ["VBD", "VB", "PREVTAG", "TO"]
list.extend(self, (x.split() for x in _read(self._path)))
def apply(self, tokens):
"""Applies contextual rules to the given list of tokens,
where each token is a [word, tag] list.
"""
o = [("STAART", "STAART")] * 3 # Empty delimiters for look ahead/back.
t = o + tokens + o
for i, token in enumerate(t):
for r in self:
if token[1] == "STAART":
continue
if token[1] != r[0] and r[0] != "*":
continue
cmd, x, y = r[2], r[3], r[4] if len(r) > 4 else ""
cmd = cmd.lower()
if (
(cmd == "prevtag" and x == t[i - 1][1])
or (cmd == "nexttag" and x == t[i + 1][1])
or (cmd == "prev2tag" and x == t[i - 2][1])
or (cmd == "next2tag" and x == t[i + 2][1])
or (cmd == "prev1or2tag" and x in (t[i - 1][1], t[i - 2][1]))
or (cmd == "next1or2tag" and x in (t[i + 1][1], t[i + 2][1]))
or (
cmd == "prev1or2or3tag"
and x in (t[i - 1][1], t[i - 2][1], t[i - 3][1])
)
or (
cmd == "next1or2or3tag"
and x in (t[i + 1][1], t[i + 2][1], t[i + 3][1])
)
or (cmd == "surroundtag" and x == t[i - 1][1] and y == t[i + 1][1])
or (cmd == "curwd" and x == t[i + 0][0])
or (cmd == "prevwd" and x == t[i - 1][0])
or (cmd == "nextwd" and x == t[i + 1][0])
or (cmd == "prev1or2wd" and x in (t[i - 1][0], t[i - 2][0]))
or (cmd == "next1or2wd" and x in (t[i + 1][0], t[i + 2][0]))
or (cmd == "prevwdtag" and x == t[i - 1][0] and y == t[i - 1][1])
or (cmd == "nextwdtag" and x == t[i + 1][0] and y == t[i + 1][1])
or (cmd == "wdprevtag" and x == t[i - 1][1] and y == t[i + 0][0])
or (cmd == "wdnexttag" and x == t[i + 0][0] and y == t[i + 1][1])
or (cmd == "wdand2aft" and x == t[i + 0][0] and y == t[i + 2][0])
or (cmd == "wdand2tagbfr" and x == t[i - 2][1] and y == t[i + 0][0])
or (cmd == "wdand2tagaft" and x == t[i + 0][0] and y == t[i + 2][1])
or (cmd == "lbigram" and x == t[i - 1][0] and y == t[i + 0][0])
or (cmd == "rbigram" and x == t[i + 0][0] and y == t[i + 1][0])
or (cmd == "prevbigram" and x == t[i - 2][1] and y == t[i - 1][1])
or (cmd == "nextbigram" and x == t[i + 1][1] and y == t[i + 2][1])
):
t[i] = [t[i][0], r[1]]
return t[len(o) : -len(o)]
def insert(self, i, tag1, tag2, cmd="prevtag", x=None, y=None, *args):
"""Inserts a new rule that updates words with tag1 to tag2,
given constraints x and y, e.g., Context.append("TO < NN", "VB")
"""
if " < " in tag1 and not x and not y:
tag1, x = tag1.split(" < ")
cmd = "prevtag"
if " > " in tag1 and not x and not y:
x, tag1 = tag1.split(" > ")
cmd = "nexttag"
lazylist.insert(self, i, [tag1, tag2, cmd, x or "", y or ""])
def append(self, *args, **kwargs):
self.insert(len(self) - 1, *args, **kwargs)
def extend(self, rules=None, *args):
if rules is None:
rules = []
for r in rules:
self.append(*r)
# --- NAMED ENTITY RECOGNIZER -----------------------------------------------------------------------
RE_ENTITY1 = re.compile(r"^http://") # http://www.domain.com/path
RE_ENTITY2 = re.compile(r"^www\..*?\.[com|org|net|edu|de|uk]$") # www.domain.com
RE_ENTITY3 = re.compile(r"^[\w\-\.\+]+@(\w[\w\-]+\.)+[\w\-]+$") # name@domain.com
class Entities(lazydict, Rules):
def __init__(self, lexicon=None, path="", tag="NNP"):
"""A dictionary of named entities and their labels.
For domain names and e-mail adresses, regular expressions are used.
"""
if lexicon is None:
lexicon = {}
cmd = (
"pers", # Persons: George/NNP-PERS
"loc", # Locations: Washington/NNP-LOC
"org", # Organizations: Google/NNP-ORG
)
Rules.__init__(self, lexicon, cmd)
self._path = path
self.tag = tag
@property
def path(self):
return self._path
def load(self):
# ["Alexander", "the", "Great", "PERS"]
# {"alexander": [["alexander", "the", "great", "pers"], ...]}
for x in _read(self.path):
x = [x.lower() for x in x.split()]
dict.setdefault(self, x[0], []).append(x)
def apply(self, tokens):
"""Applies the named entity recognizer to the given list of tokens,
where each token is a [word, tag] list.
"""
# Note: we could also scan for patterns, e.g.,
# "my|his|her name is|was *" => NNP-PERS.
i = 0
while i < len(tokens):
w = tokens[i][0].lower()
if RE_ENTITY1.match(w) or RE_ENTITY2.match(w) or RE_ENTITY3.match(w):
tokens[i][1] = self.tag
if w in self:
for e in self[w]:
# Look ahead to see if successive words match the named entity.
e, tag = (
(e[:-1], "-" + e[-1].upper()) if e[-1] in self.cmd else (e, "")
)
b = True
for j, e in enumerate(e):
if i + j >= len(tokens) or tokens[i + j][0].lower() != e:
b = False
break
if b:
for token in tokens[i : i + j + 1]:
token[1] = (
token[1] == "NNPS" and token[1] or self.tag
) + tag
i += j
break
i += 1
return tokens
def append(self, entity, name="pers"):
"""Appends a named entity to the lexicon,
e.g., Entities.append("Hooloovoo", "PERS")
"""
e = [s.lower() for s in entity.split(" ") + [name]]
self.setdefault(e[0], []).append(e)
def extend(self, entities):
for entity, name in entities:
self.append(entity, name)
### SENTIMENT POLARITY LEXICON #####################################################################
# A sentiment lexicon can be used to discern objective facts from subjective opinions in text.
# Each word in the lexicon has scores for:
# 1) polarity: negative vs. positive (-1.0 => +1.0)
# 2) subjectivity: objective vs. subjective (+0.0 => +1.0)
# 3) intensity: modifies next word? (x0.5 => x2.0)
# For English, adverbs are used as modifiers (e.g., "very good").
# For Dutch, adverbial adjectives are used as modifiers
# ("hopeloos voorspelbaar", "ontzettend spannend", "verschrikkelijk goed").
# Negation words (e.g., "not") reverse the polarity of the following word.
# Sentiment()(txt) returns an averaged (polarity, subjectivity)-tuple.
# Sentiment().assessments(txt) returns a list of (chunk, polarity, subjectivity, label)-tuples.
# Semantic labels are useful for fine-grained analysis, e.g.,
# negative words + positive emoticons could indicate cynicism.
# Semantic labels:
MOOD = "mood" # emoticons, emojis
IRONY = "irony" # sarcasm mark (!)
NOUN, VERB, ADJECTIVE, ADVERB = "NN", "VB", "JJ", "RB"
RE_SYNSET = re.compile(r"^[acdnrv][-_][0-9]+$")
def avg(list):
return sum(list) / float(len(list) or 1)
class Score(tuple):
def __new__(self, polarity, subjectivity, assessments=None):
"""A (polarity, subjectivity)-tuple with an assessments property."""
if assessments is None:
assessments = []
return tuple.__new__(self, [polarity, subjectivity])
def __init__(self, polarity, subjectivity, assessments=None):
if assessments is None:
assessments = []
self.assessments = assessments
class Sentiment(lazydict):
def __init__(self, path="", language=None, synset=None, confidence=None, **kwargs):
"""A dictionary of words (adjectives) and polarity scores (positive/negative).
The value for each word is a dictionary of part-of-speech tags.
The value for each word POS-tag is a tuple with values for
polarity (-1.0-1.0), subjectivity (0.0-1.0) and intensity (0.5-2.0).
"""
self._path = path # XML file path.
self._language = None # XML language attribute ("en", "fr", ...)
self._confidence = None # XML confidence attribute threshold (>=).
self._synset = synset # XML synset attribute ("wordnet_id", "cornetto_id", ...)
self._synsets = {} # {"a-01123879": (1.0, 1.0, 1.0)}
self.labeler = {} # {"dammit": "profanity"}
self.tokenizer = kwargs.get("tokenizer", find_tokens)
self.negations = kwargs.get("negations", ("no", "not", "n't", "never"))
self.modifiers = kwargs.get("modifiers", ("RB",))
self.modifier = kwargs.get("modifier", lambda w: w.endswith("ly"))
@property
def path(self):
return self._path
@property
def language(self):
return self._language
@property
def confidence(self):
return self._confidence
def load(self, path=None):
"""Loads the XML-file (with sentiment annotations) from the given path.
By default, Sentiment.path is lazily loaded.
"""
# <word form="great" wordnet_id="a-01123879" pos="JJ" polarity="1.0" subjectivity="1.0" intensity="1.0" />
# <word form="damnmit" polarity="-0.75" subjectivity="1.0" label="profanity" />
if not path:
path = self._path
if not os.path.exists(path):
return
words, synsets, labels = {}, {}, {}
xml = ElementTree.parse(path)
xml = xml.getroot()
for w in xml.findall("word"):
if self._confidence is None or self._confidence <= float(
w.attrib.get("confidence", 0.0)
):
w, pos, p, s, i, label, synset = (
w.attrib.get("form"),
w.attrib.get("pos"),
w.attrib.get("polarity", 0.0),
w.attrib.get("subjectivity", 0.0),
w.attrib.get("intensity", 1.0),
w.attrib.get("label"),
w.attrib.get(self._synset), # wordnet_id, cornetto_id, ...
)
psi = (float(p), float(s), float(i))
if w:
words.setdefault(w, {}).setdefault(pos, []).append(psi)
if w and label:
labels[w] = label
if synset:
synsets.setdefault(synset, []).append(psi)
self._language = xml.attrib.get("language", self._language)
# Average scores of all word senses per part-of-speech tag.
for w in words:
words[w] = dict(
(pos, [avg(each) for each in zip(*psi)])
for pos, psi in words[w].items()
)
# Average scores of all part-of-speech tags.
for w, pos in list(words.items()):
words[w][None] = [avg(each) for each in zip(*pos.values())]
# Average scores of all synonyms per synset.
for id, psi in synsets.items():
synsets[id] = [avg(each) for each in zip(*psi)]
dict.update(self, words)
dict.update(self.labeler, labels)
dict.update(self._synsets, synsets)
def synset(self, id, pos=ADJECTIVE):
"""Returns a (polarity, subjectivity)-tuple for the given synset id.
For example, the adjective "horrible" has id 193480 in WordNet:
Sentiment.synset(193480, pos="JJ") => (-0.6, 1.0, 1.0).
"""
id = str(id).zfill(8)
if not id.startswith(("n-", "v-", "a-", "r-")):
if pos == NOUN:
id = "n-" + id
if pos == VERB:
id = "v-" + id
if pos == ADJECTIVE:
id = "a-" + id
if pos == ADVERB:
id = "r-" + id
if dict.__len__(self) == 0:
self.load()
return tuple(self._synsets.get(id, (0.0, 0.0))[:2])
def __call__(self, s, negation=True, **kwargs):
"""Returns a (polarity, subjectivity)-tuple for the given sentence,
with polarity between -1.0 and 1.0 and subjectivity between 0.0 and 1.0.
The sentence can be a string, Synset, Text, Sentence, Chunk, Word, Document, Vector.
An optional weight parameter can be given,
as a function that takes a list of words and returns a weight.
"""
def avg(assessments, weighted=lambda w: 1):
s, n = 0, 0
for words, score in assessments:
w = weighted(words)
s += w * score
n += w
return s / float(n or 1)
# A pattern.en.wordnet.Synset.
# Sentiment(synsets("horrible", "JJ")[0]) => (-0.6, 1.0)
if hasattr(s, "gloss"):
a = [(s.synonyms[0],) + self.synset(s.id, pos=s.pos) + (None,)]
# A synset id.
# Sentiment("a-00193480") => horrible => (-0.6, 1.0) (English WordNet)
# Sentiment("c_267") => verschrikkelijk => (-0.9, 1.0) (Dutch Cornetto)
elif (
isinstance(s, basestring) and RE_SYNSET.match(s) and hasattr(s, "synonyms")
):
a = [(s.synonyms[0],) + self.synset(s.id, pos=s.pos) + (None,)]
# A string of words.
# Sentiment("a horrible movie") => (-0.6, 1.0)
elif isinstance(s, basestring):
a = self.assessments(
((w.lower(), None) for w in " ".join(self.tokenizer(s)).split()),
negation,
)
# A pattern.en.Text.
elif hasattr(s, "sentences"):
a = self.assessments(
(
(w.lemma or w.string.lower(), w.pos[:2])
for w in chain.from_iterable(s)
),
negation,
)
# A pattern.en.Sentence or pattern.en.Chunk.
elif hasattr(s, "lemmata"):
a = self.assessments(
((w.lemma or w.string.lower(), w.pos[:2]) for w in s.words), negation
)
# A pattern.en.Word.
elif hasattr(s, "lemma"):
a = self.assessments(((s.lemma or s.string.lower(), s.pos[:2]),), negation)
# A pattern.vector.Document.
# Average score = weighted average using feature weights.
# Bag-of words is unordered: inject None between each two words
# to stop assessments() from scanning for preceding negation & modifiers.
elif hasattr(s, "terms"):
a = self.assessments(
chain.from_iterable(((w, None), (None, None)) for w in s), negation
)
kwargs.setdefault("weight", lambda w: s.terms[w[0]])
# A dict of (word, weight)-items.
elif isinstance(s, dict):
a = self.assessments(
chain.from_iterable(((w, None), (None, None)) for w in s), negation
)
kwargs.setdefault("weight", lambda w: s[w[0]])
# A list of words.
elif isinstance(s, list):
a = self.assessments(((w, None) for w in s), negation)
else:
a = []
weight = kwargs.get("weight", lambda w: 1) # [(w, p) for w, p, s, x in a]
return Score(
polarity=avg([(w, p) for w, p, s, x in a], weight),
subjectivity=avg([(w, s) for w, p, s, x in a], weight),
assessments=a,
)
def assessments(self, words=None, negation=True):
"""Returns a list of (chunk, polarity, subjectivity, label)-tuples for the given list of words:
where chunk is a list of successive words: a known word optionally
preceded by a modifier ("very good") or a negation ("not good").
"""
if words is None:
words = []
a = []
m = None # Preceding modifier (i.e., adverb or adjective).
n = None # Preceding negation (e.g., "not beautiful").
for w, pos in words:
# Only assess known words, preferably by part-of-speech tag.
# Including unknown words (polarity 0.0 and subjectivity 0.0) lowers the average.
if w is None:
continue
if w in self and pos in self[w]:
p, s, i = self[w][pos]
# Known word not preceded by a modifier ("good").
if m is None:
a.append(dict(w=[w], p=p, s=s, i=i, n=1, x=self.labeler.get(w)))
# Known word preceded by a modifier ("really good").
if m is not None:
a[-1]["w"].append(w)
a[-1]["p"] = max(-1.0, min(p * a[-1]["i"], +1.0))
a[-1]["s"] = max(-1.0, min(s * a[-1]["i"], +1.0))
a[-1]["i"] = i
a[-1]["x"] = self.labeler.get(w)
# Known word preceded by a negation ("not really good").
if n is not None:
a[-1]["w"].insert(0, n)
a[-1]["i"] = 1.0 / a[-1]["i"]
a[-1]["n"] = -1
# Known word may be a negation.
# Known word may be modifying the next word (i.e., it is a known adverb).
m = None
n = None
if (
pos
and pos in self.modifiers
or any(map(self[w].__contains__, self.modifiers))
):
m = (w, pos)
if negation and w in self.negations:
n = w
else:
# Unknown word may be a negation ("not good").
if negation and w in self.negations:
n = w
# Unknown word. Retain negation across small words ("not a good").
elif n and len(w.strip("'")) > 1:
n = None
# Unknown word may be a negation preceded by a modifier ("really not good").
if (
n is not None
and m is not None
and (pos in self.modifiers or self.modifier(m[0]))
):
a[-1]["w"].append(n)
a[-1]["n"] = -1
n = None
# Unknown word. Retain modifier across small words ("really is a good").
elif m and len(w) > 2:
m = None
# Exclamation marks boost previous word.
if w == "!" and len(a) > 0:
a[-1]["w"].append("!")
a[-1]["p"] = max(-1.0, min(a[-1]["p"] * 1.25, +1.0))
# Exclamation marks in parentheses indicate sarcasm.
if w == "(!)":
a.append(dict(w=[w], p=0.0, s=1.0, i=1.0, n=1, x=IRONY))
# EMOTICONS: {("grin", +1.0): set((":-D", ":D"))}
if (
w.isalpha() is False and len(w) <= 5 and w not in PUNCTUATION
): # speedup
for (_type, p), e in EMOTICONS.items():
if w in map(lambda e: e.lower(), e):
a.append(dict(w=[w], p=p, s=1.0, i=1.0, n=1, x=MOOD))
break
for i in range(len(a)):
w = a[i]["w"]
p = a[i]["p"]
s = a[i]["s"]
n = a[i]["n"]
x = a[i]["x"]
# "not good" = slightly bad, "not bad" = slightly good.
a[i] = (w, p * -0.5 if n < 0 else p, s, x)
return a
def annotate(
self, word, pos=None, polarity=0.0, subjectivity=0.0, intensity=1.0, label=None
):
"""Annotates the given word with polarity, subjectivity and intensity scores,
and optionally a semantic label (e.g., MOOD for emoticons, IRONY for "(!)").
"""
w = self.setdefault(word, {})
w[pos] = w[None] = (polarity, subjectivity, intensity)
if label:
self.labeler[word] = label
# --- PART-OF-SPEECH TAGGER -------------------------------------------------------------------------
# Unknown words are recognized as numbers if they contain only digits and -,.:/%$
CD = re.compile(r"^[0-9\-\,\.\:\/\%\$]+$")
def _suffix_rules(token, tag="NN"):
"""Default morphological tagging rules for English, based on word suffixes."""
if isinstance(token, (list, tuple)):
token, tag = token
if token.endswith("ing"):
tag = "VBG"
if token.endswith("ly"):
tag = "RB"
if token.endswith("s") and not token.endswith(("is", "ous", "ss")):
tag = "NNS"
if (
token.endswith(
("able", "al", "ful", "ible", "ient", "ish", "ive", "less", "tic", "ous")
)
or "-" in token
):
tag = "JJ"
if token.endswith("ed"):
tag = "VBN"
if token.endswith(("ate", "ify", "ise", "ize")):
tag = "VBP"
return [token, tag]
def find_tags(
tokens,
lexicon=None,
model=None,
morphology=None,
context=None,
entities=None,
default=("NN", "NNP", "CD"),
language="en",
map=None,
**kwargs,
):
"""Returns a list of [token, tag]-items for the given list of tokens:
["The", "cat", "purs"] => [["The", "DT"], ["cat", "NN"], ["purs", "VB"]]
Words are tagged using the given lexicon of (word, tag)-items.
Unknown words are tagged NN by default.
Unknown words that start with a capital letter are tagged NNP (unless language="de").
Unknown words that consist only of digits and punctuation marks are tagged CD.
Unknown words are then improved with morphological rules.
All words are improved with contextual rules.
If a model is given, uses model for unknown words instead of morphology and context.
If map is a function, it is applied to each (token, tag) after applying all rules.
"""
if lexicon is None:
lexicon = {}
tagged = []
# Tag known words.
for i, token in enumerate(tokens):
tagged.append(
[token, lexicon.get(token, i == 0 and lexicon.get(token.lower()) or None)]
)
# Tag unknown words.
for i, (token, tag) in enumerate(tagged):
prev, next = (None, None), (None, None)
if i > 0:
prev = tagged[i - 1]
if i < len(tagged) - 1:
next = tagged[i + 1]
if tag is None or token in (model is not None and model.unknown or ()):
# Use language model (i.e., SLP).
if model is not None:
tagged[i] = model.apply([token, None], prev, next)
# Use NNP for capitalized words (except in German).
elif token.istitle() and language != "de":
tagged[i] = [token, default[1]]
# Use CD for digits and numbers.
elif CD.match(token) is not None:
tagged[i] = [token, default[2]]
# Use suffix rules (e.g., -ly = RB).
elif morphology is not None:
tagged[i] = morphology.apply([token, default[0]], prev, next)
# Use suffix rules (English default).
elif language == "en":
tagged[i] = _suffix_rules([token, default[0]])
# Use most frequent tag (NN).
else:
tagged[i] = [token, default[0]]
# Tag words by context.
if context is not None and model is None:
tagged = context.apply(tagged)
# Tag named entities.
if entities is not None:
tagged = entities.apply(tagged)
# Map tags with a custom function.
if map is not None:
tagged = [list(map(token, tag)) or [token, default[0]] for token, tag in tagged]
return tagged
# --- PHRASE CHUNKER --------------------------------------------------------------------------------
SEPARATOR = "/"
NN = r"NN|NNS|NNP|NNPS|NNPS?\-[A-Z]{3,4}|PR|PRP|PRP\$"
VB = r"VB|VBD|VBG|VBN|VBP|VBZ"
JJ = r"JJ|JJR|JJS"
RB = r"(?<!W)RB|RBR|RBS"
# Chunking rules.
# CHUNKS[0] = Germanic: RB + JJ precedes NN ("the round table").
# CHUNKS[1] = Romance: RB + JJ precedes or follows NN ("la table ronde", "une jolie fille").
CHUNKS = [
[
# Germanic languages: en, de, nl, ...
(
"NP",
re.compile(
r"(("
+ NN
+ ")/)*((DT|CD|CC|CJ)/)*(("
+ RB
+ "|"
+ JJ
+ ")/)*(("
+ NN
+ ")/)+"
),
),
("VP", re.compile(r"(((MD|" + RB + ")/)*((" + VB + ")/)+)+")),
("VP", re.compile(r"((MD)/)")),
("PP", re.compile(r"((IN|PP|TO)/)+")),
("ADJP", re.compile(r"((CC|CJ|" + RB + "|" + JJ + ")/)*((" + JJ + ")/)+")),
("ADVP", re.compile(r"((" + RB + "|WRB)/)+")),
],
[
# Romance languages: es, fr, it, ...
(
"NP",
re.compile(
r"(("
+ NN
+ ")/)*((DT|CD|CC|CJ)/)*(("
+ RB
+ "|"
+ JJ
+ ")/)*(("
+ NN
+ ")/)+(("
+ RB
+ "|"
+ JJ
+ ")/)*"
),
),
("VP", re.compile(r"(((MD|" + RB + ")/)*((" + VB + ")/)+((" + RB + ")/)*)+")),
("VP", re.compile(r"((MD)/)")),
("PP", re.compile(r"((IN|PP|TO)/)+")),
("ADJP", re.compile(r"((CC|CJ|" + RB + "|" + JJ + ")/)*((" + JJ + ")/)+")),
("ADVP", re.compile(r"((" + RB + "|WRB)/)+")),
],
]
# Handle ADJP before VP, so that
# RB prefers next ADJP over previous VP.
CHUNKS[0].insert(1, CHUNKS[0].pop(3))
CHUNKS[1].insert(1, CHUNKS[1].pop(3))
def find_chunks(tagged, language="en"):
"""The input is a list of [token, tag]-items.
The output is a list of [token, tag, chunk]-items:
The/DT nice/JJ fish/NN is/VBZ dead/JJ ./. =>
The/DT/B-NP nice/JJ/I-NP fish/NN/I-NP is/VBZ/B-VP dead/JJ/B-ADJP ././O
"""
chunked = [x for x in tagged]
tags = "".join(f"{tag}{SEPARATOR}" for token, tag in tagged)
# Use Germanic or Romance chunking rules according to given language.
for tag, rule in CHUNKS[
int(language in ("ca", "es", "pt", "fr", "it", "pt", "ro"))
]:
for m in rule.finditer(tags):
# Find the start of chunks inside the tags-string.
# Number of preceding separators = number of preceding tokens.
i = m.start()
j = tags[:i].count(SEPARATOR)
n = m.group(0).count(SEPARATOR)
for k in range(j, j + n):
if len(chunked[k]) == 3:
continue
if len(chunked[k]) < 3:
# A conjunction can not be start of a chunk.
if k == j and chunked[k][1] in ("CC", "CJ", "KON", "Conj(neven)"):
j += 1
# Mark first token in chunk with B-.
elif k == j:
chunked[k].append("B-" + tag)
# Mark other tokens in chunk with I-.
else:
chunked[k].append("I-" + tag)
# Mark chinks (tokens outside of a chunk) with O-.
for chink in filter(lambda x: len(x) < 3, chunked):
chink.append("O")
# Post-processing corrections.
for i, (_word, tag, chunk) in enumerate(chunked):
if tag.startswith("RB") and chunk == "B-NP":
# "Very nice work" (NP) <=> "Perhaps" (ADVP) + "you" (NP).
if i < len(chunked) - 1 and not chunked[i + 1][1].startswith("JJ"):
chunked[i + 0][2] = "B-ADVP"
chunked[i + 1][2] = "B-NP"
return chunked
def find_prepositions(chunked):
"""The input is a list of [token, tag, chunk]-items.
The output is a list of [token, tag, chunk, preposition]-items.
PP-chunks followed by NP-chunks make up a PNP-chunk.
"""
# Tokens that are not part of a preposition just get the O-tag.
for ch in chunked:
ch.append("O")
for i, chunk in enumerate(chunked):
if chunk[2].endswith("PP") and chunk[-1] == "O":
# Find PP followed by other PP, NP with nouns and pronouns, VP with a gerund.
if i < len(chunked) - 1 and (
chunked[i + 1][2].endswith(("NP", "PP"))
or chunked[i + 1][1] in ("VBG", "VBN")
):
chunk[-1] = "B-PNP"
pp = True
for ch in chunked[i + 1 :]:
if not (ch[2].endswith(("NP", "PP")) or ch[1] in ("VBG", "VBN")):
break
if ch[2].endswith("PP") and pp:
ch[-1] = "I-PNP"
if not ch[2].endswith("PP"):
ch[-1] = "I-PNP"
pp = False
return chunked
#### PARSER ########################################################################################
# --- PARSER ----------------------------------------------------------------------------------------
# A shallow parser can be used to retrieve syntactic-semantic information from text
# in an efficient way (usually at the expense of deeper configurational syntactic information).
# The shallow parser in Pattern is meant to handle the following tasks:
# 1) Tokenization: split punctuation marks from words and find sentence periods.
# 2) Tagging: find the part-of-speech tag of each word (noun, verb, ...) in a sentence.
# 3) Chunking: find words that belong together in a phrase.
# 4) Role labeling: find the subject and object of the sentence.
# 5) Lemmatization: find the base form of each word ("was" => "is").
# WORD TAG CHUNK PNP ROLE LEMMA
# ------------------------------------------------------------------
# The DT B-NP O NP-SBJ-1 the
# black JJ I-NP O NP-SBJ-1 black
# cat NN I-NP O NP-SBJ-1 cat
# sat VB B-VP O VP-1 sit
# on IN B-PP B-PNP PP-LOC on
# the DT B-NP I-PNP NP-OBJ-1 the
# mat NN I-NP I-PNP NP-OBJ-1 mat
# . . O O O .
# The example demonstrates what information can be retrieved:
#
# - the period is split from "mat." = the end of the sentence,
# - the words are annotated: NN (noun), VB (verb), JJ (adjective), DT (determiner), ...
# - the phrases are annotated: NP (noun phrase), VP (verb phrase), PNP (preposition), ...
# - the phrases are labeled: SBJ (subject), OBJ (object), LOC (location), ...
# - the phrase start is marked: B (begin), I (inside), O (outside),
# - the past tense "sat" is lemmatized => "sit".
# By default, the English parser uses the Penn Treebank II tagset:
# http://www.clips.ua.ac.be/pages/penn-treebank-tagset
PTB = PENN = "penn"
class Parser:
def __init__(self, lexicon=None, default=("NN", "NNP", "CD"), language=None):
"""A simple shallow parser using a Brill-based part-of-speech tagger.
The given lexicon is a dictionary of known words and their part-of-speech tag.
The given default tags are used for unknown words.
Unknown words that start with a capital letter are tagged NNP (except for German).
Unknown words that contain only digits and punctuation are tagged CD.
The given language can be used to discern between
Germanic and Romance languages for phrase chunking.
"""
if lexicon is None:
lexicon = {}
self.lexicon = lexicon
self.default = default
self.language = language
def find_tokens(self, string, **kwargs):
"""Returns a list of sentences from the given string.
Punctuation marks are separated from each word by a space.
"""
# "The cat purs." => ["The cat purs ."]
return find_tokens(
str(string),
punctuation=kwargs.get("punctuation", PUNCTUATION),
abbreviations=kwargs.get("abbreviations", ABBREVIATIONS),
replace=kwargs.get("replace", replacements),
linebreak=r"\n{2,}",
)
def find_tags(self, tokens, **kwargs):
"""Annotates the given list of tokens with part-of-speech tags.
Returns a list of tokens, where each token is now a [word, tag]-list.
"""
# ["The", "cat", "purs"] => [["The", "DT"], ["cat", "NN"], ["purs", "VB"]]
return find_tags(
tokens,
language=kwargs.get("language", self.language),
lexicon=kwargs.get("lexicon", self.lexicon),
default=kwargs.get("default", self.default),
map=kwargs.get("map", None),
)
def find_chunks(self, tokens, **kwargs):
"""Annotates the given list of tokens with chunk tags.
Several tags can be added, for example chunk + preposition tags.
"""
# [["The", "DT"], ["cat", "NN"], ["purs", "VB"]] =>
# [["The", "DT", "B-NP"], ["cat", "NN", "I-NP"], ["purs", "VB", "B-VP"]]
return find_prepositions(
find_chunks(tokens, language=kwargs.get("language", self.language))
)
def find_prepositions(self, tokens, **kwargs):
"""Annotates the given list of tokens with prepositional noun phrase tags."""
return find_prepositions(tokens) # See also Parser.find_chunks().
def find_labels(self, tokens, **kwargs):
"""Annotates the given list of tokens with verb/predicate tags."""
return find_relations(tokens)
def find_lemmata(self, tokens, **kwargs):
"""Annotates the given list of tokens with word lemmata."""
return [token + [token[0].lower()] for token in tokens]
def parse(
self,
s,
tokenize=True,
tags=True,
chunks=True,
relations=False,
lemmata=False,
encoding="utf-8",
**kwargs,
):
"""Takes a string (sentences) and returns a tagged Unicode string (TaggedString).
Sentences in the output are separated by newlines.
With tokenize=True, punctuation is split from words and sentences are separated by \n.
With tags=True, part-of-speech tags are parsed (NN, VB, IN, ...).
With chunks=True, phrase chunk tags are parsed (NP, VP, PP, PNP, ...).
With relations=True, semantic role labels are parsed (SBJ, OBJ).
With lemmata=True, word lemmata are parsed.
Optional parameters are passed to
the tokenizer, tagger, chunker, labeler and lemmatizer.
"""
# Tokenizer.
if tokenize:
s = self.find_tokens(s, **kwargs)
if isinstance(s, (list, tuple)):
s = [isinstance(s, basestring) and s.split(" ") or s for s in s]
if isinstance(s, basestring):
s = [s.split(" ") for s in s.split("\n")]
# Unicode.
for i in range(len(s)):
for j in range(len(s[i])):
if isinstance(s[i][j], bytes):
s[i][j] = decode_string(s[i][j], encoding)
# Tagger (required by chunker, labeler & lemmatizer).
if tags or chunks or relations or lemmata:
s[i] = self.find_tags(s[i], **kwargs)
else:
s[i] = [[w] for w in s[i]]
# Chunker.
if chunks or relations:
s[i] = self.find_chunks(s[i], **kwargs)
# Labeler.
if relations:
s[i] = self.find_labels(s[i], **kwargs)
# Lemmatizer.
if lemmata:
s[i] = self.find_lemmata(s[i], **kwargs)
# Slash-formatted tagged string.
# With collapse=False (or split=True), returns raw list
# (this output is not usable by tree.Text).
if not kwargs.get("collapse", True) or kwargs.get("split", False):
return s
# Construct TaggedString.format.
# (this output is usable by tree.Text).
format = ["word"]
if tags:
format.append("part-of-speech")
if chunks:
format.extend(("chunk", "preposition"))
if relations:
format.append("relation")
if lemmata:
format.append("lemma")
# Collapse raw list.
# Sentences are separated by newlines, tokens by spaces, tags by slashes.
# Slashes in words are encoded with &slash;
for i in range(len(s)):
for j in range(len(s[i])):
s[i][j][0] = s[i][j][0].replace("/", "&slash;")
s[i][j] = "/".join(s[i][j])
s[i] = " ".join(s[i])
s = "\n".join(s)
s = TaggedString(
str(s), format, language=kwargs.get("language", self.language)
)
return s
# --- TAGGED STRING ---------------------------------------------------------------------------------
# Pattern.parse() returns a TaggedString: a Unicode string with "tags" and "language" attributes.
# The pattern.text.tree.Text class uses this attribute to determine the token format and
# transform the tagged string to a parse tree of nested Sentence, Chunk and Word objects.
TOKENS = "tokens"
class TaggedString(str):
def __new__(cls, string, tags=None, language=None):
"""Unicode string with tags and language attributes.
For example: TaggedString("cat/NN/NP", tags=["word", "pos", "chunk"]).
"""
# From a TaggedString:
if tags is None:
tags = ["word"]
if isinstance(string, str) and hasattr(string, "tags"):
tags, language = string.tags, string.language
# From a TaggedString.split(TOKENS) list:
if isinstance(string, list):
string = [
[[x.replace("/", "&slash;") for x in token] for token in s]
for s in string
]
string = "\n".join(" ".join("/".join(token) for token in s) for s in string)
s = str.__new__(cls, string)
s.tags = list(tags)
s.language = language
return s
def split(self, sep=TOKENS):
"""Returns a list of sentences, where each sentence is a list of tokens,
where each token is a list of word + tags.
"""
if sep != TOKENS:
return str.split(self, sep)
if len(self) == 0:
return []
return [
[
[x.replace("&slash;", "/") for x in token.split("/")]
for token in sentence.split(" ")
]
for sentence in str.split(self, "\n")
]
#### SPELLING CORRECTION ###########################################################################
# Based on: Peter Norvig, "How to Write a Spelling Corrector", http://norvig.com/spell-correct.html
class Spelling(lazydict):
ALPHA = "abcdefghijklmnopqrstuvwxyz"
def __init__(self, path=""):
self._path = path
def load(self):
for x in _read(self._path):
x = x.split()
dict.__setitem__(self, x[0], int(x[1]))
@property
def path(self):
return self._path
@property
def language(self):
return self._language
@classmethod
def train(cls, s, path="spelling.txt"):
"""Counts the words in the given string and saves the probabilities at the given path.
This can be used to generate a new model for the Spelling() constructor.
"""
model = {}
for w in re.findall("[a-z]+", s.lower()):
model[w] = w in model and model[w] + 1 or 1
model = (f"{k} {v}" for k, v in sorted(model.items()))
model = "\n".join(model)
f = open(path, "w")
f.write(model)
f.close()
def _edit1(self, w):
"""Returns a set of words with edit distance 1 from the given word."""
# Of all spelling errors, 80% is covered by edit distance 1.
# Edit distance 1 = one character deleted, swapped, replaced or inserted.
split = [(w[:i], w[i:]) for i in range(len(w) + 1)]
delete, transpose, replace, insert = (
[a + b[1:] for a, b in split if b],
[a + b[1] + b[0] + b[2:] for a, b in split if len(b) > 1],
[a + c + b[1:] for a, b in split for c in Spelling.ALPHA if b],
[a + c + b[0:] for a, b in split for c in Spelling.ALPHA],
)
return set(delete + transpose + replace + insert)
def _edit2(self, w):
"""Returns a set of words with edit distance 2 from the given word"""
# Of all spelling errors, 99% is covered by edit distance 2.
# Only keep candidates that are actually known words (20% speedup).
return set(e2 for e1 in self._edit1(w) for e2 in self._edit1(e1) if e2 in self)
def _known(self, words=None):
"""Returns the given list of words filtered by known words."""
if words is None:
words = []
return set(w for w in words if w in self)
def suggest(self, w):
"""Return a list of (word, confidence) spelling corrections for the given word,
based on the probability of known words with edit distance 1-2 from the given word.
"""
if len(self) == 0:
self.load()
if len(w) == 1:
return [(w, 1.0)] # I
if w in PUNCTUATION:
return [(w, 1.0)] # .?!
if w in string.whitespace:
return [(w, 1.0)] # \n
if w.replace(".", "").isdigit():
return [(w, 1.0)] # 1.5
candidates = (
self._known([w])
or self._known(self._edit1(w))
or self._known(self._edit2(w))
or [w]
)
candidates = [(self.get(c, 0.0), c) for c in candidates]
s = float(sum(p for p, word in candidates) or 1)
candidates = sorted(((p / s, word) for p, word in candidates), reverse=True)
if w.istitle(): # Preserve capitalization
candidates = [(word.title(), p) for p, word in candidates]
else:
candidates = [(word, p) for p, word in candidates]
return candidates
================================================
FILE: src/textblob/base.py
================================================
"""Abstract base classes for models (taggers, noun phrase extractors, etc.)
which define the interface for descendant classes.
.. versionchanged:: 0.7.0
All base classes are defined in the same module, ``textblob.base``.
"""
from __future__ import annotations
from abc import ABCMeta, abstractmethod
from typing import TYPE_CHECKING
import nltk
if TYPE_CHECKING:
from typing import Any, AnyStr
##### POS TAGGERS #####
class BaseTagger(metaclass=ABCMeta):
"""Abstract tagger class from which all taggers
inherit from. All descendants must implement a
``tag()`` method.
"""
@abstractmethod
def tag(self, text: str, tokenize=True) -> list[tuple[str, str]]:
"""Return a list of tuples of the form (word, tag)
for a given set of text or BaseBlob instance.
"""
...
##### NOUN PHRASE EXTRACTORS #####
class BaseNPExtractor(metaclass=ABCMeta):
"""Abstract base class from which all NPExtractor classes inherit.
Descendant classes must implement an ``extract(text)`` method
that returns a list of noun phrases as strings.
"""
@abstractmethod
def extract(self, text: str) -> list[str]:
"""Return a list of noun phrases (strings) for a body of text."""
...
##### TOKENIZERS #####
class BaseTokenizer(nltk.tokenize.api.TokenizerI, metaclass=ABCMeta): # pyright: ignore
"""Abstract base class from which all Tokenizer classes inherit.
Descendant classes must implement a ``tokenize(text)`` method
that returns a list of noun phrases as strings.
"""
@abstractmethod
def tokenize(self, text: str) -> list[str]:
"""Return a list of tokens (strings) for a body of text.
:rtype: list
"""
...
def itokenize(self, text: str, *args, **kwargs):
"""Return a generator that generates tokens "on-demand".
.. versionadded:: 0.6.0
:rtype: generator
"""
return (t for t in self.tokenize(text, *args, **kwargs))
##### SENTIMENT ANALYZERS ####
DISCRETE = "ds"
CONTINUOUS = "co"
class BaseSentimentAnalyzer(metaclass=ABCMeta):
"""Abstract base class from which all sentiment analyzers inherit.
Should implement an ``analyze(text)`` method which returns either the
results of analysis.
"""
_trained: bool
kind = DISCRETE
def __init__(self):
self._trained = False
def train(self):
# Train me
self._trained = True
@abstractmethod
def analyze(self, text) -> Any:
"""Return the result of of analysis. Typically returns either a
tuple, float, or dictionary.
"""
# Lazily train the classifier
if not self._trained:
self.train()
# Analyze text
return None
##### PARSERS #####
class BaseParser(metaclass=ABCMeta):
"""Abstract parser class from which all parsers inherit from. All
descendants must implement a ``parse()`` method.
"""
@abstractmethod
def parse(self, text: AnyStr):
"""Parses the text."""
...
================================================
FILE: src/textblob/blob.py
================================================
"""Wrappers for various units of text, including the main
:class:`TextBlob <textblob.blob.TextBlob>`, :class:`Word <textblob.blob.Word>`,
and :class:`WordList <textblob.blob.WordList>` classes.
Example usage: ::
>>> from textblob import TextBlob
>>> b = TextBlob("Simple is better than complex.")
>>> b.tags
[(u'Simple', u'NN'), (u'is', u'VBZ'), (u'better', u'JJR'), (u'than', u'IN'), (u'complex', u'NN')]
>>> b.noun_phrases
WordList([u'simple'])
>>> b.words
WordList([u'Simple', u'is', u'better', u'than', u'complex'])
>>> b.sentiment
(0.06666666666666667, 0.41904761904761906)
>>> b.words[0].synsets()[0]
Synset('simple.n.01')
.. versionchanged:: 0.8.0
These classes are now imported from ``textblob`` rather than ``text.blob``.
""" # noqa: E501
import json
import sys
from collections import defaultdict
import nltk
from textblob.base import (
BaseNPExtractor,
BaseParser,
BaseSentimentAnalyzer,
BaseTagger,
BaseTokenizer,
)
from textblob.decorators import cached_property, requires_nltk_corpus
from textblob.en import suggest
from textblob.inflect import pluralize as _pluralize
from textblob.inflect import singularize as _singularize
from textblob.mixins import BlobComparableMixin, StringlikeMixin
from textblob.np_extractors import FastNPExtractor
from textblob.parsers import PatternParser
from textblob.sentiments import PatternAnalyzer
from textblob.taggers import NLTKTagger
from textblob.tokenizers import WordTokenizer, sent_tokenize, word_tokenize
from textblob.utils import PUNCTUATION_REGEX, lowerstrip
# Wordnet interface
# NOTE: textblob.wordnet is not imported so that the wordnet corpus can be lazy-loaded
_wordnet = nltk.corpus.wordnet
basestring = (str, bytes)
def _penn_to_wordnet(tag):
"""Converts a Penn corpus tag into a Wordnet tag."""
if tag in ("NN", "NNS", "NNP", "NNPS"):
return _wordnet.NOUN
if tag in ("JJ", "JJR", "JJS"):
return _wordnet.ADJ
if tag in ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
return _wordnet.VERB
if tag in ("RB", "RBR", "RBS"):
return _wordnet.ADV
return None
class Word(str):
"""A simple word representation. Includes methods for inflection,
and WordNet integration.
"""
def __new__(cls, string, pos_tag=None):
"""Return a new instance of the class. It is necessary to override
this method in order to handle the extra pos_tag argument in the
constructor.
"""
return super().__new__(cls, string)
def __init__(self, string, pos_tag=None):
self.string = string
self.pos_tag = pos_tag
def __repr__(self):
return repr(self.string)
def __str__(self):
return self.string
def singularize(self):
"""Return the singular version of the word as a string."""
return Word(_singularize(self.string))
def pluralize(self):
"""Return the plural version of the word as a string."""
return Word(_pluralize(self.string))
def spellcheck(self):
"""Return a list of (word, confidence) tuples of spelling corrections.
Based on: Peter Norvig, "How to Write a Spelling Corrector"
(http://norvig.com/spell-correct.html) as implemented in the pattern
library.
.. versionadded:: 0.6.0
"""
return suggest(self.string)
def correct(self):
"""Correct the spelling of the word. Returns the word with the highest
confidence using the spelling corrector.
.. versionadded:: 0.6.0
"""
return Word(self.spellcheck()[0][0])
@cached_property
@requires_nltk_corpus
def lemma(self):
"""Return the lemma of this word using Wordnet's morphy function."""
return self.lemmatize(pos=self.pos_tag)
@requires_nltk_corpus
def lemmatize(self, pos=None):
"""Return the lemma for a word using WordNet's morphy function.
:param pos: Part of speech to filter upon. If `None`, defaults to
``_wordnet.NOUN``.
.. versionadded:: 0.8.1
"""
if pos is None:
tag = _wordnet.NOUN
elif pos in _wordnet._FILEMAP.keys():
tag = pos
else:
tag = _penn_to_wordnet(pos)
lemmatizer = nltk.stem.WordNetLemmatizer()
return lemmatizer.lemmatize(self.string, tag)
PorterStemmer = nltk.stem.PorterStemmer()
LancasterStemmer = nltk.stem.LancasterStemmer()
SnowballStemmer = nltk.stem.SnowballStemmer("english")
# added 'stemmer' on lines of lemmatizer
# based on nltk
def stem(self, stemmer=PorterStemmer):
"""Stem a word using various NLTK stemmers. (Default: Porter Stemmer)
.. versionadded:: 0.12.0
"""
return stemmer.stem(self.string)
@cached_property
def synsets(self):
"""The list of Synset objects for this Word.
:rtype: list of Synsets
.. versionadded:: 0.7.0
"""
return self.get_synsets(pos=None)
@cached_property
def definitions(self):
"""The list of definitions for this word. Each definition corresponds
to a synset.
.. versionadded:: 0.7.0
"""
return self.define(pos=None)
def get_synsets(self, pos=None):
"""Return a list of Synset objects for this word.
:param pos: A part-of-speech tag to filter upon. If ``None``, all
synsets for all parts of speech will be loaded.
:rtype: list of Synsets
.. versionadded:: 0.7.0
"""
return _wordnet.synsets(self.string, pos)
def define(self, pos=None):
"""Return a list of definitions for this word. Each definition
corresponds to a synset for this word.
:param pos: A part-of-speech tag to filter upon. If ``None``, definitions
for all parts of speech will be loaded.
:rtype: List of strings
.. versionadded:: 0.7.0
"""
return [syn.definition() for syn in self.get_synsets(pos=pos)]
class WordList(list):
"""A list-like collection of words."""
def __init__(self, collection):
"""Initialize a WordList. Takes a collection of strings as
its only argument.
"""
super().__init__([Word(w) for w in collection])
def __str__(self):
"""Returns a string representation for printing."""
return super().__repr__()
def __repr__(self):
"""Returns a string representation for debugging."""
class_name = self.__class__.__name__
return f"{class_name}({super().__repr__()})"
def __getitem__(self, key):
"""Returns a string at the given index."""
item = super().__getitem__(key)
if isinstance(key, slice):
return self.__class__(item)
else:
return item
def __getslice__(self, i, j):
# This is included for Python 2.* compatibility
return self.__class__(super().__getslice__(i, j))
def __setitem__(self, index, obj):
"""Places object at given index, replacing existing item. If the object
is a string, inserts a :class:`Word <Word>` object.
"""
if isinstance(obj, basestring):
super().__setitem__(index, Word(obj))
else:
super().__setitem__(index, obj)
def count(self, strg, case_sensitive=False, *args, **kwargs):
"""Get the count of a word or phrase `s` within this WordList.
:param strg: The string to count.
:param case_sensitive: A boolean, whether or not the search is case-sensitive.
"""
if not case_sensitive:
return [word.lower() for word in self].count(strg.lower(), *args, **kwargs)
return super().count(strg, *args, **kwargs)
def append(self, obj):
"""Append an object to end. If the object is a string, appends a
:class:`Word <Word>` object.
"""
if isinstance(obj, basestring):
super().append(Word(obj))
else:
super().append(obj)
def extend(self, iterable):
"""Extend WordList by appending elements from ``iterable``. If an element
is a string, appends a :class:`Word <Word>` object.
"""
for e in iterable:
self.append(e)
def upper(self):
"""Return a new WordList with each word upper-cased."""
return self.__class__([word.upper() for word in self])
def lower(self):
"""Return a new WordList with each word lower-cased."""
return self.__class__([word.lower() for word in self])
def singularize(self):
"""Return the single version of each word in this WordList."""
return self.__class__([word.singularize() for word in self])
def pluralize(self):
"""Return the plural version of each word in this WordList."""
return self.__class__([word.pluralize() for word in self])
def lemmatize(self):
"""Return the lemma of each word in this WordList."""
return self.__class__([word.lemmatize() for word in self])
def stem(self, *args, **kwargs):
"""Return the stem for each word in this WordList."""
return self.__class__([word.stem(*args, **kwargs) for word in self])
def _validated_param(obj, name, base_class, default, base_class_name=None):
"""Validates a parameter passed to __init__. Makes sure that obj is
the correct class. Return obj if it's not None or falls back to default
:param obj: The object passed in.
:param name: The name of the parameter.
:param base_class: The class that obj must inherit from.
:param default: The default object to fall back upon if obj is None.
"""
base_class_name = base_class_name if base_class_name else base_class.__name__
if obj is not None and not isinstance(obj, base_class):
raise ValueError(f"{name} must be an instance of {base_class_name}")
return obj or default
def _initialize_models(
obj, tokenizer, pos_tagger, np_extractor, analyzer, parser, classifier
):
"""Common initialization between BaseBlob and Blobber classes."""
# tokenizer may be a textblob or an NLTK tokenizer
obj.tokenizer = _validated_param(
tokenizer,
"tokenizer",
base_class=(BaseTokenizer, nltk.tokenize.api.TokenizerI), # pyright: ignore
default=BaseBlob.tokenizer,
base_class_name="BaseTokenizer",
)
obj.np_extractor = _validated_param(
np_extractor,
"np_extractor",
base_class=BaseNPExtractor,
default=BaseBlob.np_extractor,
)
obj.pos_tagger = _validated_param(
pos_tagger, "pos_tagger", BaseTagger, BaseBlob.pos_tagger
)
obj.analyzer = _validated_param(
analyzer, "analyzer", BaseSentimentAnalyzer, BaseBlob.analyzer
)
obj.parser = _validated_param(parser, "parser", BaseParser, BaseBlob.parser)
obj.classifier = classifier
class BaseBlob(StringlikeMixin, BlobComparableMixin):
"""An abstract base class that all textblob classes will inherit from.
Includes words, POS tag, NP, and word count properties. Also includes
basic dunder and string methods for making objects like Python strings.
:param text: A string.
:param tokenizer: (optional) A tokenizer instance. If ``None``,
defaults to :class:`WordTokenizer() <textblob.tokenizers.WordTokenizer>`.
:param np_extractor: (optional) An NPExtractor instance. If ``None``,
defaults to :class:`FastNPExtractor() <textblob.en.np_extractors.FastNPExtractor>`.
:param pos_tagger: (optional) A Tagger instance. If ``None``,
defaults to :class:`NLTKTagger <textblob.en.taggers.NLTKTagger>`.
:param analyzer: (optional) A sentiment analyzer. If ``None``,
defaults to :class:`PatternAnalyzer <textblob.en.sentiments.PatternAnalyzer>`.
:param parser: A parser. If ``None``, defaults to
:class:`PatternParser <textblob.en.parsers.PatternParser>`.
:param classifier: A classifier.
.. versionchanged:: 0.6.0
``clean_html`` parameter deprecated, as it was in NLTK.
""" # noqa: E501
np_extractor = FastNPExtractor()
pos_tagger = NLTKTagger()
tokenizer = WordTokenizer()
analyzer = PatternAnalyzer()
parser = PatternParser()
def __init__(
self,
text,
tokenizer=None,
pos_tagger=None,
np_extractor=None,
analyzer=None,
parser=None,
classifier=None,
clean_html=False,
):
if not isinstance(text, basestring):
raise TypeError(
"The `text` argument passed to `__init__(text)` "
f"must be a string, not {type(text)}"
)
if clean_html:
raise NotImplementedError(
"clean_html has been deprecated. "
"To remove HTML markup, use BeautifulSoup's "
"get_text() function"
)
self.raw = self.string = text
self.stripped = lowerstrip(self.raw, all=True)
_initialize_models(
self, tokenizer, pos_tagger, np_extractor, analyzer, parser, classifier
)
@cached_property
def words(self):
"""Return a list of word tokens. This excludes punctuation characters.
If you want to include punctuation characters, access the ``tokens``
property.
:returns: A :class:`WordList <WordList>` of word tokens.
"""
return WordList(word_tokenize(self.raw, include_punc=False))
@cached_property
def tokens(self):
"""Return a list of tokens, using this blob's tokenizer object
(defaults to :class:`WordTokenizer <textblob.tokenizers.WordTokenizer>`).
"""
return WordList(self.tokenizer.tokenize(self.raw))
def tokenize(self, tokenizer=None):
"""Return a list of tokens, using ``tokenizer``.
:param tokenizer: (optional) A tokenizer object. If None, defaults to
this blob's default tokenizer.
"""
t = tokenizer if tokenizer is not None else self.tokenizer
return WordList(t.tokenize(self.raw))
def parse(self, parser=None):
"""Parse the text.
:param parser: (optional) A parser instance. If ``None``, defaults to
this blob's default parser.
.. versionadded:: 0.6.0
"""
p = parser if parser is not None else self.parser
return p.parse(self.raw)
def classify(self):
"""Classify the blob using the blob's ``classifier``."""
if self.classifier is None:
raise NameError("This blob has no classifier. Train one first!")
return self.classifier.classify(self.raw)
@cached_property
def sentiment(self):
"""Return a tuple of form (polarity, subjectivity ) where polarity
is a float within the range [-1.0, 1.0] and subjectivity is a float
within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is
very subjective.
:rtype: namedtuple of the form ``Sentiment(polarity, subjectivity)``
"""
return self.analyzer.analyze(self.raw)
@cached_property
def sentiment_assessments(self):
"""Return a tuple of form (polarity, subjectivity, assessments ) where
polarity is a float within the range [-1.0, 1.0], subjectivity is a
float within the range [0.0, 1.0] where 0.0 is very objective and 1.0
is very subjective, and assessments is a list of polarity and
subjectivity scores for the assessed tokens.
:rtype: namedtuple of the form ``Sentiment(polarity, subjectivity,
assessments)``
"""
return self.analyzer.analyze(self.raw, keep_assessments=True)
@cached_property
def polarity(self):
"""Return the polarity score as a float within the range [-1.0, 1.0]
:rtype: float
"""
return PatternAnalyzer().analyze(self.raw)[0]
@cached_property
def subjectivity(self):
"""Return the subjectivity score as a float within the range [0.0, 1.0]
where 0.0 is very objective and 1.0 is very subjective.
:rtype: float
"""
return PatternAnalyzer().analyze(self.raw)[1]
@cached_property
def noun_phrases(self):
"""Returns a list of noun phrases for this blob."""
return WordList(
[
phrase.strip().lower()
for phrase in self.np_extractor.extract(self.raw)
if len(phrase) > 1
]
)
@cached_property
def pos_tags(self):
"""Returns an list of tuples of the form (word, POS tag).
Example:
::
[
("At", "IN"),
("eight", "CD"),
("o'clock", "JJ"),
("on", "IN"),
("Thursday", "NNP"),
("morning", "NN"),
]
:rtype: list of tuples
"""
if isinstance(self, TextBlob):
return [
val
for sublist in [s.pos_tags for s in self.sentences]
for val in sublist
]
else:
return [
(Word(str(word), pos_tag=t), str(t))
for word, t in self.pos_tagger.tag(self)
if not PUNCTUATION_REGEX.match(str(t))
]
tags = pos_tags
@cached_property
def word_counts(self):
"""Dictionary of word frequencies in this text."""
counts = defaultdict(int)
stripped_words = [lowerstrip(word) for word in self.words]
for word in stripped_words:
counts[word] += 1
return counts
@cached_property
def np_counts(self):
"""Dictionary of noun phrase frequencies in this text."""
counts = defaultdict(int)
for phrase in self.noun_phrases:
counts[phrase] += 1
return counts
def ngrams(self, n=3):
"""Return a list of n-grams (tuples of n successive words) for this
blob.
:rtype: List of :class:`WordLists <WordList>`
"""
if n <= 0:
return []
grams = [
WordList(self.words[i : i + n]) for i in range(len(self.words) - n + 1)
]
return grams
def correct(self):
"""Attempt to correct the spelling of a blob.
.. versionadded:: 0.6.0
:rtype: :class:`BaseBlob <BaseBlob>`
"""
# regex matches: word or punctuation or whitespace
tokens = nltk.tokenize.regexp_tokenize(self.raw, r"\w+|[^\w\s]|\s")
corrected = (Word(w).correct() for w in tokens)
ret = "".join(corrected)
return self.__class__(ret)
def _cmpkey(self):
"""Key used by ComparableMixin to implement all rich comparison
operators.
"""
return self.raw
def _strkey(self):
"""Key used by StringlikeMixin to implement string methods."""
return self.raw
def __hash__(self):
return hash(self._cmpkey())
def __add__(self, other):
"""Concatenates two text objects the same way Python strings are
concatenated.
Arguments:
- `other`: a string or a text object
"""
if isinstance(other, basestring):
return self.__class__(self.raw + other)
elif isinstance(other, BaseBlob):
return self.__class__(self.raw + other.raw)
else:
raise TypeError(
f"Operands must be either strings or {self.__class__.__name__} objects"
)
def split(self, sep=None, maxsplit=sys.maxsize):
"""Behaves like the built-in str.split() except returns a
WordList.
:rtype: :class:`WordList <WordList>`
"""
return WordList(self._strkey().split(sep, maxsplit))
class TextBlob(BaseBlob):
"""A general text block, meant for larger bodies of text (esp. those
containing sentences). Inherits from :class:`BaseBlob <BaseBlob>`.
:param str text: A string.
:param tokenizer: (optional) A tokenizer instance. If ``None``, defaults to
:class:`WordTokenizer() <textblob.tokenizers.WordTokenizer>`.
:param np_extractor: (optional) An NPExtractor instance. If ``None``,
defaults to :class:`FastNPExtractor() <textblob.en.np_extractors.FastNPExtractor>`.
:param pos_tagger: (optional) A Tagger instance. If ``None``, defaults to
:class:`NLTKTagger <textblob.en.taggers.NLTKTagger>`.
:param analyzer: (optional) A sentiment analyzer. If ``None``, defaults to
:class:`PatternAnalyzer <textblob.en.sentiments.PatternAnalyzer>`.
:param classifier: (optional) A classifier.
""" # noqa: E501
@cached_property
def sentences(self):
"""Return list of :class:`Sentence <Sentence>` objects."""
return self._create_sentence_objects()
@cached_property
def words(self):
"""Return a list of word tokens. This excludes punctuation characters.
If you want to include punctuation characters, access the ``tokens``
property.
:returns: A :class:`WordList <WordList>` of word tokens.
"""
return WordList(word_tokenize(self.raw, include_punc=False))
@property
def raw_sentences(self):
"""List of strings, the raw sentences in the blob."""
return [sentence.raw for sentence in self.sentences]
@property
def serialized(self):
"""Returns a list of each sentence's dict representation."""
return [sentence.dict for sentence in self.sentences]
def to_json(self, *args, **kwargs):
"""Return a json representation (str) of this blob.
Takes the same arguments as json.dumps.
.. versionadded:: 0.5.1
"""
return json.dumps(self.serialized, *args, **kwargs)
@property
def json(self):
"""The json representation of this blob.
.. versionchanged:: 0.5.1
Made ``json`` a property instead of a method to restore backwards
compatibility that was broken after version 0.4.0.
"""
return self.to_json()
def _create_sentence_objects(self):
"""Returns a list of Sentence objects from the raw text."""
sentence_objects = []
sentences = sent_tokenize(self.raw)
char_index = 0 # Keeps track of character index within the blob
for sent in sentences:
# Compute the start and end indices of the sentence
# within the blob
start_index = self.raw.index(sent, char_index)
char_index += len(sent)
end_index = start_index + len(sent)
# Sentences share the same models as their parent blob
s = Sentence(
sent,
start_index=start_index,
end_index=end_index,
tokenizer=self.tokenizer,
np_extractor=self.np_extractor,
pos_tagger=self.pos_tagger,
analyzer=self.analyzer,
parser=self.parser,
classifier=self.classifier,
)
sentence_objects.append(s)
return sentence_objects
class Sentence(BaseBlob):
"""A sentence within a TextBlob. Inherits from :class:`BaseBlob <BaseBlob>`.
:param sentence: A string, the raw sentence.
:param start_index: An int, the index where this sentence begins
in a TextBlob. If not given, defaults to 0.
:param end_index: An int, the index where this sentence ends in
a TextBlob. If not given, defaults to the
length of the sentence - 1.
"""
def __init__(self, sentence, start_index=0, end_index=None, *args, **kwargs):
super().__init__(sentence, *args, **kwargs)
#: The start index within a TextBlob
self.start = self.start_index = start_index
#: The end index within a textBlob
gitextract_d9nsa0ec/ ├── .github/ │ ├── dependabot.yml │ └── workflows/ │ └── build-release.yml ├── .gitignore ├── .konchrc ├── .pre-commit-config.yaml ├── .readthedocs.yml ├── AUTHORS.rst ├── CHANGELOG.rst ├── CONTRIBUTING.rst ├── LICENSE ├── NOTICE ├── README.rst ├── RELEASING.md ├── SECURITY.md ├── docs/ │ ├── Makefile │ ├── _templates/ │ │ ├── side-primary.html │ │ └── side-secondary.html │ ├── _themes/ │ │ ├── .gitignore │ │ ├── LICENSE │ │ ├── flask_theme_support.py │ │ ├── kr/ │ │ │ ├── layout.html │ │ │ ├── relations.html │ │ │ ├── static/ │ │ │ │ ├── flasky.css_t │ │ │ │ └── small_flask.css │ │ │ └── theme.conf │ │ └── kr_small/ │ │ ├── layout.html │ │ ├── static/ │ │ │ └── flasky.css_t │ │ └── theme.conf │ ├── advanced_usage.rst │ ├── api_reference.rst │ ├── authors.rst │ ├── changelog.rst │ ├── classifiers.rst │ ├── conf.py │ ├── contributing.rst │ ├── extensions.rst │ ├── index.rst │ ├── install.rst │ ├── license.rst │ ├── make.bat │ └── quickstart.rst ├── pyproject.toml ├── src/ │ └── textblob/ │ ├── __init__.py │ ├── _text.py │ ├── base.py │ ├── blob.py │ ├── classifiers.py │ ├── decorators.py │ ├── download_corpora.py │ ├── en/ │ │ ├── __init__.py │ │ ├── en-context.txt │ │ ├── en-entities.txt │ │ ├── en-lexicon.txt │ │ ├── en-morphology.txt │ │ ├── en-sentiment.xml │ │ ├── en-spelling.txt │ │ ├── inflect.py │ │ ├── np_extractors.py │ │ ├── parsers.py │ │ ├── sentiments.py │ │ └── taggers.py │ ├── exceptions.py │ ├── formats.py │ ├── inflect.py │ ├── mixins.py │ ├── np_extractors.py │ ├── parsers.py │ ├── sentiments.py │ ├── taggers.py │ ├── tokenizers.py │ ├── utils.py │ └── wordnet.py ├── tests/ │ ├── __init__.py │ ├── data.csv │ ├── data.json │ ├── data.tsv │ ├── test_blob.py │ ├── test_classifiers.py │ ├── test_decorators.py │ ├── test_formats.py │ ├── test_inflect.py │ ├── test_np_extractor.py │ ├── test_parsers.py │ ├── test_sentiments.py │ ├── test_taggers.py │ ├── test_tokenizers.py │ └── test_utils.py └── tox.ini
SYMBOL INDEX (629 symbols across 29 files)
FILE: docs/_themes/flask_theme_support.py
class FlaskyStyle (line 19) | class FlaskyStyle(Style):
FILE: src/textblob/_text.py
function decode_string (line 34) | def decode_string(v, encoding="utf-8"):
function encode_string (line 48) | def encode_string(v, encoding="utf-8"):
function isnumeric (line 66) | def isnumeric(strg):
class lazydict (line 79) | class lazydict(dict):
method load (line 80) | def load(self):
method _lazy (line 85) | def _lazy(self, method, *args):
method __repr__ (line 94) | def __repr__(self):
method __len__ (line 97) | def __len__(self):
method __iter__ (line 100) | def __iter__(self):
method __contains__ (line 103) | def __contains__(self, *args):
method __getitem__ (line 106) | def __getitem__(self, *args):
method __setitem__ (line 109) | def __setitem__(self, *args):
method setdefault (line 112) | def setdefault(self, *args):
method get (line 115) | def get(self, *args, **kwargs):
method items (line 118) | def items(self):
method keys (line 121) | def keys(self):
method values (line 124) | def values(self):
method update (line 127) | def update(self, *args, **kwargs):
method pop (line 130) | def pop(self, *args):
method popitem (line 133) | def popitem(self, *args):
class lazylist (line 137) | class lazylist(list):
method load (line 138) | def load(self):
method _lazy (line 143) | def _lazy(self, method, *args):
method __repr__ (line 152) | def __repr__(self):
method __len__ (line 155) | def __len__(self):
method __iter__ (line 158) | def __iter__(self):
method __contains__ (line 161) | def __contains__(self, *args):
method insert (line 164) | def insert(self, *args):
method append (line 167) | def append(self, *args):
method extend (line 170) | def extend(self, *args):
method remove (line 173) | def remove(self, *args):
method pop (line 176) | def pop(self, *args):
function penntreebank2universal (line 213) | def penntreebank2universal(token, tag):
function find_tokens (line 350) | def find_tokens(
function _read (line 459) | def _read(path, encoding="utf-8", comment=";;;"):
class Lexicon (line 489) | class Lexicon(lazydict):
method __init__ (line 490) | def __init__(
method load (line 508) | def load(self):
method path (line 513) | def path(self):
method language (line 517) | def language(self):
class Rules (line 527) | class Rules:
method __init__ (line 528) | def __init__(self, lexicon=None, cmd=None):
method apply (line 535) | def apply(self, x):
class Morphology (line 540) | class Morphology(lazylist, Rules):
method __init__ (line 541) | def __init__(self, lexicon=None, path=""):
method path (line 562) | def path(self):
method load (line 565) | def load(self):
method apply (line 569) | def apply(self, token, previous=(None, None), next=(None, None)):
method insert (line 601) | def insert(self, i, tag, affix, cmd="hassuf", tagged=None):
method append (line 617) | def append(self, *args, **kwargs):
method extend (line 620) | def extend(self, rules=None):
class Context (line 632) | class Context(lazylist, Rules):
method __init__ (line 633) | def __init__(self, lexicon=None, path=""):
method path (line 670) | def path(self):
method load (line 673) | def load(self):
method apply (line 677) | def apply(self, tokens):
method insert (line 727) | def insert(self, i, tag1, tag2, cmd="prevtag", x=None, y=None, *args):
method append (line 739) | def append(self, *args, **kwargs):
method extend (line 742) | def extend(self, rules=None, *args):
class Entities (line 756) | class Entities(lazydict, Rules):
method __init__ (line 757) | def __init__(self, lexicon=None, path="", tag="NNP"):
method path (line 773) | def path(self):
method load (line 776) | def load(self):
method apply (line 783) | def apply(self, tokens):
method append (line 815) | def append(self, entity, name="pers"):
method extend (line 822) | def extend(self, entities):
function avg (line 854) | def avg(list):
class Score (line 858) | class Score(tuple):
method __new__ (line 859) | def __new__(self, polarity, subjectivity, assessments=None):
method __init__ (line 865) | def __init__(self, polarity, subjectivity, assessments=None):
class Sentiment (line 871) | class Sentiment(lazydict):
method __init__ (line 872) | def __init__(self, path="", language=None, synset=None, confidence=Non...
method path (line 890) | def path(self):
method language (line 894) | def language(self):
method confidence (line 898) | def confidence(self):
method load (line 901) | def load(self, path=None):
method synset (line 951) | def synset(self, id, pos=ADJECTIVE):
method __call__ (line 970) | def __call__(self, s, negation=True, **kwargs):
method assessments (line 1048) | def assessments(self, words=None, negation=True):
method annotate (line 1136) | def annotate(
function _suffix_rules (line 1154) | def _suffix_rules(token, tag="NN"):
function find_tags (line 1178) | def find_tags(
function find_chunks (line 1316) | def find_chunks(tagged, language="en"):
function find_prepositions (line 1360) | def find_prepositions(chunked):
class Parser (line 1424) | class Parser:
method __init__ (line 1425) | def __init__(self, lexicon=None, default=("NN", "NNP", "CD"), language...
method find_tokens (line 1440) | def find_tokens(self, string, **kwargs):
method find_tags (line 1453) | def find_tags(self, tokens, **kwargs):
method find_chunks (line 1466) | def find_chunks(self, tokens, **kwargs):
method find_prepositions (line 1476) | def find_prepositions(self, tokens, **kwargs):
method find_labels (line 1480) | def find_labels(self, tokens, **kwargs):
method find_lemmata (line 1484) | def find_lemmata(self, tokens, **kwargs):
method parse (line 1488) | def parse(
class TaggedString (line 1573) | class TaggedString(str):
method __new__ (line 1574) | def __new__(cls, string, tags=None, language=None):
method split (line 1595) | def split(self, sep=TOKENS):
class Spelling (line 1616) | class Spelling(lazydict):
method __init__ (line 1619) | def __init__(self, path=""):
method load (line 1622) | def load(self):
method path (line 1628) | def path(self):
method language (line 1632) | def language(self):
method train (line 1636) | def train(cls, s, path="spelling.txt"):
method _edit1 (line 1649) | def _edit1(self, w):
method _edit2 (line 1662) | def _edit2(self, w):
method _known (line 1668) | def _known(self, words=None):
method suggest (line 1674) | def suggest(self, w):
FILE: src/textblob/base.py
class BaseTagger (line 21) | class BaseTagger(metaclass=ABCMeta):
method tag (line 28) | def tag(self, text: str, tokenize=True) -> list[tuple[str, str]]:
class BaseNPExtractor (line 38) | class BaseNPExtractor(metaclass=ABCMeta):
method extract (line 45) | def extract(self, text: str) -> list[str]:
class BaseTokenizer (line 53) | class BaseTokenizer(nltk.tokenize.api.TokenizerI, metaclass=ABCMeta): #...
method tokenize (line 60) | def tokenize(self, text: str) -> list[str]:
method itokenize (line 67) | def itokenize(self, text: str, *args, **kwargs):
class BaseSentimentAnalyzer (line 84) | class BaseSentimentAnalyzer(metaclass=ABCMeta):
method __init__ (line 94) | def __init__(self):
method train (line 97) | def train(self):
method analyze (line 102) | def analyze(self, text) -> Any:
class BaseParser (line 116) | class BaseParser(metaclass=ABCMeta):
method parse (line 122) | def parse(self, text: AnyStr):
FILE: src/textblob/blob.py
function _penn_to_wordnet (line 55) | def _penn_to_wordnet(tag):
class Word (line 68) | class Word(str):
method __new__ (line 73) | def __new__(cls, string, pos_tag=None):
method __init__ (line 80) | def __init__(self, string, pos_tag=None):
method __repr__ (line 84) | def __repr__(self):
method __str__ (line 87) | def __str__(self):
method singularize (line 90) | def singularize(self):
method pluralize (line 94) | def pluralize(self):
method spellcheck (line 98) | def spellcheck(self):
method correct (line 109) | def correct(self):
method lemma (line 119) | def lemma(self):
method lemmatize (line 124) | def lemmatize(self, pos=None):
method stem (line 147) | def stem(self, stemmer=PorterStemmer):
method synsets (line 155) | def synsets(self):
method definitions (line 165) | def definitions(self):
method get_synsets (line 173) | def get_synsets(self, pos=None):
method define (line 185) | def define(self, pos=None):
class WordList (line 198) | class WordList(list):
method __init__ (line 201) | def __init__(self, collection):
method __str__ (line 207) | def __str__(self):
method __repr__ (line 211) | def __repr__(self):
method __getitem__ (line 216) | def __getitem__(self, key):
method __getslice__ (line 224) | def __getslice__(self, i, j):
method __setitem__ (line 228) | def __setitem__(self, index, obj):
method count (line 237) | def count(self, strg, case_sensitive=False, *args, **kwargs):
method append (line 247) | def append(self, obj):
method extend (line 256) | def extend(self, iterable):
method upper (line 263) | def upper(self):
method lower (line 267) | def lower(self):
method singularize (line 271) | def singularize(self):
method pluralize (line 275) | def pluralize(self):
method lemmatize (line 279) | def lemmatize(self):
method stem (line 283) | def stem(self, *args, **kwargs):
function _validated_param (line 288) | def _validated_param(obj, name, base_class, default, base_class_name=None):
function _initialize_models (line 303) | def _initialize_models(
class BaseBlob (line 331) | class BaseBlob(StringlikeMixin, BlobComparableMixin):
method __init__ (line 359) | def __init__(
method words (line 388) | def words(self):
method tokens (line 398) | def tokens(self):
method tokenize (line 404) | def tokenize(self, tokenizer=None):
method parse (line 413) | def parse(self, parser=None):
method classify (line 424) | def classify(self):
method sentiment (line 431) | def sentiment(self):
method sentiment_assessments (line 442) | def sentiment_assessments(self):
method polarity (line 455) | def polarity(self):
method subjectivity (line 463) | def subjectivity(self):
method noun_phrases (line 472) | def noun_phrases(self):
method pos_tags (line 483) | def pos_tags(self):
method word_counts (line 516) | def word_counts(self):
method np_counts (line 525) | def np_counts(self):
method ngrams (line 532) | def ngrams(self, n=3):
method correct (line 545) | def correct(self):
method _cmpkey (line 558) | def _cmpkey(self):
method _strkey (line 564) | def _strkey(self):
method __hash__ (line 568) | def __hash__(self):
method __add__ (line 571) | def __add__(self, other):
method split (line 587) | def split(self, sep=None, maxsplit=sys.maxsize):
class TextBlob (line 596) | class TextBlob(BaseBlob):
method sentences (line 613) | def sentences(self):
method words (line 618) | def words(self):
method raw_sentences (line 628) | def raw_sentences(self):
method serialized (line 633) | def serialized(self):
method to_json (line 637) | def to_json(self, *args, **kwargs):
method json (line 646) | def json(self):
method _create_sentence_objects (line 655) | def _create_sentence_objects(self):
class Sentence (line 682) | class Sentence(BaseBlob):
method __init__ (line 693) | def __init__(self, sentence, start_index=0, end_index=None, *args, **k...
method dict (line 701) | def dict(self):
class Blobber (line 714) | class Blobber:
method __init__ (line 750) | def __init__(
method __call__ (line 763) | def __call__(self, text):
method __repr__ (line 779) | def __repr__(self):
FILE: src/textblob/classifiers.py
function _get_words_from_dataset (line 49) | def _get_words_from_dataset(dataset):
function _get_document_tokens (line 68) | def _get_document_tokens(document):
function basic_extractor (line 79) | def basic_extractor(document, train_set):
function contains_extractor (line 106) | def contains_extractor(document):
class BaseClassifier (line 118) | class BaseClassifier:
method __init__ (line 139) | def __init__(
method _read_data (line 153) | def _read_data(self, dataset, format=None):
method classifier (line 172) | def classifier(self):
method classify (line 176) | def classify(self, text):
method train (line 180) | def train(self, labeled_featureset):
method labels (line 184) | def labels(self):
method extract_features (line 188) | def extract_features(self, text):
class NLTKClassifier (line 200) | class NLTKClassifier(BaseClassifier):
method __init__ (line 215) | def __init__(
method __repr__ (line 221) | def __repr__(self):
method classifier (line 226) | def classifier(self):
method train (line 235) | def train(self, *args, **kwargs):
method labels (line 256) | def labels(self):
method classify (line 260) | def classify(self, text):
method accuracy (line 268) | def accuracy(self, test_set, format=None):
method update (line 284) | def update(self, new_data, *args, **kwargs):
class NaiveBayesClassifier (line 305) | class NaiveBayesClassifier(NLTKClassifier):
method prob_classify (line 323) | def prob_classify(self, text):
method informative_features (line 342) | def informative_features(self, *args, **kwargs):
method show_informative_features (line 350) | def show_informative_features(self, *args, **kwargs):
class DecisionTreeClassifier (line 359) | class DecisionTreeClassifier(NLTKClassifier):
method pretty_format (line 377) | def pretty_format(self, *args, **kwargs):
method pseudocode (line 389) | def pseudocode(self, *args, **kwargs):
class PositiveNaiveBayesClassifier (line 398) | class PositiveNaiveBayesClassifier(NLTKClassifier):
method __init__ (line 440) | def __init__(
method __repr__ (line 455) | def __repr__(self):
method train (line 463) | def train(self, *args, **kwargs):
method update (line 477) | def update(
class MaxEntClassifier (line 512) | class MaxEntClassifier(NLTKClassifier):
method prob_classify (line 516) | def prob_classify(self, text):
FILE: src/textblob/decorators.py
class cached_property (line 17) | class cached_property:
method __init__ (line 25) | def __init__(self, func):
method __get__ (line 29) | def __get__(self, obj, cls):
function requires_nltk_corpus (line 36) | def requires_nltk_corpus(
FILE: src/textblob/download_corpora.py
function download_lite (line 34) | def download_lite():
function download_all (line 39) | def download_all():
function main (line 44) | def main():
FILE: src/textblob/en/__init__.py
function find_lemmata (line 20) | def find_lemmata(tokens):
class Parser (line 36) | class Parser(_Parser):
method find_lemmata (line 37) | def find_lemmata(self, tokens, **kwargs):
method find_tags (line 40) | def find_tags(self, tokens, **kwargs):
class Sentiment (line 50) | class Sentiment(_Sentiment):
method load (line 51) | def load(self, path=None):
function tokenize (line 85) | def tokenize(s, *args, **kwargs):
function parse (line 90) | def parse(s, *args, **kwargs):
function parsetree (line 95) | def parsetree(s, *args, **kwargs):
function split (line 100) | def split(s, token=None):
function tag (line 107) | def tag(s, tokenize=True, encoding="utf-8"):
function suggest (line 116) | def suggest(w):
function polarity (line 121) | def polarity(s, **kwargs):
function subjectivity (line 126) | def subjectivity(s, **kwargs):
function positive (line 131) | def positive(s, threshold=0.1, **kwargs):
FILE: src/textblob/en/inflect.py
function pluralize (line 534) | def pluralize(word: str, pos=NOUN, custom=None, classical=True) -> str:
function singularize (line 847) | def singularize(word: str, pos=NOUN, custom: MutableMapping[str, str] | ...
FILE: src/textblob/en/np_extractors.py
class ChunkParser (line 11) | class ChunkParser(nltk.ChunkParserI):
method __init__ (line 14) | def __init__(self):
method train (line 18) | def train(self):
method parse (line 30) | def parse(self, tokens):
class ConllExtractor (line 44) | class ConllExtractor(BaseNPExtractor):
method __init__ (line 63) | def __init__(self, parser=None):
method extract (line 66) | def extract(self, text):
method _parse_sentence (line 86) | def _parse_sentence(self, sentence):
class FastNPExtractor (line 92) | class FastNPExtractor(BaseNPExtractor):
method __init__ (line 110) | def __init__(self):
method train (line 114) | def train(self):
method _tokenize_sentence (line 137) | def _tokenize_sentence(self, sentence):
method extract (line 142) | def extract(self, text):
function _normalize_tags (line 173) | def _normalize_tags(chunk):
function _is_match (line 192) | def _is_match(tagged_phrase, cfg):
FILE: src/textblob/en/parsers.py
class PatternParser (line 9) | class PatternParser(BaseParser):
method parse (line 14) | def parse(self, text):
FILE: src/textblob/en/sentiments.py
class PatternAnalyzer (line 15) | class PatternAnalyzer(BaseSentimentAnalyzer):
method analyze (line 30) | def analyze(self, text, keep_assessments=False):
function _default_feature_extractor (line 48) | def _default_feature_extractor(words):
class NaiveBayesAnalyzer (line 53) | class NaiveBayesAnalyzer(BaseSentimentAnalyzer):
method __init__ (line 66) | def __init__(self, feature_extractor=_default_feature_extractor):
method train (line 72) | def train(self):
method analyze (line 94) | def analyze(self, text):
FILE: src/textblob/en/taggers.py
class PatternTagger (line 11) | class PatternTagger(BaseTagger):
method tag (line 17) | def tag(self, text, tokenize=True):
class NLTKTagger (line 24) | class NLTKTagger(BaseTagger):
method tag (line 30) | def tag(self, text):
FILE: src/textblob/exceptions.py
class TextBlobError (line 13) | class TextBlobError(Exception):
class MissingCorpusError (line 22) | class MissingCorpusError(TextBlobError):
method __init__ (line 27) | def __init__(self, message=MISSING_CORPUS_MESSAGE, *args, **kwargs):
class DeprecationError (line 34) | class DeprecationError(TextBlobError):
class TranslatorError (line 40) | class TranslatorError(TextBlobError):
class NotTranslated (line 46) | class NotTranslated(TranslatorError):
class FormatError (line 54) | class FormatError(TextBlobError):
FILE: src/textblob/formats.py
class BaseFormat (line 35) | class BaseFormat:
method __init__ (line 45) | def __init__(self, fp, **kwargs):
method to_iterable (line 48) | def to_iterable(self):
method detect (line 53) | def detect(cls, stream: str):
class DelimitedFormat (line 63) | class DelimitedFormat(BaseFormat):
method __init__ (line 69) | def __init__(self, fp, **kwargs):
method to_iterable (line 74) | def to_iterable(self):
method detect (line 79) | def detect(cls, stream):
class CSV (line 88) | class CSV(DelimitedFormat):
class TSV (line 99) | class TSV(DelimitedFormat):
class JSON (line 105) | class JSON(BaseFormat):
method __init__ (line 118) | def __init__(self, fp, **kwargs):
method to_iterable (line 122) | def to_iterable(self):
method detect (line 127) | def detect(cls, stream: str | bytes | bytearray):
function detect (line 145) | def detect(fp, max_read=1024):
function get_registry (line 160) | def get_registry():
function register (line 165) | def register(name, format_class):
FILE: src/textblob/mixins.py
class ComparableMixin (line 4) | class ComparableMixin:
method _cmpkey (line 7) | def _cmpkey(self):
method _compare (line 10) | def _compare(self, other, method):
method __lt__ (line 18) | def __lt__(self, other):
method __le__ (line 21) | def __le__(self, other):
method __eq__ (line 24) | def __eq__(self, other):
method __ge__ (line 27) | def __ge__(self, other):
method __gt__ (line 30) | def __gt__(self, other):
method __ne__ (line 33) | def __ne__(self, other):
class BlobComparableMixin (line 37) | class BlobComparableMixin(ComparableMixin):
method _compare (line 40) | def _compare(self, other, method):
class StringlikeMixin (line 47) | class StringlikeMixin:
method _strkey (line 55) | def _strkey(self) -> str:
method __repr__ (line 58) | def __repr__(self):
method __str__ (line 64) | def __str__(self):
method __len__ (line 69) | def __len__(self):
method __iter__ (line 73) | def __iter__(self):
method __contains__ (line 79) | def __contains__(self, sub):
method __getitem__ (line 83) | def __getitem__(self, index):
method find (line 94) | def find(self, sub, start=0, end=sys.maxsize):
method rfind (line 101) | def rfind(self, sub, start=0, end=sys.maxsize):
method index (line 108) | def index(self, sub, start=0, end=sys.maxsize):
method rindex (line 114) | def rindex(self, sub, start=0, end=sys.maxsize):
method startswith (line 120) | def startswith(self, prefix, start=0, end=sys.maxsize):
method endswith (line 124) | def endswith(self, suffix, start=0, end=sys.maxsize):
method title (line 132) | def title(self):
method format (line 136) | def format(self, *args, **kwargs):
method split (line 142) | def split(self, sep=None, maxsplit=sys.maxsize):
method strip (line 146) | def strip(self, chars=None):
method upper (line 152) | def upper(self):
method lower (line 156) | def lower(self):
method join (line 160) | def join(self, iterable):
method replace (line 169) | def replace(self, old, new, count=sys.maxsize):
FILE: src/textblob/tokenizers.py
class WordTokenizer (line 15) | class WordTokenizer(BaseTokenizer):
method tokenize (line 27) | def tokenize(self, text, include_punc=True):
class SentenceTokenizer (line 50) | class SentenceTokenizer(BaseTokenizer):
method tokenize (line 58) | def tokenize(self, text):
function word_tokenize (line 69) | def word_tokenize(text, include_punc=True, *args, **kwargs):
FILE: src/textblob/utils.py
function strip_punc (line 13) | def strip_punc(s: str, all=False):
function lowerstrip (line 26) | def lowerstrip(s: str, all=False):
function tree2str (line 36) | def tree2str(tree, concat=" "):
function filter_insignificant (line 45) | def filter_insignificant(
function is_filelike (line 61) | def is_filelike(obj):
FILE: tests/test_blob.py
class WordListTest (line 49) | class WordListTest(TestCase):
method setUp (line 50) | def setUp(self):
method test_len (line 54) | def test_len(self):
method test_slicing (line 58) | def test_slicing(self):
method test_repr (line 68) | def test_repr(self):
method test_slice_repr (line 72) | def test_slice_repr(self):
method test_str (line 76) | def test_str(self):
method test_singularize (line 80) | def test_singularize(self):
method test_pluralize (line 86) | def test_pluralize(self):
method test_lemmatize (line 91) | def test_lemmatize(self):
method test_stem (line 95) | def test_stem(self): # only PorterStemmer tested
method test_upper (line 99) | def test_upper(self):
method test_lower (line 103) | def test_lower(self):
method test_count (line 107) | def test_count(self):
method test_convert_to_list (line 113) | def test_convert_to_list(self):
method test_append (line 117) | def test_append(self):
method test_extend (line 124) | def test_extend(self):
method test_pop (line 130) | def test_pop(self):
method test_setitem (line 140) | def test_setitem(self):
method test_reverse (line 145) | def test_reverse(self):
class SentenceTest (line 151) | class SentenceTest(TestCase):
method setUp (line 152) | def setUp(self):
method test_repr (line 156) | def test_repr(self):
method test_stripped_sentence (line 159) | def test_stripped_sentence(self):
method test_len (line 165) | def test_len(self):
method test_dict (line 169) | def test_dict(self):
method test_pos_tags (line 181) | def test_pos_tags(self):
method test_noun_phrases (line 209) | def test_noun_phrases(self):
method test_words_are_word_objects (line 213) | def test_words_are_word_objects(self):
method test_string_equality (line 218) | def test_string_equality(self):
method test_correct (line 221) | def test_correct(self):
class TextBlobTest (line 230) | class TextBlobTest(TestCase):
method setUp (line 231) | def setUp(self):
method test_init (line 273) | def test_init(self):
method test_string_equality (line 283) | def test_string_equality(self):
method test_string_comparison (line 287) | def test_string_comparison(self):
method test_hash (line 292) | def test_hash(self):
method test_stripped (line 297) | def test_stripped(self):
method test_ngrams (line 301) | def test_ngrams(self):
method test_clean_html (line 315) | def test_clean_html(self):
method test_sentences (line 327) | def test_sentences(self):
method test_senences_with_space_before_punctuation (line 332) | def test_senences_with_space_before_punctuation(self):
method test_sentiment_of_foreign_text (line 337) | def test_sentiment_of_foreign_text(self):
method test_iter (line 346) | def test_iter(self):
method test_raw_sentences (line 350) | def test_raw_sentences(self):
method test_blob_with_no_sentences (line 355) | def test_blob_with_no_sentences(self):
method test_len (line 364) | def test_len(self):
method test_repr (line 368) | def test_repr(self):
method test_cmp (line 372) | def test_cmp(self):
method test_invalid_comparison (line 383) | def test_invalid_comparison(self):
method test_words (line 389) | def test_words(self):
method test_words_includes_apostrophes_in_contractions (line 411) | def test_words_includes_apostrophes_in_contractions(self):
method test_pos_tags (line 419) | def test_pos_tags(self):
method test_tags (line 436) | def test_tags(self):
method test_tagging_nonascii (line 439) | def test_tagging_nonascii(self):
method test_pos_tags_includes_one_letter_articles (line 447) | def test_pos_tags_includes_one_letter_articles(self):
method test_np_extractor_defaults_to_fast_tagger (line 452) | def test_np_extractor_defaults_to_fast_tagger(self):
method test_np_extractor_is_shared_among_instances (line 457) | def test_np_extractor_is_shared_among_instances(self):
method test_can_use_different_np_extractors (line 463) | def test_can_use_different_np_extractors(self):
method test_can_use_different_sentanalyzer (line 470) | def test_can_use_different_sentanalyzer(self):
method test_discrete_sentiment (line 475) | def test_discrete_sentiment(self):
method test_can_get_subjectivity_and_polarity_with_different_analyzer (line 479) | def test_can_get_subjectivity_and_polarity_with_different_analyzer(self):
method test_pos_tagger_defaults_to_pattern (line 485) | def test_pos_tagger_defaults_to_pattern(self):
method test_pos_tagger_is_shared_among_instances (line 489) | def test_pos_tagger_is_shared_among_instances(self):
method test_can_use_different_pos_tagger (line 494) | def test_can_use_different_pos_tagger(self):
method test_can_pass_np_extractor_to_constructor (line 500) | def test_can_pass_np_extractor_to_constructor(self):
method test_getitem (line 505) | def test_getitem(self):
method test_upper (line 510) | def test_upper(self):
method test_upper_and_words (line 515) | def test_upper_and_words(self):
method test_lower (line 519) | def test_lower(self):
method test_find (line 524) | def test_find(self):
method test_rfind (line 529) | def test_rfind(self):
method test_startswith (line 534) | def test_startswith(self):
method test_endswith (line 539) | def test_endswith(self):
method test_split (line 544) | def test_split(self):
method test_title (line 548) | def test_title(self):
method test_format (line 552) | def test_format(self):
method test_using_indices_for_slicing (line 557) | def test_using_indices_for_slicing(self):
method test_indices_with_only_one_sentences (line 563) | def test_indices_with_only_one_sentences(self):
method test_indices_with_multiple_puncutations (line 568) | def test_indices_with_multiple_puncutations(self):
method test_indices_short_names (line 574) | def test_indices_short_names(self):
method test_replace (line 580) | def test_replace(self):
method test_join (line 585) | def test_join(self):
method test_blob_noun_phrases (line 592) | def test_blob_noun_phrases(self):
method test_word_counts (line 597) | def test_word_counts(self):
method test_np_counts (line 615) | def test_np_counts(self):
method test_add (line 624) | def test_add(self):
method test_unicode (line 640) | def test_unicode(self):
method test_strip (line 644) | def test_strip(self):
method test_strip_and_words (line 650) | def test_strip_and_words(self):
method test_index (line 654) | def test_index(self):
method test_sentences_after_concatenation (line 658) | def test_sentences_after_concatenation(self):
method test_sentiment (line 665) | def test_sentiment(self):
method test_subjectivity (line 675) | def test_subjectivity(self):
method test_polarity (line 680) | def test_polarity(self):
method test_sentiment_of_emoticons (line 685) | def test_sentiment_of_emoticons(self):
method test_bad_init (line 690) | def test_bad_init(self):
method test_in (line 698) | def test_in(self):
method test_json (line 704) | def test_json(self):
method test_words_are_word_objects (line 719) | def test_words_are_word_objects(self):
method test_words_have_pos_tags (line 723) | def test_words_have_pos_tags(self):
method test_tokenizer_defaults_to_word_tokenizer (line 731) | def test_tokenizer_defaults_to_word_tokenizer(self):
method test_tokens_property (line 734) | def test_tokens_property(self):
method test_can_use_an_different_tokenizer (line 737) | def test_can_use_an_different_tokenizer(self):
method test_tokenize_method (line 742) | def test_tokenize_method(self):
method test_tags_uses_custom_tokenizer (line 750) | def test_tags_uses_custom_tokenizer(self):
method test_tags_with_custom_tokenizer_and_tagger (line 764) | def test_tags_with_custom_tokenizer_and_tagger(self):
method test_correct (line 783) | def test_correct(self):
method test_parse (line 813) | def test_parse(self):
method test_passing_bad_init_params (line 817) | def test_passing_bad_init_params(self):
method test_classify (line 830) | def test_classify(self):
method test_classify_without_classifier (line 839) | def test_classify_without_classifier(self):
method test_word_string_type_after_pos_tags_is_str (line 844) | def test_word_string_type_after_pos_tags_is_str(self):
class WordTest (line 851) | class WordTest(TestCase):
method setUp (line 852) | def setUp(self):
method test_init (line 856) | def test_init(self):
method test_singularize (line 862) | def test_singularize(self):
method test_pluralize (line 868) | def test_pluralize(self):
method test_repr (line 873) | def test_repr(self):
method test_str (line 876) | def test_str(self):
method test_has_str_methods (line 879) | def test_has_str_methods(self):
method test_spellcheck (line 884) | def test_spellcheck(self):
method test_spellcheck_special_cases (line 889) | def test_spellcheck_special_cases(self):
method test_correct (line 900) | def test_correct(self):
method test_lemmatize (line 907) | def test_lemmatize(self):
method test_lemma (line 916) | def test_lemma(self):
method test_stem (line 922) | def test_stem(self): # only PorterStemmer tested
method test_synsets (line 930) | def test_synsets(self):
method test_synsets_with_pos_argument (line 935) | def test_synsets_with_pos_argument(self):
method test_definitions (line 941) | def test_definitions(self):
method test_define (line 946) | def test_define(self):
class TestWordnetInterface (line 953) | class TestWordnetInterface(TestCase):
method setUp (line 954) | def setUp(self):
method test_synset (line 957) | def test_synset(self):
method test_lemma (line 962) | def test_lemma(self):
class BlobberTest (line 968) | class BlobberTest(TestCase):
method setUp (line 969) | def setUp(self):
method test_creates_blobs (line 972) | def test_creates_blobs(self):
method test_default_tagger (line 978) | def test_default_tagger(self):
method test_default_np_extractor (line 982) | def test_default_np_extractor(self):
method test_default_tokenizer (line 986) | def test_default_tokenizer(self):
method test_str_and_repr (line 990) | def test_str_and_repr(self):
method test_overrides (line 995) | def test_overrides(self):
method test_override_analyzer (line 1006) | def test_override_analyzer(self):
method test_overrider_classifier (line 1013) | def test_overrider_classifier(self):
function is_blob (line 1019) | def is_blob(obj):
FILE: tests/test_classifiers.py
class BadNLTKClassifier (line 49) | class BadNLTKClassifier(NLTKClassifier):
class TestNLTKClassifier (line 55) | class TestNLTKClassifier(unittest.TestCase):
method setUp (line 56) | def setUp(self):
method test_raises_value_error_without_nltk_class (line 59) | def test_raises_value_error_without_nltk_class(self):
class TestNaiveBayesClassifier (line 70) | class TestNaiveBayesClassifier(unittest.TestCase):
method setUp (line 71) | def setUp(self):
method test_default_extractor (line 74) | def test_default_extractor(self):
method test_classify (line 80) | def test_classify(self):
method test_classify_a_list_of_words (line 85) | def test_classify_a_list_of_words(self):
method test_train_from_lists_of_words (line 89) | def test_train_from_lists_of_words(self):
method test_prob_classify (line 95) | def test_prob_classify(self):
method test_accuracy (line 100) | def test_accuracy(self):
method test_update (line 104) | def test_update(self):
method test_labels (line 113) | def test_labels(self):
method test_show_informative_features (line 118) | def test_show_informative_features(self):
method test_informative_features (line 121) | def test_informative_features(self):
method test_custom_feature_extractor (line 126) | def test_custom_feature_extractor(self):
method test_init_with_csv_file (line 131) | def test_init_with_csv_file(self):
method test_init_with_csv_file_without_format_specifier (line 138) | def test_init_with_csv_file_without_format_specifier(self):
method test_init_with_json_file (line 145) | def test_init_with_json_file(self):
method test_init_with_json_file_without_format_specifier (line 152) | def test_init_with_json_file_without_format_specifier(self):
method test_init_with_custom_format (line 159) | def test_init_with_custom_format(self):
method test_data_with_no_available_format (line 179) | def test_data_with_no_available_format(self):
method test_accuracy_on_a_csv_file (line 186) | def test_accuracy_on_a_csv_file(self):
method test_accuracy_on_json_file (line 191) | def test_accuracy_on_json_file(self):
method test_init_with_tsv_file (line 196) | def test_init_with_tsv_file(self):
method test_init_with_bad_format_specifier (line 203) | def test_init_with_bad_format_specifier(self):
method test_repr (line 207) | def test_repr(self):
class TestDecisionTreeClassifier (line 214) | class TestDecisionTreeClassifier(unittest.TestCase):
method setUp (line 215) | def setUp(self):
method test_classify (line 218) | def test_classify(self):
method test_accuracy (line 223) | def test_accuracy(self):
method test_update (line 227) | def test_update(self):
method test_custom_feature_extractor (line 233) | def test_custom_feature_extractor(self):
method test_pseudocode (line 238) | def test_pseudocode(self):
method test_pretty_format (line 242) | def test_pretty_format(self):
method test_repr (line 248) | def test_repr(self):
class TestMaxEntClassifier (line 257) | class TestMaxEntClassifier(unittest.TestCase):
method setUp (line 258) | def setUp(self):
method test_classify (line 261) | def test_classify(self):
method test_prob_classify (line 266) | def test_prob_classify(self):
class TestPositiveNaiveBayesClassifier (line 272) | class TestPositiveNaiveBayesClassifier(unittest.TestCase):
method setUp (line 273) | def setUp(self):
method test_classifier (line 296) | def test_classifier(self):
method test_classify (line 301) | def test_classify(self):
method test_update (line 305) | def test_update(self):
method test_accuracy (line 317) | def test_accuracy(self):
method test_repr (line 328) | def test_repr(self):
function test_basic_extractor (line 335) | def test_basic_extractor():
function test_basic_extractor_with_list (line 343) | def test_basic_extractor_with_list():
function test_contains_extractor_with_string (line 351) | def test_contains_extractor_with_string():
function test_contains_extractor_with_list (line 360) | def test_contains_extractor_with_list():
function custom_extractor (line 369) | def custom_extractor(document):
function test_get_words_from_dataset (line 378) | def test_get_words_from_dataset():
FILE: tests/test_decorators.py
class Tokenizer (line 9) | class Tokenizer:
method tag (line 11) | def tag(self, text):
function test_decorator_raises_missing_corpus_exception (line 15) | def test_decorator_raises_missing_corpus_exception():
FILE: tests/test_formats.py
class TestFormats (line 12) | class TestFormats(unittest.TestCase):
method setUp (line 13) | def setUp(self):
method test_detect_csv (line 16) | def test_detect_csv(self):
method test_detect_json (line 21) | def test_detect_json(self):
method test_available (line 26) | def test_available(self):
class TestDelimitedFormat (line 33) | class TestDelimitedFormat(unittest.TestCase):
method test_delimiter_defaults_to_comma (line 34) | def test_delimiter_defaults_to_comma(self):
method test_detect (line 37) | def test_detect(self):
class TestCSV (line 46) | class TestCSV(unittest.TestCase):
method test_read_from_filename (line 47) | def test_read_from_filename(self):
method test_detect (line 51) | def test_detect(self):
class TestTSV (line 60) | class TestTSV(unittest.TestCase):
method test_read_from_file_object (line 61) | def test_read_from_file_object(self):
method test_detect (line 65) | def test_detect(self):
class TestJSON (line 75) | class TestJSON(unittest.TestCase):
method test_read_from_file_object (line 76) | def test_read_from_file_object(self):
method test_detect (line 80) | def test_detect(self):
method test_to_iterable (line 88) | def test_to_iterable(self):
class CustomFormat (line 97) | class CustomFormat(formats.BaseFormat):
method to_iterable (line 98) | def to_iterable():
method detect (line 102) | def detect(cls, stream):
class TestRegistry (line 106) | class TestRegistry(unittest.TestCase):
method setUp (line 107) | def setUp(self):
method test_register (line 110) | def test_register(self):
FILE: tests/test_inflect.py
class InflectTestCase (line 12) | class InflectTestCase(TestCase):
method s_singular_pluralize_test (line 13) | def s_singular_pluralize_test(self):
method s_singular_singularize_test (line 16) | def s_singular_singularize_test(self):
method diagnoses_singularize_test (line 19) | def diagnoses_singularize_test(self):
method bus_pluralize_test (line 22) | def bus_pluralize_test(self):
method test_all_singular_s (line 25) | def test_all_singular_s(self):
method test_all_singular_ie (line 29) | def test_all_singular_ie(self):
method test_all_singular_irregular (line 34) | def test_all_singular_irregular(self):
FILE: tests/test_np_extractor.py
class TestConllExtractor (line 11) | class TestConllExtractor(unittest.TestCase):
method setUp (line 12) | def setUp(self):
method test_extract (line 26) | def test_extract(self):
method test_parse_sentence (line 33) | def test_parse_sentence(self):
method test_filter_insignificant (line 38) | def test_filter_insignificant(self):
class BadExtractor (line 47) | class BadExtractor(BaseNPExtractor):
function test_cannot_instantiate_incomplete_extractor (line 53) | def test_cannot_instantiate_incomplete_extractor():
FILE: tests/test_parsers.py
class TestPatternParser (line 7) | class TestPatternParser(unittest.TestCase):
method setUp (line 8) | def setUp(self):
method test_parse (line 12) | def test_parse(self):
FILE: tests/test_sentiments.py
class TestPatternSentiment (line 13) | class TestPatternSentiment(unittest.TestCase):
method setUp (line 14) | def setUp(self):
method test_kind (line 17) | def test_kind(self):
method test_analyze (line 20) | def test_analyze(self):
method test_analyze_assessments (line 30) | def test_analyze_assessments(self):
class TestNaiveBayesAnalyzer (line 43) | class TestNaiveBayesAnalyzer(unittest.TestCase):
method setUp (line 44) | def setUp(self):
method test_kind (line 47) | def test_kind(self):
method test_analyze (line 51) | def test_analyze(self):
function assert_about_equal (line 67) | def assert_about_equal(first, second, places=4):
FILE: tests/test_taggers.py
class TestPatternTagger (line 13) | class TestPatternTagger(unittest.TestCase):
method setUp (line 14) | def setUp(self):
method test_init (line 18) | def test_init(self):
method test_tag (line 22) | def test_tag(self):
class TestNLTKTagger (line 42) | class TestNLTKTagger(unittest.TestCase):
method setUp (line 43) | def setUp(self):
method test_tag (line 47) | def test_tag(self):
function test_cannot_instantiate_incomplete_tagger (line 65) | def test_cannot_instantiate_incomplete_tagger():
FILE: tests/test_tokenizers.py
function is_generator (line 13) | def is_generator(obj):
class TestWordTokenizer (line 17) | class TestWordTokenizer(unittest.TestCase):
method setUp (line 18) | def setUp(self):
method tearDown (line 22) | def tearDown(self):
method test_tokenize (line 25) | def test_tokenize(self):
method test_exclude_punc (line 36) | def test_exclude_punc(self):
method test_itokenize (line 46) | def test_itokenize(self):
method test_word_tokenize (line 51) | def test_word_tokenize(self):
class TestSentenceTokenizer (line 57) | class TestSentenceTokenizer(unittest.TestCase):
method setUp (line 58) | def setUp(self):
method test_tokenize (line 62) | def test_tokenize(self):
method test_tokenize_with_multiple_punctuation (line 69) | def test_tokenize_with_multiple_punctuation(self):
method test_itokenize (line 81) | def test_itokenize(self):
method test_sent_tokenize (line 86) | def test_sent_tokenize(self):
FILE: tests/test_utils.py
class UtilsTests (line 10) | class UtilsTests(TestCase):
method setUp (line 11) | def setUp(self):
method test_strip_punc (line 14) | def test_strip_punc(self):
method test_strip_punc_all (line 17) | def test_strip_punc_all(self):
method test_lowerstrip (line 20) | def test_lowerstrip(self):
function test_is_filelike (line 24) | def test_is_filelike():
Condensed preview — 88 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (2,648K chars).
[
{
"path": ".github/dependabot.yml",
"chars": 216,
"preview": "version: 2\nupdates:\n- package-ecosystem: pip\n directory: \"/\"\n schedule:\n interval: daily\n open-pull-requests-limit"
},
{
"path": ".github/workflows/build-release.yml",
"chars": 2356,
"preview": "name: build\non:\n push:\n branches: [\"dev\"]\n tags: [\"*\"]\n pull_request:\n\njobs:\n tests:\n name: ${{ matrix.name "
},
{
"path": ".gitignore",
"chars": 501,
"preview": "*.py[cod]\n\n# virtualenv\n.venv/\nvenv/\n\n# C extensions\n*.so\n\n# Packages\n*.egg\n*.egg-info\ndist\nbuild\neggs\nparts\nbin\nvar\nsdi"
},
{
"path": ".konchrc",
"chars": 311,
"preview": "# -*- coding: utf-8 -*-\n# vi: set ft=python :\nimport konch\nfrom textblob import TextBlob, Blobber, Word, Sentence\n\nkonch"
},
{
"path": ".pre-commit-config.yaml",
"chars": 378,
"preview": "repos:\n- repo: https://github.com/astral-sh/ruff-pre-commit\n rev: v0.15.6\n hooks:\n - id: ruff\n - id: ruff-format"
},
{
"path": ".readthedocs.yml",
"chars": 212,
"preview": "version: 2\nsphinx:\n configuration: docs/conf.py\nformats:\n - pdf\nbuild:\n os: ubuntu-22.04\n tools:\n python: \"3.11\"\n"
},
{
"path": "AUTHORS.rst",
"chars": 1687,
"preview": "*******\nAuthors\n*******\n\nDevelopment Lead\n================\n\n- Steven Loria <sloria1@gmail.com> `@sloria <https://github."
},
{
"path": "CHANGELOG.rst",
"chars": 15057,
"preview": "Changelog\n=========\n\n0.19.0 (2025-01-13)\n___________________\n\nBug fixes:\n\n- Fix ``textblob.download_corpora`` script (:i"
},
{
"path": "CONTRIBUTING.rst",
"chars": 4083,
"preview": "Contributing guidelines\n=======================\n\nIn General\n----------\n\n- `PEP 8`_, when sensible.\n- Conventions *and* c"
},
{
"path": "LICENSE",
"chars": 1064,
"preview": "Copyright Steven Loria and contributors\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof"
},
{
"path": "NOTICE",
"chars": 1634,
"preview": "TextBlob includes some vendorized python libraries, including parts of pattern.\n\npattern License\n===============\n\nCopyri"
},
{
"path": "README.rst",
"chars": 3018,
"preview": "\nTextBlob: Simplified Text Processing\n====================================\n\n.. image:: https://badgen.net/pypi/v/TextBlo"
},
{
"path": "RELEASING.md",
"chars": 275,
"preview": "# Releasing\n\n1. Bump version in `pyproject.toml` and update the changelog\n with today's date.\n2. Commit: `git commit -"
},
{
"path": "SECURITY.md",
"chars": 192,
"preview": "# Security Contact Information\n\nTo report a security vulnerability, please use the\n[Tidelift security contact](https://t"
},
{
"path": "docs/Makefile",
"chars": 6770,
"preview": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHINXBUILD "
},
{
"path": "docs/_templates/side-primary.html",
"chars": 1199,
"preview": "<p class=\"logo\">\n <a href=\"{{ pathto(master_doc) }}\"\n ><img\n class=\"logo\"\n src=\"{{ pathto('_static/textblo"
},
{
"path": "docs/_templates/side-secondary.html",
"chars": 974,
"preview": "<p class=\"logo\">\n <a href=\"{{ pathto(master_doc) }}\"\n ><img\n class=\"logo\"\n src=\"{{ pathto('_static/textblo"
},
{
"path": "docs/_themes/.gitignore",
"chars": 22,
"preview": "*.pyc\n*.pyo\n.DS_Store\n"
},
{
"path": "docs/_themes/LICENSE",
"chars": 1863,
"preview": "Modifications: \n\nCopyright (c) 2010 Kenneth Reitz.\n\n\nOriginal Project: \n\nCopyright (c) 2010 by Armin Ronacher.\n\n\nSome ri"
},
{
"path": "docs/_themes/flask_theme_support.py",
"chars": 3905,
"preview": "# flasky extensions. flasky pygments style based on tango style\nfrom pygments.style import Style\nfrom pygments.token im"
},
{
"path": "docs/_themes/kr/layout.html",
"chars": 718,
"preview": "{%- extends \"basic/layout.html\" %} {%- block extrahead %} {{ super() }} {% if\ntheme_touch_icon %}\n<link\n rel=\"apple-tou"
},
{
"path": "docs/_themes/kr/relations.html",
"chars": 590,
"preview": "<h3>Related Topics</h3>\n<ul>\n <li><a href=\"{{ pathto(master_doc) }}\">Documentation overview</a><ul>\n {%- for parent in"
},
{
"path": "docs/_themes/kr/static/flasky.css_t",
"chars": 8405,
"preview": "/*\n * flasky.css_t\n * ~~~~~~~~~~~~\n *\n * :copyright: Copyright 2010 by Armin Ronacher. Modifications by Kenneth Reitz.\n "
},
{
"path": "docs/_themes/kr/static/small_flask.css",
"chars": 1163,
"preview": "/*\n * small_flask.css_t\n * ~~~~~~~~~~~~~~~~~\n *\n * :copyright: Copyright 2010 by Armin Ronacher.\n * :license: Flask Desi"
},
{
"path": "docs/_themes/kr/theme.conf",
"chars": 122,
"preview": "[theme]\ninherit = basic\nstylesheet = flasky.css\npygments_style = flask_theme_support.FlaskyStyle\n\n[options]\ntouch_icon ="
},
{
"path": "docs/_themes/kr_small/layout.html",
"chars": 683,
"preview": "{% extends \"basic/layout.html\" %} {% block header %} {{ super() }} {% if\npagename == 'index' %}\n<div class=\"indexwrapper"
},
{
"path": "docs/_themes/kr_small/static/flasky.css_t",
"chars": 4609,
"preview": "/*\n * flasky.css_t\n * ~~~~~~~~~~~~\n *\n * Sphinx stylesheet -- flasky theme based on nature theme.\n *\n * :copyright: Copy"
},
{
"path": "docs/_themes/kr_small/theme.conf",
"chars": 184,
"preview": "[theme]\ninherit = basic\nstylesheet = flasky.css\nnosidebar = true\npygments_style = flask_theme_support.FlaskyStyle\n\n[opti"
},
{
"path": "docs/advanced_usage.rst",
"chars": 5084,
"preview": ".. _advanced:\n\nAdvanced Usage: Overriding Models and the Blobber Class\n================================================="
},
{
"path": "docs/api_reference.rst",
"chars": 1611,
"preview": ".. _api:\n\nAPI Reference\n=============\n\nBlob Classes\n------------\n\n.. automodule:: textblob.blob\n :members:\n :inher"
},
{
"path": "docs/authors.rst",
"chars": 27,
"preview": ".. include:: ../AUTHORS.rst"
},
{
"path": "docs/changelog.rst",
"chars": 46,
"preview": ".. _changelog:\n\n.. include:: ../CHANGELOG.rst\n"
},
{
"path": "docs/classifiers.rst",
"chars": 6487,
"preview": ".. _classifiers:\n\nTutorial: Building a Text Classification System\n***********************************************\n\nThe `"
},
{
"path": "docs/conf.py",
"chars": 2622,
"preview": "import importlib.metadata\nimport os\nimport sys\n\nsys.path.append(os.path.abspath(\"_themes\"))\n\n# -- General configuration "
},
{
"path": "docs/contributing.rst",
"chars": 32,
"preview": ".. include:: ../CONTRIBUTING.rst"
},
{
"path": "docs/extensions.rst",
"chars": 752,
"preview": ".. _extensions:\n\n**********\nExtensions\n**********\n\nTextBlob supports adding custom models and new languages through \"ext"
},
{
"path": "docs/index.rst",
"chars": 2552,
"preview": ".. textblob documentation master file, created by\n sphinx-quickstart on Mon Aug 5 01:41:33 2013.\n You can adapt thi"
},
{
"path": "docs/install.rst",
"chars": 2489,
"preview": ".. _install:\n\nInstallation\n============\n\nInstalling/Upgrading From the PyPI\n----------------------------------\n::\n\n $"
},
{
"path": "docs/license.rst",
"chars": 48,
"preview": "License\n=======\n\n.. literalinclude:: ../LICENSE\n"
},
{
"path": "docs/make.bat",
"chars": 6705,
"preview": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\n"
},
{
"path": "docs/quickstart.rst",
"chars": 9386,
"preview": ".. _quickstart:\n\nTutorial: Quickstart\n====================\n\n.. module:: textblob.blob\n\nTextBlob aims to provide access t"
},
{
"path": "pyproject.toml",
"chars": 2428,
"preview": "[project]\nname = \"textblob\"\nversion = \"0.19.0\"\ndescription = \"Simple, Pythonic text processing. Sentiment analysis, part"
},
{
"path": "src/textblob/__init__.py",
"chars": 152,
"preview": "from .blob import Blobber, Sentence, TextBlob, Word, WordList\n\n__all__ = [\n \"TextBlob\",\n \"Word\",\n \"Sentence\",\n "
},
{
"path": "src/textblob/_text.py",
"chars": 64422,
"preview": "\"\"\"This file is adapted from the pattern library.\n\nURL: http://www.clips.ua.ac.be/pages/pattern-web\nLicence: BSD\n\"\"\"\nimp"
},
{
"path": "src/textblob/base.py",
"chars": 3079,
"preview": "\"\"\"Abstract base classes for models (taggers, noun phrase extractors, etc.)\nwhich define the interface for descendant cl"
},
{
"path": "src/textblob/blob.py",
"chars": 27446,
"preview": "\"\"\"Wrappers for various units of text, including the main\n:class:`TextBlob <textblob.blob.TextBlob>`, :class:`Word <text"
},
{
"path": "src/textblob/classifiers.py",
"chars": 19245,
"preview": "\"\"\"Various classifier implementations. Also includes basic feature extractor\nmethods.\n\nExample Usage:\n::\n\n >>> from t"
},
{
"path": "src/textblob/decorators.py",
"chars": 1258,
"preview": "\"\"\"Custom decorators.\"\"\"\n\nfrom __future__ import annotations\n\nfrom functools import wraps\nfrom typing import TYPE_CHECKI"
},
{
"path": "src/textblob/download_corpora.py",
"chars": 1015,
"preview": "#!/usr/bin/env python\n\"\"\"Downloads the necessary NLTK corpora for TextBlob.\n\nUsage: ::\n\n $ python -m textblob.downloa"
},
{
"path": "src/textblob/en/__init__.py",
"chars": 4230,
"preview": "\"\"\"This file is based on pattern.en. See the bundled NOTICE file for\nlicense information.\n\"\"\"\nimport os\n\nfrom textblob._"
},
{
"path": "src/textblob/en/en-context.txt",
"chars": 6578,
"preview": ";;; \n;;; The contextual rules are based on Brill's rule based tagger v1.14,\n;;; trained on Brown corpus and Penn T"
},
{
"path": "src/textblob/en/en-entities.txt",
"chars": 10329,
"preview": "50 Cent PERS\nAIDS\nAK-47\nAT&T ORG\nAbraham Lincoln PERS\nAcropolis LOC\nAdam Sandler PERS\nAdolf Hitler PERS\nAdriana Lima PER"
},
{
"path": "src/textblob/en/en-lexicon.txt",
"chars": 1220311,
"preview": ";;; \n;;; The lexicon was taken from Brill's rule based tagger v1.14,\n;;; trained on Brown corpus and Penn Treebank"
},
{
"path": "src/textblob/en/en-morphology.txt",
"chars": 3274,
"preview": ";;; \n;;; The morphological rules are based on Brill's rule based tagger v1.14,\n;;; trained on Brown corpus and Pen"
},
{
"path": "src/textblob/en/en-sentiment.xml",
"chars": 540708,
"preview": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!-- \nSUBJECTIVITY LEXICON FOR ENGLISH ADJECTIVES.\nAdjectives have a polarity (ne"
},
{
"path": "src/textblob/en/en-spelling.txt",
"chars": 321210,
"preview": ";;; \n;;; Based on several public domain books from Project Gutenberg\n;;; and frequency lists from Wiktionary and t"
},
{
"path": "src/textblob/en/inflect.py",
"chars": 23376,
"preview": "\"\"\"The pluralize and singular methods from the pattern library.\n\nLicenced under the BSD.\nSee here https://github.com/cli"
},
{
"path": "src/textblob/en/np_extractors.py",
"chars": 6704,
"preview": "\"\"\"Various noun phrase extractors.\"\"\"\n\nimport nltk\n\nfrom textblob.base import BaseNPExtractor\nfrom textblob.decorators i"
},
{
"path": "src/textblob/en/parsers.py",
"chars": 417,
"preview": "\"\"\"Various parser implementations.\n\n.. versionadded:: 0.6.0\n\"\"\"\nfrom textblob.base import BaseParser\nfrom textblob.en im"
},
{
"path": "src/textblob/en/sentiments.py",
"chars": 3844,
"preview": "\"\"\"Sentiment analysis implementations.\n\n.. versionadded:: 0.5.0\n\"\"\"\nfrom collections import namedtuple\n\nimport nltk\n\nfro"
},
{
"path": "src/textblob/en/taggers.py",
"chars": 931,
"preview": "\"\"\"Parts-of-speech tagger implementations.\"\"\"\n\nimport nltk\n\nimport textblob as tb\nfrom textblob.base import BaseTagger\nf"
},
{
"path": "src/textblob/exceptions.py",
"chars": 1432,
"preview": "MISSING_CORPUS_MESSAGE = \"\"\"\nLooks like you are missing some required data for this feature.\n\nTo download the necessary "
},
{
"path": "src/textblob/formats.py",
"chars": 4297,
"preview": "\"\"\"File formats for training and testing data.\n\nIncludes a registry of valid file formats. New file formats can be added"
},
{
"path": "src/textblob/inflect.py",
"chars": 354,
"preview": "\"\"\"Make word inflection default to English. This allows for backwards\ncompatibility so you can still import text.inflect"
},
{
"path": "src/textblob/mixins.py",
"chars": 6195,
"preview": "import sys\n\n\nclass ComparableMixin:\n \"\"\"Implements rich operators for an object.\"\"\"\n\n def _cmpkey(self):\n r"
},
{
"path": "src/textblob/np_extractors.py",
"chars": 444,
"preview": "\"\"\"Default noun phrase extractors are for English to maintain backwards\ncompatibility, so you can still do\n\n>>> from tex"
},
{
"path": "src/textblob/parsers.py",
"chars": 343,
"preview": "\"\"\"Default parsers to English for backwards compatibility so you can still do\n\n>>> from textblob.parsers import PatternP"
},
{
"path": "src/textblob/sentiments.py",
"chars": 519,
"preview": "\"\"\"Default sentiment analyzers are English for backwards compatibility, so\nyou can still do\n\n>>> from textblob.sentiment"
},
{
"path": "src/textblob/taggers.py",
"chars": 382,
"preview": "\"\"\"Default taggers to the English taggers for backwards incompatibility, so you\ncan still do\n\n>>> from textblob.taggers "
},
{
"path": "src/textblob/tokenizers.py",
"chars": 2511,
"preview": "\"\"\"Various tokenizer implementations.\n\n.. versionadded:: 0.4.0\n\"\"\"\n\nfrom itertools import chain\n\nimport nltk\n\nfrom textb"
},
{
"path": "src/textblob/utils.py",
"chars": 1766,
"preview": "from __future__ import annotations\n\nimport re\nimport string\nfrom typing import TYPE_CHECKING\n\nif TYPE_CHECKING:\n from"
},
{
"path": "src/textblob/wordnet.py",
"chars": 399,
"preview": "\"\"\"Wordnet interface. Contains classes for creating Synsets and Lemmas\ndirectly.\n\n.. versionadded:: 0.7.0\n\n\"\"\"\n\nimport n"
},
{
"path": "tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/data.csv",
"chars": 268,
"preview": "I love this car,pos\n美丽优于丑陋,pos\nI am so excited about the concert,pos\nI feel great this morning,pos\nHe is my best friend,"
},
{
"path": "tests/data.json",
"chars": 781,
"preview": "[\n {\n \"text\": \"I love this car\",\n \"label\": \"pos\"\n },\n {\n \"text\": \"美丽优于丑陋\",\n \"label\""
},
{
"path": "tests/data.tsv",
"chars": 268,
"preview": "I love this car\tpos\n美丽优于丑陋\tpos\nI am so excited about the concert\tpos\nI feel great this morning\tpos\nHe is my best friend\t"
},
{
"path": "tests/test_blob.py",
"chars": 37223,
"preview": "\"\"\"\nTests for the text processor.\n\"\"\"\n\nimport json\nfrom datetime import datetime\nfrom unittest import TestCase\n\nimport n"
},
{
"path": "tests/test_classifiers.py",
"chars": 13230,
"preview": "import os\nimport unittest\nfrom unittest import mock\n\nimport nltk\nimport pytest\n\nfrom textblob import formats\nfrom textbl"
},
{
"path": "tests/test_decorators.py",
"chars": 430,
"preview": "import unittest\n\nimport pytest\n\nfrom textblob.decorators import requires_nltk_corpus\nfrom textblob.exceptions import Mis"
},
{
"path": "tests/test_formats.py",
"chars": 3247,
"preview": "import os\nimport unittest\n\nfrom textblob import formats\n\nHERE = os.path.abspath(os.path.dirname(__file__))\nCSV_FILE = os"
},
{
"path": "tests/test_inflect.py",
"chars": 1026,
"preview": "from unittest import TestCase\n\nfrom textblob.en.inflect import (\n plural_categories,\n pluralize,\n singular_ie,\n"
},
{
"path": "tests/test_np_extractor.py",
"chars": 1825,
"preview": "import unittest\n\nimport nltk\nimport pytest\n\nfrom textblob.base import BaseNPExtractor\nfrom textblob.np_extractors import"
},
{
"path": "tests/test_parsers.py",
"chars": 426,
"preview": "import unittest\n\nfrom textblob.en import parse as pattern_parse\nfrom textblob.parsers import PatternParser\n\n\nclass TestP"
},
{
"path": "tests/test_sentiments.py",
"chars": 2266,
"preview": "import unittest\n\nimport pytest\n\nfrom textblob.sentiments import (\n CONTINUOUS,\n DISCRETE,\n NaiveBayesAnalyzer,\n"
},
{
"path": "tests/test_taggers.py",
"chars": 1964,
"preview": "import os\nimport unittest\n\nimport pytest\n\nimport textblob.taggers\nfrom textblob.base import BaseTagger\n\nHERE = os.path.a"
},
{
"path": "tests/test_tokenizers.py",
"chars": 2617,
"preview": "import unittest\n\nimport pytest\n\nfrom textblob.tokenizers import (\n SentenceTokenizer,\n WordTokenizer,\n sent_tok"
},
{
"path": "tests/test_utils.py",
"chars": 759,
"preview": "import os\nfrom unittest import TestCase\n\nfrom textblob.utils import is_filelike, lowerstrip, strip_punc\n\nHERE = os.path."
},
{
"path": "tox.ini",
"chars": 698,
"preview": "[tox]\nenvlist =\n lint\n py{39,310,311,312,313}\n py39-lowest\n\n[testenv]\nextras = tests\ndeps =\n lowest: nltk==3"
}
]
About this extraction
This page contains the full source code of the sloria/TextBlob GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 88 files (2.3 MB), approximately 615.5k tokens, and a symbol index with 629 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.