Repository: WZBSocialScienceCenter/tmtoolkit
Branch: master
Commit: 02990865ee89
Files: 99
Total size: 20.4 MB
Directory structure:
gitextract_8gohnn84/
├── .github/
│ └── workflows/
│ ├── runtests.yml
│ └── stale.yml
├── .gitignore
├── .readthedocs.yaml
├── AUTHORS.md
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.rst
├── conftest.py
├── doc/
│ ├── Makefile
│ └── source/
│ ├── api.rst
│ ├── bow.ipynb
│ ├── conf.py
│ ├── data/
│ │ ├── corpus_example/
│ │ │ ├── sample1.txt
│ │ │ ├── sample2.txt
│ │ │ └── sample3.txt
│ │ ├── news_articles_100.pickle
│ │ ├── news_articles_100.xlsx
│ │ └── tm_wordclouds/
│ │ └── .gitignore
│ ├── development.rst
│ ├── getting_started.ipynb
│ ├── index.rst
│ ├── install.rst
│ ├── intro.rst
│ ├── license_note.rst
│ ├── preprocessing.ipynb
│ ├── text_corpora.ipynb
│ ├── topic_modeling.ipynb
│ └── version_history.rst
├── examples/
│ ├── README.md
│ ├── __init__.py
│ ├── _benchmarktools.py
│ ├── benchmark_en_newsarticles.py
│ ├── bundestag18_tfidf.py
│ ├── data/
│ │ ├── ap.pickle
│ │ ├── bt18_sample_1000.pickle
│ │ └── nips.pickle
│ ├── gensim_evaluation.py
│ ├── minimal_tfidf.py
│ ├── topicmod_ap_nips_eval.py
│ └── topicmod_lda.py
├── requirements.txt
├── requirements_doc.txt
├── scripts/
│ ├── fulldata/
│ │ ├── .gitignore
│ │ └── README.md
│ ├── nips_data.py
│ ├── prepare_corpora.R
│ └── tmp/
│ └── .gitignore
├── setup.py
├── tests/
│ ├── __init__.py
│ ├── _testtextdata.py
│ ├── _testtools.py
│ ├── data/
│ │ ├── .gitignore
│ │ ├── 100NewsArticles.csv
│ │ ├── 100NewsArticles.xlsx
│ │ ├── 3ExampleDocs.xlsx
│ │ ├── bt18_speeches_sample.csv
│ │ ├── gutenberg/
│ │ │ ├── kafka_verwandlung.txt
│ │ │ └── werther/
│ │ │ ├── goethe_werther1.txt
│ │ │ └── goethe_werther2.txt
│ │ └── tiny_model_reuters_5_topics.pickle
│ ├── test_bow.py
│ ├── test_corpus.py
│ ├── test_corpusimport.py
│ ├── test_tokenseq.py
│ ├── test_topicmod__eval_tools.py
│ ├── test_topicmod_evaluate.py
│ ├── test_topicmod_model_io.py
│ ├── test_topicmod_model_stats.py
│ ├── test_topicmod_visualize.py
│ └── test_utils.py
├── tmtoolkit/
│ ├── __init__.py
│ ├── __main__.py
│ ├── bow/
│ │ ├── __init__.py
│ │ ├── bow_stats.py
│ │ └── dtm.py
│ ├── corpus/
│ │ ├── __init__.py
│ │ ├── _common.py
│ │ ├── _corpus.py
│ │ ├── _corpusfuncs.py
│ │ ├── _document.py
│ │ ├── _nltk_extras.py
│ │ └── visualize.py
│ ├── tokenseq.py
│ ├── topicmod/
│ │ ├── __init__.py
│ │ ├── _common.py
│ │ ├── _eval_tools.py
│ │ ├── evaluate.py
│ │ ├── model_io.py
│ │ ├── model_stats.py
│ │ ├── parallel.py
│ │ ├── tm_gensim.py
│ │ ├── tm_lda.py
│ │ ├── tm_sklearn.py
│ │ └── visualize.py
│ ├── types.py
│ └── utils.py
└── tox.ini
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/runtests.yml
================================================
# GitHub actions workflow for testing tmtoolkit
# Runs tests on Ubuntu, MacOS and Windows with Python versions 3.8, 3.9 and 3.10 and the "minimal" and "full" test suites each, which means 18 jobs are spawned.
# Tests are run using tox (https://tox.wiki/).
#
# author: Markus Konrad <markus.konrad@wzb.eu>
name: run tests
on:
push:
branches:
- master
- develop
- 'release*'
jobs:
build:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: ["3.8", "3.9", "3.10"]
testsuite: ["minimal", "full"]
steps:
- uses: actions/checkout@v2
- name: set up python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
- name: install system dependencies (linux)
if: runner.os == 'Linux'
# only managed to install system dependencies on Linux runners
run: |
sudo apt update
sudo apt install libgmp-dev libmpfr-dev libmpc-dev
- name: install python dependencies
run: |
python -m pip install --upgrade pip
pip install tox
- name: run tox (linux)
# since system dependencies could only be installed on Linux runners, we run the "full" suite only on Linux ...
if: runner.os == 'Linux'
run: tox -e py-${{ matrix.testsuite }} -- --hypothesis-profile=ci
- name: run tox (macos or windows - minimal)
if: runner.os != 'Linux' && matrix.testsuite == 'minimal'
run: tox -e py-minimal -- --hypothesis-profile=ci
- name: run tox (macos or windows - recommendedextra)
# ... on all other OS we run the "recommendedextra" suite instead of the "full" suite
if: runner.os != 'Linux' && matrix.testsuite == 'full'
run: tox -e py-recommendedextra -- --hypothesis-profile=ci
================================================
FILE: .github/workflows/stale.yml
================================================
name: Close inactive issues
on:
schedule:
- cron: "23 3 * * *"
jobs:
close-issues:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v3
with:
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
days-before-pr-stale: -1
days-before-pr-close: -1
repo-token: ${{ secrets.GITHUB_TOKEN }}
================================================
FILE: .gitignore
================================================
.cache/
.idea/
**/__pycache__
*.pyc
.hypothesis
build/
dist/
*.egg-info/
.~lock.*
examples/data/*.pickle
!examples/data/ap.pickle
!examples/data/nips.pickle
!examples/data/bt18_sample_1000.pickle
**/.ipynb_checkpoints/
.pytest_cache/
.covreport/
.tox/
.Rhistory
doc/source/data/corpus_norm.pickle
.coverage
================================================
FILE: .readthedocs.yaml
================================================
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Build documentation in the doc/ directory with Sphinx
sphinx:
configuration: doc/source/conf.py
# Set the version of Python and other tools you might need
build:
os: ubuntu-20.04
tools:
python: "3.9"
# Optionally set the version of Python and requirements required to build your docs
python:
install:
- requirements: requirements_doc.txt
================================================
FILE: AUTHORS.md
================================================
# Authors
## Maintainer / main developer
[Markus Konrad](https://github.com/internaut) @ [WZB](https://github.com/WZBSocialScienceCenter/)
## Contributors
Sorted by date of first contribution:
* [Matt Cooper](https://github.com/mcooper)
* [Dominik Domhoff](https://github.com/ddomhoff)
* [Christof Kälin](https://github.com/christofkaelin)
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
================================================
FILE: MANIFEST.in
================================================
include AUTHORS.md
include conftest.py
include LICENSE
include README.rst
include requirements.txt
include requirements_doc.txt
graft doc/source
prune doc/source/.ipynb_*
graft tmtoolkit/data
================================================
FILE: Makefile
================================================
run_tests:
PYTHONPATH=. pytest -l tests/
cov_tests:
PYTHONPATH=. pytest --cov-report html:.covreport --cov=tmtoolkit tests/
coverage-badge -f -o coverage.svg
#rm .coverage*
sdist:
python setup.py sdist
wheel:
python setup.py bdist_wheel
readme:
cat doc/source/intro.rst > README.rst
echo >> README.rst
echo >> README.rst
cat doc/source/install.rst >> README.rst
echo >> README.rst
echo >> README.rst
cat doc/source/license_note.rst >> README.rst
================================================
FILE: README.rst
================================================
**This repository is archived. Further development of tmtoolkit has moved to https://github.com/internaut/tmtoolkit.**
------------
tmtoolkit: Text mining and topic modeling toolkit
=================================================
*tmtoolkit* is a set of tools for text mining and topic modeling with Python, developed especially for use in the
social sciences, journalism and related disciplines. It aims for easy installation, extensive documentation
and a clear programming interface while offering good performance on large datasets by means of vectorized
operations (via NumPy) and parallel computation (using Python's *multiprocessing* module and the
`loky <https://loky.readthedocs.io/>`_ package). tmtoolkit's text mining capabilities are built around
`SpaCy <https://spacy.io/>`_, which offers `many language models <https://spacy.io/models>`_.
The documentation for tmtoolkit is available on `tmtoolkit.readthedocs.org <https://tmtoolkit.readthedocs.org>`_ and
the GitHub code repository is on
`github.com/WZBSocialScienceCenter/tmtoolkit <https://github.com/WZBSocialScienceCenter/tmtoolkit>`_.
**Upgrade note:**
Since Feb 8 2022, the newest version 0.11.0 of tmtoolkit is available on PyPI. This version features a new API
for text processing and mining which is incompatible with prior versions. It's advisable to read the
first three chapters of the `tutorial <https://tmtoolkit.readthedocs.io/en/latest/getting_started.html>`_
to get used to the new API. You should also re-install tmtoolkit in a new virtual environment or completely
remove the old version prior to upgrading. See the
`installation instructions <https://tmtoolkit.readthedocs.io/en/latest/install.html>`_.
Requirements and installation
-----------------------------
**tmtoolkit works with Python 3.8 or newer (tested up to Python 3.10).**
The tmtoolkit package is highly modular and tries to install as few dependencies as possible. For requirements and
installation procedures, please have a look at the
`installation section in the documentation <https://tmtoolkit.readthedocs.io/en/latest/install.html>`_. In short,
the recommended way of installing tmtoolkit is to create and activate a
`Python Virtual Environment ("venv") <https://docs.python.org/3/tutorial/venv.html>`_ and then install tmtoolkit with
a recommended set of dependencies and a list of language models via the following:
.. code-block:: text
pip install -U "tmtoolkit[recommended]"
# add or remove language codes in the list for installing the models that you need;
# don't use spaces in the list of languages
python -m tmtoolkit setup en,de
Again, you should have a look at the detailed
`installation instructions <https://tmtoolkit.readthedocs.io/en/latest/install.html>`_ in order to install additional
packages that enable more features such as topic modeling.
Features
--------
Text preprocessing
^^^^^^^^^^^^^^^^^^
The tmtoolkit package offers several text preprocessing and text mining methods, including:
- `tokenization, sentence segmentation, part-of-speech (POS) tagging, named-entity recognition (NER) <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Configuring-the-NLP-pipeline,-parallel-processing-and-more-via-Corpus-parameters>`_ (via SpaCy)
- `lemmatization and token normalization <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Lemmatization-and-token-normalization>`_
- extensive `pattern matching capabilities <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Common-parameters-for-pattern-matching-functions>`_
(exact matching, regular expressions or "glob" patterns) to be used in many
methods of the package, e.g. for filtering on token or document level, or for
`keywords-in-context (KWIC) <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Keywords-in-context-(KWIC)-and-general-filtering-methods>`_
- adding and managing
`custom document and token attributes <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Working-with-document-and-token-attributes>`_
- accessing text corpora along with their
`document and token attributes as dataframes <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Accessing-tokens-and-token-attributes>`_
- calculating and `visualizing corpus summary statistics <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Visualizing-corpus-summary-statistics>`_
- identifying and joining `collocations <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Identifying-and-joining-token-collocations>`_
- `splitting and sampling corpora <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Corpus-functions-for-document-management>`_
- generating `n-grams <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-n-grams>`_
- generating `sparse document-term matrices <https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)>`_
Wherever possible and useful, these methods can operate in parallel to speed up computations with large datasets.
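A minimal sketch of how such a preprocessing pipeline can be put together with the functional API (the exact
``Corpus`` constructor arguments shown here are assumptions; see the linked documentation for details):

.. code-block:: python

    from tmtoolkit.corpus import Corpus, lemmatize, to_lowercase, filter_clean_tokens, dtm

    # build a small English corpus from raw document texts
    docs = Corpus({'doc1': 'This is a minimal example document.',
                   'doc2': 'And here is another very short example.'},
                  language='en')

    lemmatize(docs)               # replace each token by its lemma
    to_lowercase(docs)            # normalize token case
    filter_clean_tokens(docs)     # remove punctuation, stopwords etc.

    mat = dtm(docs)               # sparse document-term matrix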
Topic modeling
^^^^^^^^^^^^^^
* `model computation in parallel <https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Computing-topic-models-in-parallel>`_ for different corpora
and/or parameter sets
* support for `lda <http://pythonhosted.org/lda/>`_,
`scikit-learn <http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html>`_
and `gensim <https://radimrehurek.com/gensim/>`_ topic modeling backends
* `evaluation of topic models <https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Evaluation-of-topic-models>`_ (e.g. in order to find an optimal number
of topics for a given dataset; see the example sketch below this list) using several implemented metrics:
* model coherence (`Mimno et al. 2011 <https://dl.acm.org/citation.cfm?id=2145462>`_) or
`metrics implemented in Gensim <https://radimrehurek.com/gensim/models/coherencemodel.html>`_
* KL divergence method (`Arun et al. 2010 <http://doi.org/10.1007/978-3-642-13657-3_43>`_)
* probability of held-out documents (`Wallach et al. 2009 <https://doi.org/10.1145/1553374.1553515>`_)
* pair-wise cosine distance method (`Cao Juan et al. 2009 <http://doi.org/10.1016/j.neucom.2008.06.011>`_)
* harmonic mean method (`Griffiths, Steyvers 2004 <http://doi.org/10.1073/pnas.0307752101>`_)
* the loglikelihood or perplexity methods natively implemented in lda, sklearn or gensim
* `plotting of evaluation results <https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Evaluation-of-topic-models>`_
* `common statistics for topic models <https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Common-statistics-and-tools-for-topic-models>`_ such as
word saliency and distinctiveness (`Chuang et al. 2012 <https://dl.acm.org/citation.cfm?id=2254572>`_), topic-word
relevance (`Sievert and Shirley 2014 <https://www.aclweb.org/anthology/W14-3110>`_)
* `finding / filtering topics with pattern matching <https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Filtering-topics>`_
* `export estimated document-topic and topic-word distributions to Excel
<https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Displaying-and-exporting-topic-modeling-results>`_
* `visualize topic-word distributions and document-topic distributions <https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html#Visualizing-topic-models>`_
as word clouds or heatmaps
* model coherence (`Mimno et al. 2011 <https://dl.acm.org/citation.cfm?id=2145462>`_) for individual topics
* integrate `PyLDAVis <https://pyldavis.readthedocs.io/en/latest/>`_ to visualize results
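A rough sketch of a parallel evaluation run with the ``lda`` backend, using the functions listed in the API
documentation (the exact parameter names passed to the backend are assumptions here, not a verified recipe):

.. code-block:: python

    from tmtoolkit.topicmod import tm_lda
    from tmtoolkit.topicmod.evaluate import results_by_parameter
    from tmtoolkit.topicmod.visualize import plot_eval_results

    # `mat` is a (sparse) document-term matrix, e.g. generated via tmtoolkit.corpus.dtm()
    eval_results = tm_lda.evaluate_topic_models(
        mat,
        varying_parameters=[{'n_topics': k} for k in range(10, 101, 10)],
        constant_parameters={'n_iter': 1000, 'random_state': 20191122}
    )

    # aggregate the metric values by number of topics and plot them
    plot_eval_results(results_by_parameter(eval_results, 'n_topics'))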
Other features
^^^^^^^^^^^^^^
- loading and cleaning of raw text from
`text files, tabular files (CSV or Excel), ZIP files or folders <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Loading-text-data>`_
- `splitting and joining documents <https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html#Corpus-functions-for-document-management>`_
- `common statistics and transformations for document-term matrices <https://tmtoolkit.readthedocs.io/en/latest/bow.html>`_ like word cooccurrence and *tf-idf*
Limits
------
* only languages for which `SpaCy language models <https://spacy.io/models>`_ are available are supported
* all data must reside in memory, i.e. no streaming of large data from the hard disk (which for example
`Gensim <https://radimrehurek.com/gensim/>`_ supports)
Contribute
----------
If you'd like to contribute, please read the `developer documentation <https://tmtoolkit.readthedocs.io/en/latest/development.html>`_ first.
License
-------
Code licensed under `Apache License 2.0 <https://www.apache.org/licenses/LICENSE-2.0>`_.
See `LICENSE <https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/LICENSE>`_ file.
.. |pypi| image:: https://badge.fury.io/py/tmtoolkit.svg
:target: https://badge.fury.io/py/tmtoolkit
:alt: PyPI Version
.. |pypi_downloads| image:: https://img.shields.io/pypi/dm/tmtoolkit
:target: https://pypi.org/project/tmtoolkit/
:alt: Downloads from PyPI
.. |runtests| image:: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml/badge.svg
:target: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml
:alt: GitHub Actions CI Build Status
.. |coverage| image:: https://raw.githubusercontent.com/WZBSocialScienceCenter/tmtoolkit/master/coverage.svg?sanitize=true
:target: https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/tests
:alt: Coverage status
.. |rtd| image:: https://readthedocs.org/projects/tmtoolkit/badge/?version=latest
:target: https://tmtoolkit.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
.. |zenodo| image:: https://zenodo.org/badge/109812180.svg
:target: https://zenodo.org/badge/latestdoi/109812180
:alt: Citable Zenodo DOI
================================================
FILE: conftest.py
================================================
"""
Configuration for tests with pytest
.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>
"""
from hypothesis import settings, HealthCheck
# set default timeout deadline
settings.register_profile('default', deadline=5000)
# profile for CI runs on GitHub machines, which may be slow from time to time so we disable the "too slow" HealthCheck
# and set the timeout deadline very high (60 sec.)
settings.register_profile('ci', suppress_health_check=(HealthCheck.too_slow, ), deadline=60000)
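# a profile can be selected when invoking pytest, e.g.: pytest --hypothesis-profile=ci tests/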
# load default settings profile
settings.load_profile('default')
================================================
FILE: doc/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
notebooks:
jupyter nbconvert --to notebook --execute --inplace --ExecutePreprocessor.timeout=600 --PlainTextFormatter.max_seq_length=20 source/*.ipynb
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: doc/source/api.rst
================================================
.. _api:
API
===
tmtoolkit.bow
-------------
tmtoolkit.bow.bow_stats
^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.bow.bow_stats
:members:
tmtoolkit.bow.dtm
^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.bow.dtm
:members:
tmtoolkit.corpus
----------------
Corpus class and corpus functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.corpus
:members:
:imported-members:
:exclude-members: find_spec, strip_tags, numbertoken_to_magnitude, simplify_unicode_chars, visualize
Functions to visualize corpus summary statistics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.corpus.visualize
:members:
tmtoolkit.tokenseq
------------------
.. automodule:: tmtoolkit.tokenseq
:members:
tmtoolkit.topicmod
------------------
.. automodule:: tmtoolkit.topicmod
:members:
Evaluation metrics for Topic Modeling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.evaluate
:members:
Printing, importing and exporting topic model results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.model_io
:members:
Statistics for topic models and BoW matrices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.model_stats
:members:
Parallel model fitting and evaluation with lda
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.tm_lda
:members: AVAILABLE_METRICS, DEFAULT_METRICS, compute_models_parallel, evaluate_topic_models
Parallel model fitting and evaluation with scikit-learn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.tm_sklearn
:members: AVAILABLE_METRICS, DEFAULT_METRICS, compute_models_parallel, evaluate_topic_models
Parallel model fitting and evaluation with Gensim
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.tm_gensim
:members: AVAILABLE_METRICS, DEFAULT_METRICS, compute_models_parallel, evaluate_topic_models
Visualize topic models and topic model evaluation results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Wordclouds from topic models
""""""""""""""""""""""""""""
.. autodata:: tmtoolkit.topicmod.visualize.DEFAULT_WORDCLOUD_KWARGS
.. autofunction:: tmtoolkit.topicmod.visualize.generate_wordclouds_for_topic_words
.. autofunction:: tmtoolkit.topicmod.visualize.generate_wordclouds_for_document_topics
.. autofunction:: tmtoolkit.topicmod.visualize.generate_wordcloud_from_probabilities_and_words
.. autofunction:: tmtoolkit.topicmod.visualize.generate_wordcloud_from_weights
.. autofunction:: tmtoolkit.topicmod.visualize.write_wordclouds_to_folder
.. autofunction:: tmtoolkit.topicmod.visualize.generate_wordclouds_from_distribution
Plot heatmaps for topic models
""""""""""""""""""""""""""""""
.. autofunction:: tmtoolkit.topicmod.visualize.plot_doc_topic_heatmap
.. autofunction:: tmtoolkit.topicmod.visualize.plot_topic_word_heatmap
.. autofunction:: tmtoolkit.topicmod.visualize.plot_heatmap
Plot probability distribution rankings for topic models
"""""""""""""""""""""""""""""""""""""""""""""""""""""""
.. autofunction:: tmtoolkit.topicmod.visualize.plot_topic_word_ranked_prob
.. autofunction:: tmtoolkit.topicmod.visualize.plot_doc_topic_ranked_prob
.. autofunction:: tmtoolkit.topicmod.visualize.plot_prob_distrib_ranked_prob
Plot topic model evaluation results
"""""""""""""""""""""""""""""""""""
.. autofunction:: tmtoolkit.topicmod.visualize.plot_eval_results
Other functions
"""""""""""""""
.. autofunction:: tmtoolkit.topicmod.visualize.parameters_for_ldavis
Base classes for parallel model fitting and evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. automodule:: tmtoolkit.topicmod.parallel
:members:
tmtoolkit.utils
---------------
.. automodule:: tmtoolkit.utils
:members:
================================================
FILE: doc/source/bow.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with the Bag-of-Words representation\n",
"\n",
"The [bow module](api.rst#tmtoolkit-bow) in tmtoolkit contains several functions for working with Bag-of-Words (BoW) representations of documents. It's divided into two sub-modules: [bow.bow_stats](api.rst#module-tmtoolkit.bow.bow_stats) and [bow.dtm](api.rst#module-tmtoolkit.bow.dtm). The former implements several statistics and transformations for BoW representations, the latter contains functions to create and convert sparse or dense document-term matrices (DTMs).\n",
"\n",
"Most of the functions in both sub-modules accept and/or return sparse DTMs. The [previous chapter](preprocessing.ipynb) contained a section about what sparse DTMs are and [how they can be generated with tmtoolkit](preprocessing.ipynb#Generating-a-sparse-document-term-matrix-(DTM))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An example document-term matrix\n",
"\n",
"Before we start with the [bow.dtm](api.rst#module-tmtoolkit.bow.dtm) module, we will generate a sparse DTM from a small example corpus."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:01.648092Z",
"iopub.status.busy": "2022-03-11T08:31:01.647002Z",
"iopub.status.idle": "2022-03-11T08:31:07.794595Z",
"shell.execute_reply": "2022-03-11T08:31:07.795222Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Corpus with 5 documents in English\n",
"> NewsArticles-3665 (1158 tokens): Presidential elections in France have never been a...\n",
"> NewsArticles-2058 (1174 tokens): Merkel : ' Only if Europe is doing well , will Ger...\n",
"> NewsArticles-3016 (621 tokens): Farron likens PM 's politics to Trump 's and Putin...\n",
"> NewsArticles-1206 (135 tokens): Man critical after four - car collision in Dublin ...\n",
"> NewsArticles-119 (110 tokens): Is a ' seven - day NHS ' feasible ? The \" seven...\n",
"total number of tokens: 3198 / vocabulary size: 1170\n"
]
}
],
"source": [
"import random\n",
"random.seed(20191113) # to make the sampling reproducible\n",
"\n",
"import numpy as np\n",
"np.set_printoptions(precision=5)\n",
"\n",
"from tmtoolkit.corpus import Corpus, print_summary\n",
"\n",
"corpus = Corpus.from_builtin_corpus('en-NewsArticles', sample=5)\n",
"print_summary(corpus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We employ a preprocessing pipeline that removes a lot of information from our original data in order to obtain a very condensed DTM."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.813273Z",
"iopub.status.busy": "2022-03-11T08:31:07.808624Z",
"iopub.status.idle": "2022-03-11T08:31:07.900780Z",
"shell.execute_reply": "2022-03-11T08:31:07.901468Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>is_punct</th>\n",
" <th>is_stop</th>\n",
" <th>lemma</th>\n",
" <th>like_num</th>\n",
" <th>pos</th>\n",
" <th>tag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NewsArticles-119</td>\n",
" <td>0</td>\n",
" <td>day</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>day</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NewsArticles-119</td>\n",
" <td>1</td>\n",
" <td>nhs</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NHS</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NewsArticles-119</td>\n",
" <td>2</td>\n",
" <td>day</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>day</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NewsArticles-119</td>\n",
" <td>3</td>\n",
" <td>nhs</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NHS</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NewsArticles-119</td>\n",
" <td>4</td>\n",
" <td>pledge</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>pledge</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>914</th>\n",
" <td>NewsArticles-3665</td>\n",
" <td>349</td>\n",
" <td>article</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>article</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>915</th>\n",
" <td>NewsArticles-3665</td>\n",
" <td>350</td>\n",
" <td>author</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>author</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>916</th>\n",
" <td>NewsArticles-3665</td>\n",
" <td>351</td>\n",
" <td>al</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Al</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>917</th>\n",
" <td>NewsArticles-3665</td>\n",
" <td>352</td>\n",
" <td>jazeera</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Jazeera</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>918</th>\n",
" <td>NewsArticles-3665</td>\n",
" <td>353</td>\n",
" <td>policy</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>policy.-</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>919 rows × 9 columns</p>\n",
"</div>"
],
"text/plain": [
" doc position token is_punct is_stop lemma \\\n",
"0 NewsArticles-119 0 day False False day \n",
"1 NewsArticles-119 1 nhs False False NHS \n",
"2 NewsArticles-119 2 day False False day \n",
"3 NewsArticles-119 3 nhs False False NHS \n",
"4 NewsArticles-119 4 pledge False False pledge \n",
".. ... ... ... ... ... ... \n",
"914 NewsArticles-3665 349 article False False article \n",
"915 NewsArticles-3665 350 author False False author \n",
"916 NewsArticles-3665 351 al False False Al \n",
"917 NewsArticles-3665 352 jazeera False False Jazeera \n",
"918 NewsArticles-3665 353 policy False False policy.- \n",
"\n",
" like_num pos tag \n",
"0 False NOUN NN \n",
"1 False PROPN NNP \n",
"2 False NOUN NN \n",
"3 False PROPN NNP \n",
"4 False NOUN NN \n",
".. ... ... ... \n",
"914 False NOUN NN \n",
"915 False NOUN NN \n",
"916 False PROPN NNP \n",
"917 False PROPN NNP \n",
"918 False NOUN NN \n",
"\n",
"[919 rows x 9 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import (lemmatize, filter_for_pos, to_lowercase,\n",
" remove_punctuation, filter_clean_tokens, remove_common_tokens,\n",
" tokens_table)\n",
"\n",
"\n",
"corpus_norm = lemmatize(corpus, inplace=False)\n",
"filter_for_pos(corpus_norm, 'N')\n",
"to_lowercase(corpus_norm)\n",
"remove_punctuation(corpus_norm)\n",
"filter_clean_tokens(corpus_norm, remove_shorter_than=2)\n",
"# remove tokens that occur in all documents\n",
"remove_common_tokens(corpus_norm, df_threshold=5, proportions=0)\n",
" \n",
"tokens_table(corpus_norm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We retained all documents, but removed more than half of the token types:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.907873Z",
"iopub.status.busy": "2022-03-11T08:31:07.907071Z",
"iopub.status.idle": "2022-03-11T08:31:07.910623Z",
"shell.execute_reply": "2022-03-11T08:31:07.909986Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(5, 516)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import vocabulary_size\n",
"\n",
"len(corpus_norm), vocabulary_size(corpus_norm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fetch the document labels and vocabulary and convert them to NumPy arrays, because such arrays allow advanced indexing methods such as boolean indexing."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.917028Z",
"iopub.status.busy": "2022-03-11T08:31:07.916249Z",
"iopub.status.idle": "2022-03-11T08:31:07.919215Z",
"shell.execute_reply": "2022-03-11T08:31:07.919844Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['NewsArticles-119', 'NewsArticles-1206', 'NewsArticles-2058',\n",
" 'NewsArticles-3016', 'NewsArticles-3665'], dtype='<U17')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import doc_labels\n",
"\n",
"labels = np.array(doc_labels(corpus_norm))\n",
"labels"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.925972Z",
"iopub.status.busy": "2022-03-11T08:31:07.925157Z",
"iopub.status.idle": "2022-03-11T08:31:07.928469Z",
"shell.execute_reply": "2022-03-11T08:31:07.928861Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['110pm', '70', 'abuse', 'access', 'accession', 'accusation', 'act',\n",
" 'addition', 'address', 'administration'], dtype='<U16')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import vocabulary\n",
"\n",
"vocab = np.array(vocabulary(corpus_norm))\n",
"vocab[:10] # only showing the first 10 token types here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we generate the sparse DTM:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.935252Z",
"iopub.status.busy": "2022-03-11T08:31:07.934343Z",
"iopub.status.idle": "2022-03-11T08:31:07.966878Z",
"shell.execute_reply": "2022-03-11T08:31:07.967381Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<5x516 sparse matrix of type '<class 'numpy.int32'>'\n",
"\twith 576 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import dtm\n",
"\n",
"mat = dtm(corpus_norm)\n",
"mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a sparse DTM `mat`, an array of document labels `labels` that represent the rows of the DTM and an array of vocabulary tokens `vocab` that represent the columns of the DTM. We will use this data for the remainder of the chapter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `bow.dtm` module\n",
"\n",
"This module is quite small. Most importantly, there's a function to convert a DTM to a pandas DataFrame, [dtm_to_dataframe](api.rst#tmtoolkit.bow.dtm.dtm_to_dataframe). Note that the generated dataframe is *dense*, i.e. it uses up (much) more memory than the input DTM.\n",
"\n",
"Let's generate a dataframe from our DTM, the document labels and the vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.987276Z",
"iopub.status.busy": "2022-03-11T08:31:07.986335Z",
"iopub.status.idle": "2022-03-11T08:31:07.989631Z",
"shell.execute_reply": "2022-03-11T08:31:07.990036Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>110pm</th>\n",
" <th>70</th>\n",
" <th>abuse</th>\n",
" <th>access</th>\n",
" <th>accession</th>\n",
" <th>accusation</th>\n",
" <th>act</th>\n",
" <th>addition</th>\n",
" <th>address</th>\n",
" <th>administration</th>\n",
" <th>...</th>\n",
" <th>wing</th>\n",
" <th>winston</th>\n",
" <th>work</th>\n",
" <th>workers</th>\n",
" <th>world</th>\n",
" <th>wound</th>\n",
" <th>year</th>\n",
" <th>york</th>\n",
" <th>yucel</th>\n",
" <th>�</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>NewsArticles-119</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NewsArticles-1206</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NewsArticles-2058</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NewsArticles-3016</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NewsArticles-3665</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 516 columns</p>\n",
"</div>"
],
"text/plain": [
" 110pm 70 abuse access accession accusation act \\\n",
"NewsArticles-119 0 0 0 1 0 0 0 \n",
"NewsArticles-1206 1 1 0 0 0 0 0 \n",
"NewsArticles-2058 0 0 0 0 1 1 0 \n",
"NewsArticles-3016 0 0 1 0 0 0 0 \n",
"NewsArticles-3665 0 0 0 1 0 0 1 \n",
"\n",
" addition address administration ... wing winston \\\n",
"NewsArticles-119 0 0 0 ... 0 0 \n",
"NewsArticles-1206 0 0 0 ... 0 0 \n",
"NewsArticles-2058 0 0 0 ... 1 0 \n",
"NewsArticles-3016 0 0 0 ... 0 1 \n",
"NewsArticles-3665 1 1 1 ... 1 0 \n",
"\n",
" work workers world wound year york yucel � \n",
"NewsArticles-119 0 0 0 0 0 0 0 0 \n",
"NewsArticles-1206 0 0 0 0 0 0 0 2 \n",
"NewsArticles-2058 2 1 0 0 2 0 2 0 \n",
"NewsArticles-3016 0 0 3 1 0 1 0 0 \n",
"NewsArticles-3665 0 0 0 0 1 0 0 0 \n",
"\n",
"[5 rows x 516 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.dtm import dtm_to_dataframe\n",
"\n",
"dtm_to_dataframe(mat, labels, vocab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that an index with the document labels was created and that the vocabulary tokens become the column names.\n",
"\n",
"You can combine tmtoolkit with [Gensim](https://radimrehurek.com/gensim/). The `bow.dtm` module provides several functions to convert data between both packages:\n",
"\n",
"- [dtm_and_vocab_to_gensim_corpus_and_dict](api.rst#tmtoolkit.bow.dtm.dtm_and_vocab_to_gensim_corpus_and_dict): converts a (sparse) DTM and a vocabulary list to a *Gensim Corpus* and *Gensim Dictionary*\n",
"- [dtm_to_gensim_corpus](api.rst#tmtoolkit.bow.dtm.dtm_to_gensim_corpus): convert a (sparse) DTM only to a *Gensim Corpus*\n",
"- [gensim_corpus_to_dtm](api.rst#tmtoolkit.bow.dtm.gensim_corpus_to_dtm): converts a *Gensim Corpus* object to a sparse DTM in COO format"
]
},
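{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief sketch (the function names come from the list above; their exact call signatures are assumptions and not verified here), converting our sparse DTM `mat` to a Gensim corpus and back could look like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tmtoolkit.bow.dtm import dtm_to_gensim_corpus, gensim_corpus_to_dtm\n",
"\n",
"# convert the sparse DTM to a Gensim corpus ...\n",
"gensim_corp = dtm_to_gensim_corpus(mat)\n",
"# ... and convert it back to a sparse DTM in COO format\n",
"mat_coo = gensim_corpus_to_dtm(gensim_corp)"
]
},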
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The `bow.bow_stats` module\n",
"\n",
"This module provides several statistics and transformations for sparse or dense DTMs.\n",
"\n",
"### Document lengths, document and term frequencies, token co-occurrences\n",
"\n",
"Let's start with the [doc_lengths](api.rst#tmtoolkit.bow.bow_stats.doc_lengths) function, which simply gives the number of tokens per document (i.e. the row-wise sum of the DTM):"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:07.996085Z",
"iopub.status.busy": "2022-03-11T08:31:07.995474Z",
"iopub.status.idle": "2022-03-11T08:31:07.998779Z",
"shell.execute_reply": "2022-03-11T08:31:07.999384Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 38, 40, 330, 157, 354])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import doc_lengths\n",
"\n",
"doc_lengths(mat)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The returned array is aligned to the document labels `labels` so we can see that the last document, \"NewsArticles-3665\", is the one with the most tokens. Or to do it computationally:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.007554Z",
"iopub.status.busy": "2022-03-11T08:31:08.006769Z",
"iopub.status.idle": "2022-03-11T08:31:08.009630Z",
"shell.execute_reply": "2022-03-11T08:31:08.010059Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'NewsArticles-3665'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels[doc_lengths(mat).argmax()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While `doc_lengths` gives the row-wise sum across the DTM, [term_frequencies](api.rst#tmtoolkit.bow.bow_stats.term_frequencies) gives the column-wise sum. This means it returns an array of the length of the vocabulary's size where each entry in that array reflects the number of occurrences of the respective vocabulary token (aka term).\n",
"\n",
"Let's calculate that measure, get its maximum and the token type(s) for that maximum value:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.016206Z",
"iopub.status.busy": "2022-03-11T08:31:08.015466Z",
"iopub.status.idle": "2022-03-11T08:31:08.020657Z",
"shell.execute_reply": "2022-03-11T08:31:08.021102Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(23, array(['medium'], dtype='<U16'))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import term_frequencies\n",
"\n",
"term_freq = term_frequencies(mat)\n",
"(term_freq.max(), vocab[term_freq == term_freq.max()])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also possible to calculate the proportional frequency, i.e. normalize the counts by the overall number of tokens via `proportions=1`. Alternatively, `proportions=2` gives you log proportions."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.027801Z",
"iopub.status.busy": "2022-03-11T08:31:08.027035Z",
"iopub.status.idle": "2022-03-11T08:31:08.030208Z",
"shell.execute_reply": "2022-03-11T08:31:08.029811Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['candidate', 'eu', 'macron', 'medium', 'merkel', 'refugee'],\n",
" dtype='<U16')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"term_prop = term_frequencies(mat, proportions=1)\n",
"vocab[term_prop >= 0.01]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function [doc_frequencies](api.rst#tmtoolkit.bow.bow_stats.doc_frequencies) returns how often each token in the vocabulary occurs at least *n* times per document. You can control *n* per parameter `min_val` which is set to `1` by default. The returned array is aligned with the vocabulary. Here, we calculate the document frequency with the default value `min_val=1`, extract the maximum document frequency and see which of the tokens in the `vocab` array reach the maximum document frequency:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.036728Z",
"iopub.status.busy": "2022-03-11T08:31:08.035151Z",
"iopub.status.idle": "2022-03-11T08:31:08.040272Z",
"shell.execute_reply": "2022-03-11T08:31:08.039582Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(4, array(['minister'], dtype='<U16'))"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import doc_frequencies\n",
"\n",
"df = doc_frequencies(mat)\n",
"max_df = df.max()\n",
"max_df, vocab[df == max_df]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It turns out that the maximum document frequency is 4 and only the token \"minister\" reaches that document frequency. This means only \"minister\" is mentioned across 4 documents at least once (because `min_val` is `1`). Remember that during preprocessing, we removed all tokens that occur across *all* five documents, hence there can't be a vocabulary token with a document frequency of 5.\n",
"\n",
"Let's see which vocabulary tokens occur within a single document at least 10 times:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.047063Z",
"iopub.status.busy": "2022-03-11T08:31:08.045099Z",
"iopub.status.idle": "2022-03-11T08:31:08.051111Z",
"shell.execute_reply": "2022-03-11T08:31:08.051606Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['candidate', 'eu', 'macron', 'medium', 'merkel', 'refugee'],\n",
" dtype='<U16')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = doc_frequencies(mat, min_val=10)\n",
"vocab[df > 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also calculate the *co-document frequency* or *token co-occurrence* matrix via [codoc_frequencies](api.rst#tmtoolkit.bow.bow_stats.codoc_frequencies). This measures how often each pair of vocabulary tokens occurs at least *n* times together in the same document. Again, you can control *n* per parameter `min_val` which is set to `1` by default. The result is a sparse matrix of shape *vocabulary size* by *vocabulary size*. The columns and rows give the pairs of tokens from the vocabulary.\n",
"\n",
"Let's generate a co-document frequency matrix and convert it to a dense representation, because our further operations don't support sparse matrices.\n",
"\n",
"A co-document frequency matrix is symmetric along the diagonal, because co-occurrence between a pair `(token1, token2)` is always the same as between `(token2, token1)`. We want to filter out the duplicate pairs and for that use [np.triu](https://docs.scipy.org/doc/numpy/reference/generated/numpy.triu.html) to take only the upper triangle of the matrix, i.e. set all values in the lower triangle including the matrix diagonal to zero (`k=1` does this):"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.056891Z",
"iopub.status.busy": "2022-03-11T08:31:08.055943Z",
"iopub.status.idle": "2022-03-11T08:31:08.067091Z",
"shell.execute_reply": "2022-03-11T08:31:08.067515Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 1, 0, ..., 0, 0, 1],\n",
" [0, 0, 0, ..., 0, 0, 1],\n",
" [0, 0, 0, ..., 1, 0, 0],\n",
" ...,\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0]])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import codoc_frequencies\n",
"\n",
"codoc_mat = codoc_frequencies(mat).todense()\n",
"codoc_upper = np.triu(codoc_mat, k=1)\n",
"codoc_upper"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we create a list that contains the pairs of tokens that occur together in at least two documents (`codoc_upper > 1`) together with their co-document frequency:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.079757Z",
"iopub.status.busy": "2022-03-11T08:31:08.073605Z",
"iopub.status.idle": "2022-03-11T08:31:08.083176Z",
"shell.execute_reply": "2022-03-11T08:31:08.084275Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('government', 'minister', 3),\n",
" ('minister', 'time', 3),\n",
" ('access', 'channel', 2),\n",
" ('access', 'day', 2),\n",
" ('access', 'minister', 2),\n",
" ('access', 'news', 2),\n",
" ('april', 'author', 2),\n",
" ('april', 'co', 2),\n",
" ('april', 'critic', 2),\n",
" ('april', 'distribution', 2),\n",
" ('april', 'heart', 2),\n",
" ('april', 'law', 2),\n",
" ('april', 'minister', 2),\n",
" ('april', 'policy', 2),\n",
" ('april', 'question', 2),\n",
" ('april', 'right', 2),\n",
" ('april', 'state', 2),\n",
" ('april', 'support', 2),\n",
" ('april', 'system', 2),\n",
" ('april', 'time', 2),\n",
" ...]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"interesting_pairs = [(vocab[t1], vocab[t2], codoc_upper[t1, t2])\n",
" for t1, t2 in zip(*np.where(codoc_upper > 1))]\n",
"# sort by codoc freq. in desc. order\n",
"sorted(interesting_pairs, key=lambda x: x[2], reverse=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate sorted lists and datatables according to term frequency\n",
"\n",
"When working with DTMs, it's often helpful to rank terms per document according to their frequency. This is what [sorted_terms](api.rst#tmtoolkit.bow.bow_stats.sorted_terms) does for you. It further allows to specify the sorting order (the default is descending order via `ascending=False`) and several limits:\n",
"\n",
"- `lo_thresh` for the minimum term frequency\n",
"- `hi_thresh` for the maximum term frequency\n",
"- `top_n` for the maximum number of terms per document\n",
"\n",
"Let's display the top three tokens per document by frequency:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.093879Z",
"iopub.status.busy": "2022-03-11T08:31:08.092838Z",
"iopub.status.idle": "2022-03-11T08:31:08.100957Z",
"shell.execute_reply": "2022-03-11T08:31:08.100129Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[[('day', 3), ('nhs', 2), ('bbc', 2)],\n",
" [('car', 4), ('garda', 4), ('collision', 3)],\n",
" [('merkel', 14), ('refugee', 13), ('eu', 13)],\n",
" [('politic', 7), ('party', 6), ('farron', 5)],\n",
" [('medium', 23), ('candidate', 19), ('macron', 15)]]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import sorted_terms\n",
"\n",
"sorted_terms(mat, vocab, top_n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output is a list for each document (this means the output is aligned with the document labels `doc_labels`), with three pairs of `(token, frequency)` each. It's also possible to get this data as dataframe via [sorted_terms_table](api.rst#tmtoolkit.bow.bow_stats.sorted_terms_table), which gives a better overview and also includes labels for the documents. It accepts the same parameters for sorting and limitting the results:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.116818Z",
"iopub.status.busy": "2022-03-11T08:31:08.111903Z",
"iopub.status.idle": "2022-03-11T08:31:08.145474Z",
"shell.execute_reply": "2022-03-11T08:31:08.144250Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>token</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>doc</th>\n",
" <th>rank</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-119</th>\n",
" <th>1</th>\n",
" <td>day</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>nhs</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bbc</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-1206</th>\n",
" <th>1</th>\n",
" <td>car</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>garda</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>collision</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-2058</th>\n",
" <th>1</th>\n",
" <td>merkel</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>refugee</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>eu</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-3016</th>\n",
" <th>1</th>\n",
" <td>politic</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>party</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>farron</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-3665</th>\n",
" <th>1</th>\n",
" <td>medium</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>candidate</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>macron</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" token value\n",
"doc rank \n",
"NewsArticles-119 1 day 3\n",
" 2 nhs 2\n",
" 3 bbc 2\n",
"NewsArticles-1206 1 car 4\n",
" 2 garda 4\n",
" 3 collision 3\n",
"NewsArticles-2058 1 merkel 14\n",
" 2 refugee 13\n",
" 3 eu 13\n",
"NewsArticles-3016 1 politic 7\n",
" 2 party 6\n",
" 3 farron 5\n",
"NewsArticles-3665 1 medium 23\n",
" 2 candidate 19\n",
" 3 macron 15"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import sorted_terms_table\n",
"\n",
"sorted_terms_table(mat, vocab, labels, top_n=3)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.157108Z",
"iopub.status.busy": "2022-03-11T08:31:08.154438Z",
"iopub.status.idle": "2022-03-11T08:31:08.174738Z",
"shell.execute_reply": "2022-03-11T08:31:08.175466Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>token</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>doc</th>\n",
" <th>rank</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"7\" valign=\"top\">NewsArticles-2058</th>\n",
" <th>1</th>\n",
" <td>merkel</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>refugee</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>eu</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>germany</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>country</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>turkey</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>europe</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">NewsArticles-3016</th>\n",
" <th>1</th>\n",
" <td>politic</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>party</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"7\" valign=\"top\">NewsArticles-3665</th>\n",
" <th>1</th>\n",
" <td>medium</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>candidate</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>macron</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>france</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>election</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>le</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>coverage</td>\n",
" <td>6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" token value\n",
"doc rank \n",
"NewsArticles-2058 1 merkel 14\n",
" 2 refugee 13\n",
" 3 eu 13\n",
" 4 germany 8\n",
" 5 country 8\n",
" 6 turkey 6\n",
" 7 europe 6\n",
"NewsArticles-3016 1 politic 7\n",
" 2 party 6\n",
"NewsArticles-3665 1 medium 23\n",
" 2 candidate 19\n",
" 3 macron 15\n",
" 4 france 9\n",
" 5 election 9\n",
" 6 le 7\n",
" 7 coverage 6"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted_terms_table(mat, vocab, labels, lo_thresh=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Term frequency–inverse document frequency transformation (tf-idf)\n",
"\n",
"[Term frequency–inverse document frequency transformation (tf-idf)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a matrix transformation that is often applied to DTMs in order to reflect the importance of a token to a document. The `bow_stats` module provides the function [tfidf](api.rst#tmtoolkit.bow.bow_stats.tfidf) for this. When the input is a sparse matrix, and the calculation supports operating on sparce matrices, the output will also be a sparse matrix, which means that the tf-idf transformation is implemented in a very memory-efficient way.\n",
"\n",
"Let's apply tf-idf to our DTM using the default way:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.184711Z",
"iopub.status.busy": "2022-03-11T08:31:08.181775Z",
"iopub.status.idle": "2022-03-11T08:31:08.187019Z",
"shell.execute_reply": "2022-03-11T08:31:08.186549Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<5x516 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 576 stored elements in COOrdinate format>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import tfidf\n",
"\n",
"tfidf_mat = tfidf(mat)\n",
"tfidf_mat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the output is a sparse matrix. Let's have a look at its values:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.194266Z",
"iopub.status.busy": "2022-03-11T08:31:08.193729Z",
"iopub.status.idle": "2022-03-11T08:31:08.197257Z",
"shell.execute_reply": "2022-03-11T08:31:08.197861Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0. , 0. , 0. , ..., 0. , 0. , 0. ],\n",
" [0.03132, 0.03132, 0. , ..., 0. , 0. , 0.06264],\n",
" [0. , 0. , 0. , ..., 0. , 0.00759, 0. ],\n",
" [0. , 0. , 0.00798, ..., 0.00798, 0. , 0. ],\n",
" [0. , 0. , 0. , ..., 0. , 0. , 0. ]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfidf_mat.todense()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course we can also pass this matrix to `sorted_terms_table` and observe that some rankings have changed in comparison to the untransformed DTM:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.203557Z",
"iopub.status.busy": "2022-03-11T08:31:08.202459Z",
"iopub.status.idle": "2022-03-11T08:31:08.221501Z",
"shell.execute_reply": "2022-03-11T08:31:08.221917Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>token</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>doc</th>\n",
" <th>rank</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-119</th>\n",
" <th>1</th>\n",
" <td>day</td>\n",
" <td>0.077434</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>bbc</td>\n",
" <td>0.065935</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>victoria</td>\n",
" <td>0.065935</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-1206</th>\n",
" <th>1</th>\n",
" <td>car</td>\n",
" <td>0.125276</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>garda</td>\n",
" <td>0.125276</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>collision</td>\n",
" <td>0.093957</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-2058</th>\n",
" <th>1</th>\n",
" <td>merkel</td>\n",
" <td>0.053148</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>refugee</td>\n",
" <td>0.049351</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>eu</td>\n",
" <td>0.038639</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-3016</th>\n",
" <th>1</th>\n",
" <td>politic</td>\n",
" <td>0.055856</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>farron</td>\n",
" <td>0.039897</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>party</td>\n",
" <td>0.037484</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-3665</th>\n",
" <th>1</th>\n",
" <td>medium</td>\n",
" <td>0.081394</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>candidate</td>\n",
" <td>0.067239</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>macron</td>\n",
" <td>0.053083</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" token value\n",
"doc rank \n",
"NewsArticles-119 1 day 0.077434\n",
" 2 bbc 0.065935\n",
" 3 victoria 0.065935\n",
"NewsArticles-1206 1 car 0.125276\n",
" 2 garda 0.125276\n",
" 3 collision 0.093957\n",
"NewsArticles-2058 1 merkel 0.053148\n",
" 2 refugee 0.049351\n",
" 3 eu 0.038639\n",
"NewsArticles-3016 1 politic 0.055856\n",
" 2 farron 0.039897\n",
" 3 party 0.037484\n",
"NewsArticles-3665 1 medium 0.081394\n",
" 2 candidate 0.067239\n",
" 3 macron 0.053083"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted_terms_table(tfidf_mat, vocab, labels, top_n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tf-idf matrix is calculated from a DTM $D$ as $\\textit{tf}(D) \\cdot \\textit{idf}(D)$.\n",
"\n",
"\n",
"There are different variants for how to calculate the term frequency $\\textit{tf}(D)$ and the inverse document frequency $\\textit{idf(D)}$. The package tmtoolkit contains several functions that implement some of these variants. For $\\text{tf()}$ these are:\n",
"\n",
"- [tf_binary](api.rst#tmtoolkit.bow.bow_stats.tf_binary): binary term frequency matrix (matrix contains 1 whenever a term occurred in a document, else 0)\n",
"- [tf_proportions](api.rst#tmtoolkit.bow.bow_stats.tf_proportions): proportional term frequency matrix (term counts are normalized by document length)\n",
"- [tf_log](api.rst#tmtoolkit.bow.bow_stats.tf_log): log-normalized term frequency matrix (by default $\\log(1 + D)$)\n",
"- [tf_double_norm](api.rst#tmtoolkit.bow.bow_stats.tf_double_norm): double-normalized term frequency matrix\n",
" $K + (1-K) \\cdot \\frac{D}{\\textit{rowmax}(D)}$, where $\\textit{rowmax}(D)$ is a vector containing the maximum term count per document\n",
"\n",
"As you can see, all the term frequency functions are prefixed with a `tf_`. There are also two variants for $\\textit{idf()}$:\n",
"\n",
"- [idf](api.rst#tmtoolkit.bow.bow_stats.idf): calculates $\\log(\\frac{a + N}{b + \\textit{df}(D)})$ where $a$ and $b$ are smoothing constants, $N$ is the number of documents and $\\textit{df}(D)$ calculates the [document frequency](#Document-lengths,-document-and-term-frequencies,-token-co-occurrences)\n",
"- [idf_probabilistic](api.rst#tmtoolkit.bow.bow_stats.idf_probabilistic): calculates $\\log(a + \\frac{N - \\textit{df}(D)}{\\textit{df}(D)})$\n",
"\n",
"The term frequency functions always return a sparse matrix if possible and if the input is sparse. Let's try out two term frequency functions:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.228886Z",
"iopub.status.busy": "2022-03-11T08:31:08.228077Z",
"iopub.status.idle": "2022-03-11T08:31:08.231334Z",
"shell.execute_reply": "2022-03-11T08:31:08.232035Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0, 0, 0, ..., 0, 0, 0],\n",
" [1, 1, 0, ..., 0, 0, 1],\n",
" [0, 0, 0, ..., 0, 1, 0],\n",
" [0, 0, 1, ..., 1, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 0]])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import tf_binary, tf_proportions\n",
"\n",
"tf_binary(mat).todense()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.240833Z",
"iopub.status.busy": "2022-03-11T08:31:08.239939Z",
"iopub.status.idle": "2022-03-11T08:31:08.244072Z",
"shell.execute_reply": "2022-03-11T08:31:08.243310Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0. , 0. , 0. , ..., 0. , 0. , 0. ],\n",
" [0.025 , 0.025 , 0. , ..., 0. , 0. , 0.05 ],\n",
" [0. , 0. , 0. , ..., 0. , 0.00606, 0. ],\n",
" [0. , 0. , 0.00637, ..., 0.00637, 0. , 0. ],\n",
" [0. , 0. , 0. , ..., 0. , 0. , 0. ]])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf_proportions(mat).todense()"
]
},
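  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two remaining term frequency variants listed above, `tf_log` and `tf_double_norm`, work the same way (`tf_double_norm` is also used together with `tfidf` further below). As a small sketch that is not executed here, they could be applied to the DTM like this:\n",
    "\n",
    "```python\n",
    "from tmtoolkit.bow.bow_stats import tf_log, tf_double_norm\n",
    "\n",
    "tf_log(mat)                  # log-normalized term frequencies, by default log(1 + D)\n",
    "tf_double_norm(mat, K=0.5)   # double-normalized term frequencies with K set to 0.5\n",
    "```"
   ]
  },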
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like the [document frequency](#Document-lengths,-document-and-term-frequencies,-token-co-occurrences) function `doc_frequencies`, the inverse document frequency functions also return a vector with the same length as the vocabulary. Let's use these functions and have a look at the inverse document frequency of certain tokens:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.254232Z",
"iopub.status.busy": "2022-03-11T08:31:08.253371Z",
"iopub.status.idle": "2022-03-11T08:31:08.257431Z",
"shell.execute_reply": "2022-03-11T08:31:08.256687Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('110pm', 1.252762968495368),\n",
" ('70', 1.252762968495368),\n",
" ('abuse', 1.252762968495368),\n",
" ('access', 0.9808292530117262),\n",
" ('accession', 1.252762968495368),\n",
" ('accusation', 1.252762968495368),\n",
" ('act', 1.252762968495368),\n",
" ('addition', 1.252762968495368),\n",
" ('address', 1.252762968495368),\n",
" ('administration', 1.252762968495368)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import idf, idf_probabilistic\n",
"\n",
"idf_vec = idf(mat)\n",
"list(zip(vocab, idf_vec))[:10]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.262323Z",
"iopub.status.busy": "2022-03-11T08:31:08.261799Z",
"iopub.status.idle": "2022-03-11T08:31:08.268979Z",
"shell.execute_reply": "2022-03-11T08:31:08.268178Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('110pm', 1.6094379124341003),\n",
" ('70', 1.6094379124341003),\n",
" ('abuse', 1.6094379124341003),\n",
" ('access', 0.916290731874155),\n",
" ('accession', 1.6094379124341003),\n",
" ('accusation', 1.6094379124341003),\n",
" ('act', 1.6094379124341003),\n",
" ('addition', 1.6094379124341003),\n",
" ('address', 1.6094379124341003),\n",
" ('administration', 1.6094379124341003)]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"probidf_vec = idf_probabilistic(mat)\n",
"\n",
"list(zip(vocab, probidf_vec))[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that due to our very small sample, there's not much variation in the inverse document frequency values.\n",
"\n",
"By default, [tfidf](api.rst#tmtoolkit.bow.bow_stats.tfidf) uses `tf_proportions` and `idf` to calculate the tf-idf matrix. You can plug in other functions to get other variants of tf-idf:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.276735Z",
"iopub.status.busy": "2022-03-11T08:31:08.275969Z",
"iopub.status.idle": "2022-03-11T08:31:08.278738Z",
"shell.execute_reply": "2022-03-11T08:31:08.279357Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.40236, 0.40236, 0.40236, ..., 0.40236, 0.40236, 0.40236],\n",
" [0.70413, 0.70413, 0.40236, ..., 0.40236, 0.40236, 1.0059 ],\n",
" [0.40236, 0.40236, 0.40236, ..., 0.40236, 0.5748 , 0.40236],\n",
" [0.40236, 0.40236, 0.5748 , ..., 0.5748 , 0.40236, 0.40236],\n",
" [0.40236, 0.40236, 0.40236, ..., 0.40236, 0.40236, 0.40236]])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.bow.bow_stats import tf_double_norm\n",
"\n",
"# we also set a \"K\" parameter for \"tf_double_norm\"\n",
"tfidf_mat2 = tfidf(mat, tf_func=tf_double_norm,\n",
" idf_func=idf_probabilistic, K=0.25)\n",
"tfidf_mat2"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:08.289639Z",
"iopub.status.busy": "2022-03-11T08:31:08.284300Z",
"iopub.status.idle": "2022-03-11T08:31:08.307249Z",
"shell.execute_reply": "2022-03-11T08:31:08.308053Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>token</th>\n",
" <th>value</th>\n",
" </tr>\n",
" <tr>\n",
" <th>doc</th>\n",
" <th>rank</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-119</th>\n",
" <th>1</th>\n",
" <td>bbc</td>\n",
" <td>1.207078</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>nhs</td>\n",
" <td>1.207078</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>victoria</td>\n",
" <td>1.207078</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-1206</th>\n",
" <th>1</th>\n",
" <td>car</td>\n",
" <td>1.609438</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>garda</td>\n",
" <td>1.609438</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>collision</td>\n",
" <td>1.307668</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-2058</th>\n",
" <th>1</th>\n",
" <td>merkel</td>\n",
" <td>1.609438</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>refugee</td>\n",
" <td>1.523218</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>germany</td>\n",
" <td>1.092119</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-3016</th>\n",
" <th>1</th>\n",
" <td>politic</td>\n",
" <td>1.609438</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>farron</td>\n",
" <td>1.264558</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>putin</td>\n",
" <td>1.092119</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">NewsArticles-3665</th>\n",
" <th>1</th>\n",
" <td>medium</td>\n",
" <td>1.609438</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>candidate</td>\n",
" <td>1.399511</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>macron</td>\n",
" <td>1.189585</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" token value\n",
"doc rank \n",
"NewsArticles-119 1 bbc 1.207078\n",
" 2 nhs 1.207078\n",
" 3 victoria 1.207078\n",
"NewsArticles-1206 1 car 1.609438\n",
" 2 garda 1.609438\n",
" 3 collision 1.307668\n",
"NewsArticles-2058 1 merkel 1.609438\n",
" 2 refugee 1.523218\n",
" 3 germany 1.092119\n",
"NewsArticles-3016 1 politic 1.609438\n",
" 2 farron 1.264558\n",
" 3 putin 1.092119\n",
"NewsArticles-3665 1 medium 1.609438\n",
" 2 candidate 1.399511\n",
" 3 macron 1.189585"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted_terms_table(tfidf_mat2, vocab, labels, top_n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"Once we have generated a DTM, we can use it for topic modeling. The [next chapter](topic_modeling.ipynb) will show how tmtoolkit can be used to evaluate the quality of your model, export essential information from it and visualize the results."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: doc/source/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
from datetime import date
import sphinx_rtd_theme
sys.path.insert(0, os.path.abspath('../..'))
# -- Project information -----------------------------------------------------
project = 'tmtoolkit'
copyright = f'{date.today().year}, Markus Konrad'
author = 'Markus Konrad'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'nbsphinx',
'sphinx.ext.autodoc',
'sphinx_rtd_theme'
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['**.ipynb_checkpoints']
# If true, '()' will be appended to :func: etc. cross-reference text.
add_function_parentheses = False
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
add_module_names = True
# type hints
autodoc_typehints = 'description'
autodoc_typehints_format = 'short'
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_theme = "sphinx_rtd_theme"
# html_static_path = ['static']
# Output file base name for HTML help builder.
htmlhelp_basename = '%sdoc' % project
# Never skip __init__
def skip(app, what, name, obj, would_skip, options):
if name == "__init__":
return False
return would_skip
def setup(app):
app.connect("autodoc-skip-member", skip)
================================================
FILE: doc/source/data/corpus_example/sample1.txt
================================================
This is the first example file. ☺ We showcase NER by just randomly listing famous people like Missy Elliott or George Harrison.
================================================
FILE: doc/source/data/corpus_example/sample2.txt
================================================
Here comes the second example (with HTML <i>tags</i> & entities).
This one contains three lines of plain text which means two paragraphs.
================================================
FILE: doc/source/data/corpus_example/sample3.txt
================================================
And here we go with the third and final example file.
Another line of text.
§2.
This is the second paragraph.
The third and final paragraph.
================================================
FILE: doc/source/data/tm_wordclouds/.gitignore
================================================
# Ignore everything in this directory
*
# Except this file
!.gitignore
================================================
FILE: doc/source/development.rst
================================================
.. _development:
Development
===========
This part of the documentation serves as developer documentation, i.e. it is meant to help those who want to contribute to the development of the package.
Project overview
----------------
This project aims to provide a Python package that allows text processing, text mining and topic modeling with
- easy installation,
- extensive documentation,
- clear functional programming interface,
- good performance on large datasets.
All computations need to be performed in memory. Streaming data from disk is not supported so far.
The package is written in Python and uses other packages for key tasks:
- `SpaCy <https://spacy.io/>`_ is used for the text processing and text mining tasks
- `lda <http://pythonhosted.org/lda/>`_, `gensim <https://radimrehurek.com/gensim/>`_ or `scikit-learn <http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html>`_ are used for computing topic models
The project's packages are published to the `Python Package Index PyPI <https://pypi.org/project/tmtoolkit/>`_.
The package's dependencies are only installed on demand. There's a setup routine that provides an interface for easy installation of SpaCy's language models.
Text processing and normalization is often used to construct a Bag-of-Words (BoW) model which in turn is the input for topic models.
Contributing to tmtoolkit
-------------------------
If you want to contribute to tmtoolkit, you can create code or documentation patches (updates) and submit them as `pull requests <https://github.com/WZBSocialScienceCenter/tmtoolkit/pulls>`_ on GitHub. The first thing to do for this is to fork the `GitHub repository <https://github.com/WZBSocialScienceCenter/tmtoolkit>`_ and to clone it on your local machine. It's best to create a separate branch for your updates next. You should then set up your local machine for development as follows:
- create a `Python virtual environment <https://docs.python.org/3/tutorial/venv.html>`_ – make sure that the Python version you're using for this is supported by tmtoolkit
- update pip via ``pip install -U pip``
- if you're planning to contribute to the code or to the tutorials in the documentation:
- install *all* dependencies via ``pip install -r requirements.txt``
- run the tmtoolkit setup routine via ``python -m tmtoolkit setup all`` to install the required language models
- check that everything works by running all tests via ``pytest tests/``
- if you're *only* planning to contribute to the documentation (without the tutorials which are Jupyter Notebooks):
- install dependencies for documentation via ``pip install -r requirements_doc.txt``
You can then start working on the code or documentation. Make sure to run the tests and/or create new tests when you provide code updates in your pull request. You should also read this developer documentation completely before diving into the code.
Folder structure
----------------
The project's root folder contains files for documentation generation (``.readthedocs.yaml``), testing (``conftest.py``, ``coverage.svg``, ``tox.ini``) as well as project management and package building (``Makefile``, ``MANIFEST.in``, ``setup.py``). The subfolders include:
- ``.github/workflows``: provides Continuous Integration (CI) configuration for *GitHub Actions*,
- ``doc``: documentation source and built documentation files,
- ``examples``: example scripts and data to show some of the features (most features are better explained in the tutorial which is part of the documentation),
- ``scripts``: scripts used for preparing datasets that come along with the package,
- ``tests``: test suite,
- ``tmtoolkit``: package source code.
Packaging and dependency management
-----------------------------------
This package uses `setuptools <https://setuptools.pypa.io/en/latest/index.html>`_ for packaging. All package metadata and dependencies are defined in ``setup.py``. Since tmtoolkit allows installing dependencies on demand, there are several installation options defined in ``setup.py``. For development, the most important are:
- ``[dev]``: installs packages for development and packaging
- ``[test]``: installs packages for testing tmtoolkit
- ``[doc]``: installs packages for generating the documentation
- ``[all]``: installs all required and optional packages – recommended for development
The ``requirements.txt`` and ``requirements_doc.txt`` files simply point to the ``[all]`` and ``[doc]`` installation options.
The ``Makefile`` in the root folder contains targets for generating a Python *Wheel* package (``make wheel``) and a Python source distribution package (``make sdist``).
Built-in datasets
-----------------
All built-in datasets reside in ``tmtoolkit/data/<LANGUAGE_CODE>``, where ``LANGUAGE_CODE`` is an ISO language code. For the `ParlSpeech V2 <https://doi.org/10.7910/DVN/L4OAKN>`_ datasets, the samples are generated via the R script ``scripts/prepare_corpora.R``. The `News Articles <https://doi.org/10.7910/DVN/GMFCTR>`_ dataset is used without further processing.
Automated testing
-----------------
The tmtoolkit package relies on the following packages for testing:
- `pytest <https://pytest.org/>`_ as testing framework,
- `hypothesis <https://hypothesis.readthedocs.io/>`_ for property-based testing,
- `coverage <https://coverage.readthedocs.io/>`_ for measuring test coverage of the code,
- `tox <https://tox.wiki/>`_ for checking packaging and running tests in different virtual environments.
All tests are implemented in the ``tests`` directory and prefixed by ``test_``. The ``conftest.py`` file contains project-wide test configuration. The ``tox.ini`` file contains configuration for setting up the virtual environments for tox. For each release, tmtoolkit aims to support the last three major Python release versions, e.g. 3.8, 3.9 and 3.10, and all of these are tested with tox along with different dependency configurations from *minimal* to *full*. To use different versions of Python on the same system, it's recommended to use the `deadsnakes repository <https://launchpad.net/~deadsnakes/+archive/ubuntu/ppa>`_ on Ubuntu or Debian Linux.
The ``Makefile`` in the root folder contains a target for generating coverage reports and the coverage badge (``make cov_tests``).
Documentation
-------------
The `Sphinx <https://www.sphinx-doc.org/>`_ package is used for documentation. All objects exposed by the API are documented in the Sphinx format. All other parts of the documentation reside in ``doc/source``. The configuration for Sphinx lies in ``doc/source/conf.py``. The `nbsphinx <https://nbsphinx.readthedocs.io/>`_ package is used for generating the tutorial from Jupyter Notebooks which are also located in ``doc/source``.
The ``Makefile`` in the ``doc`` folder has several targets for generating the documentation. These are:
- ``make notebooks`` – run all notebooks to generate their outputs; these are stored in-place
- ``make clean`` – remove everything under ``doc/build``
- ``make html`` – generate the HTML documentation from the documentation source
The generated documentation then resides under ``doc/build``.
The documentation is published at `tmtoolkit.readthedocs.io <https://tmtoolkit.readthedocs.io/en/latest/>`_. For this, new commits to the master branch of the GitHub project or new tags are automatically built by `readthedocs.org <https://readthedocs.org/>`_. The ``.readthedocs.yaml`` file in the root folder sets up the build process for readthedocs.org.
Continuous integration
----------------------
Continuous integration routines are defined via `GitHub Actions (GA) <https://docs.github.com/en/actions>`_. For tmtoolkit, this so far only means automatic testing for new commits and releases on different machine configurations.
The GA setup for the tests is done in ``.github/workflows/runtests.yml``. There are "minimal" and "full" test suites for Ubuntu, MacOS and Windows with Python versions 3.8, 3.9 and 3.10 each, which means 18 jobs are spawned. Again, tox is used for running the tests on these machines.
Release management
------------------
Publishing a new release for tmtoolkit involves several steps, listed below. You may consider creating a `pre-release <https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/#pre-release-versioning>`_ for PyPI first before publishing a final release.
1. Preparation:
- create a new branch for the release version X.Y.Z as ``releaseX.Y.Z``
- check if there are new minimum version requirements for dependencies or generally new dependencies to be added in ``setup.py``
- check if the compatible Python versions should be updated in ``setup.py``
- set the new version in ``setup.py`` and ``tmtoolkit/__init__.py``
2. Documentation updates:
- check and possibly update the tutorials – do all code examples still work and are all important features covered?
- update documentation
- update README
- update changelog (``doc/source/version_history.rst``)
3. Testing:
- run examples and check if they work
- run tests locally via tox
- push to GitHub repository ``develop`` or ``release*`` branch to run tests via GitHub Actions
4. Publish package to PyPI:
- build source distribution via ``make sdist``
- build wheel via ``make wheel``
- check both via ``twine check dist/...``
- if checks passed, upload both to PyPI via ``twine upload dist/...``
5. Finalization
- make a new tag for the new version via ``git tag -a vX.Y.Z -m "version X.Y.Z"``
- push the new tag to the GitHub repository
- create a new release from the tag in the GitHub repository
- merge the development or release branch with the master branch and push the master branch to the GitHub repository
- log in to `readthedocs.org <https://readthedocs.org/>`_, go to the project page, activate the current version, let it build the documentation
- verify documentation on `tmtoolkit.readthedocs.io <https://tmtoolkit.readthedocs.io/en/latest/>`_
If you notice a (major) mistake in a release *after* publication, you have several options like yanking the release on PyPI, publishing a post-release or updating the build number of the wheel. See `this blog post <https://snarky.ca/what-to-do-when-you-botch-a-release-on-pypi/>`_ for more information about these options.
API style
---------
The tmtoolkit package provides a *functional API*. This is quite different from object-oriented APIs that are found in many other Python packages, where a programmer mainly uses classes and their methods that are exposed by an API. The tmtoolkit API on the other hand mainly exposes data structures and functions that operate on these data structures. In tmtoolkit, Python classes are usually used to implement more complex data structures such as documents or document corpora, but these classes don't provide (public) methods. Rather, they are used as function arguments, for example as in the large set of *corpus functions* that operate on text corpora as explained below.
Implementation details
----------------------
Top-level module and setup routine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``__main__.py`` file provides a command-line interface for the package. Its only purpose is to allow easy installation of SpaCy language models via the :ref:`setup routine <setup>`. The ``tokenseq`` module provides functions that operate on single (string) tokens or sequences of tokens. These functions are mainly used internally in the ``corpus`` module, but they are also exposed by the API for use by package users. The ``utils.py`` module provides helper functions that are used internally throughout the package, but may also be used by package users.
``bow`` module
^^^^^^^^^^^^^^
This module provides functions for generating document-term-matrices (DTMs), which are central to the BoW concept, and some common statistics used for these matrices.
``corpus`` module
^^^^^^^^^^^^^^^^^
This is the central module for text processing and text mining.
At the core of this module, there is the :class:`~tmtoolkit.corpus.Corpus` class implemented in ``corpus/_corpus.py``. It takes documents with raw text as input (i.e. a dict mapping *document labels* to text strings) and applies a SpaCy NLP pipeline to it. After that, the corpus consists of :class:`~tmtoolkit.corpus.Document` (implemented in ``corpus/_document.py``) objects which contain the textual data in tokenized form, i.e. as a sequence of *tokens* (roughly translated as "words" but other text contents such as numbers and punctuation also form separate tokens). Each token comes along with several *token attributes* which were estimated using the NLP pipeline. Examples for token attributes include the Part-of-Speech tag or the lemma.
The :class:`~tmtoolkit.corpus.Document` class stores the tokens and their "standard" attributes in a *token matrix*. This matrix is of shape *(N, M)* for *N* tokens and with *M* attributes. There are at least 2 or 3 attributes: ``whitespace`` (boolean – is there a whitespace after the token?), ``token`` (the actual token, i.e. "word" type) and optionally ``sent_start`` (only given when sentence information is parsed in the NLP pipeline).
The token matrix is a *uint64* matrix as it stores all information as *64 bit hash values*. Compared to sequences of strings, this reduces memory usage and allows faster computations and data modifications. E.g., when you transform a token (let's say "Hello" to "hello"), you only do one transformation, calculate one new hash value and replace each occurrence of the old hash with the new hash. The hashes are calculated with SpaCy's `hash_string <https://spacy.io/api/stringstore#hash_string>`_ function. For fast conversion between token/attribute hashes and strings, the mappings are stored in a *bidirectional dictionary* using the `bidict <https://pypi.org/project/bidict/>`_ package. Each column, i.e. each attribute, in the token matrix has a separate bidict in the ``bimaps`` dictionary that is shared between a corpus and each Document object. Using bidict proved to be *much* faster than using SpaCy's built-in `Vocab / StringStore <https://spacy.io/api/stringstore>`_.
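The following minimal sketch illustrates this hash/string mapping outside of tmtoolkit's internals; the variable names are purely illustrative and not part of the package's API::

    from bidict import bidict
    from spacy.strings import hash_string

    # one bidict per token matrix column (i.e. per attribute), mapping hashes to strings
    bimap = bidict({hash_string(t): t for t in ["Hello", "world"]})

    old_hash, new_hash = hash_string("Hello"), hash_string("hello")
    bimap[old_hash]          # -> "Hello"
    bimap.inverse["world"]   # -> hash of "world"

    # transforming "Hello" to "hello" only requires computing one new hash value; the old
    # hash is replaced by the new one in the token matrix and the mapping is updated:
    del bimap[old_hash]
    bimap[new_hash] = "hello"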
Besides "standard" token attributes that come from the SpaCy NLP pipeline, a user may also add custom token attributes. These are stored in each document's :attr:`~tmtoolkit.corpus.Document.custom_token_attrs` dictionary that map a attribute name to a NumPy array. These arrays are of arbitrary type and don't use the hashing approach. Besides token attributes, there are also *document attributes*. These are attributes attached to each document, for example the *document label* (unique document identifier). Custom document attributes can be added, e.g. to record the publication year of a document. Document attributes can also be of any type and are not hashed.
The :class:`~tmtoolkit.corpus.Corpus` class implements a data structure for text corpora with named documents. All these documents are stored in the corpus as :class:`~tmtoolkit.corpus.Document` objects. *Corpus functions* operate on Corpus objects. They are implemented in ``corpus/_corpusfuncs.py``. All corpus functions that transform/modify a corpus have an ``inplace`` argument, by default set to ``True``. If ``inplace`` is ``True``, the input corpus is modified directly in-place. If ``inplace`` is ``False``, a copy of the input corpus is created and all modifications are applied to this copy; the original input corpus is not altered in that case. The ``corpus_func_inplace_opt`` decorator is used to mark corpus functions with the in-place option.
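For example, this is roughly how the in-place option behaves from a user's perspective (a sketch only; ``to_lowercase`` stands in for any corpus function with the in-place option)::

    from tmtoolkit.corpus import Corpus, to_lowercase

    corp = Corpus({'d1': 'Hello World!'}, language='en')

    to_lowercase(corp)                          # modifies corp in-place (inplace=True is the default)
    corp2 = to_lowercase(corp, inplace=False)   # leaves corp untouched, returns a modified copy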
The :class:`~tmtoolkit.corpus.Corpus` class provides parallel processing capabilities for processing large amounts of data. This can be controlled with the ``max_workers`` argument. Parallel processing is then enabled at two stages: First, it is simply enabled for the SpaCy NLP pipeline by setting up the pipeline accordingly. Second, a *reusable process pool executor* is created by means of `loky <https://github.com/joblib/loky/>`_. This process pool is then used in corpus functions whenever parallel execution is beneficial over serial execution. The ``parallelexec`` decorator is used to mark (inner) functions for parallel execution.
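For instance, a corpus with parallel processing enabled could be created like this (a sketch; ``docs`` is assumed to be a dict mapping document labels to raw text strings)::

    corp = Corpus(docs, language='en', max_workers=4)   # use up to four worker processes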
``topicmod`` module
^^^^^^^^^^^^^^^^^^^
This is the central module for computing, evaluating and analyzing topic models.
``topicmod/evaluate.py`` mainly implements several evaluation metrics for topic models. Topic models can be computed and evaluated in parallel; the base code for this is in ``topicmod/parallel.py``. Three modules use the base classes from ``topicmod/parallel.py`` to implement interfaces to popular topic modeling packages:
- ``topicmod/tm_gensim.py`` for `gensim <https://radimrehurek.com/gensim/>`_
- ``topicmod/tm_lda.py`` for `lda <http://pythonhosted.org/lda/>`_
- ``topicmod/tm_sklearn.py`` for `scikit-learn <http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html>`_
================================================
FILE: doc/source/getting_started.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started\n",
"\n",
"This is only quick overview for getting started. Corpus loading, text preprocessing, etc. are explained in depth in the respective chapters.\n",
"\n",
"## Loading a built-in text corpus\n",
"\n",
"Once you have installed tmtoolkit, you can start by loading a built-in dataset. Note that you must have installed tmtoolkit with the ``[recommended]`` or ``[textproc]`` option for this to work. See the [installation instructions](install.rst) for details.\n",
"\n",
"Let's import the [builtin_corpora_info](api.rst#tmtoolkit.corpus.builtin_corpora_info) function first and have a look which datasets are available:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:11.428539Z",
"iopub.status.busy": "2022-03-11T08:31:11.427097Z",
"iopub.status.idle": "2022-03-11T08:31:13.868641Z",
"shell.execute_reply": "2022-03-11T08:31:13.868205Z"
},
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"['de-parlspeech-v2-sample-bundestag',\n",
" 'en-News100',\n",
" 'en-NewsArticles',\n",
" 'en-parlspeech-v2-sample-houseofcommons',\n",
" 'es-parlspeech-v2-sample-congreso',\n",
" 'nl-parlspeech-v2-sample-tweedekamer']"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import builtin_corpora_info\n",
"\n",
"builtin_corpora_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load one of these corpora, a sample of 100 articles from the [News Articles dataset from Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GMFCTR). For this, we import the [Corpus](api.rst#tmtoolkit.corpus.Corpus) class and use [Corpus.from_builtin_corpus](api.rst#tmtoolkit.corpus.Corpus.from_builtin_corpus). The raw text data will then be processed by an [NLP pipeline](https://spacy.io/usage/spacy-101#pipelines) with [SpaCy](https://spacy.io). That is, it will be tokenized and analyzed for the grammatical structure of each sentence and the linguistic attributes of each token, among other things. Since this step is computationally intense, it takes quite some time for large text corpora (it can be sped up by enabling parallel processing as explained later)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:13.873602Z",
"iopub.status.busy": "2022-03-11T08:31:13.873034Z",
"iopub.status.idle": "2022-03-11T08:31:27.438738Z",
"shell.execute_reply": "2022-03-11T08:31:27.438313Z"
},
"pycharm": {
"is_executing": false
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<Corpus [100 documents / language \"en\"]>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import Corpus\n",
"\n",
"corp = Corpus.from_builtin_corpus('en-News100')\n",
"corp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can have a look which documents were loaded (showing only the first ten document labels):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.444345Z",
"iopub.status.busy": "2022-03-11T08:31:27.443859Z",
"iopub.status.idle": "2022-03-11T08:31:27.447634Z",
"shell.execute_reply": "2022-03-11T08:31:27.448004Z"
},
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"['News100-2338',\n",
" 'News100-3228',\n",
" 'News100-1253',\n",
" 'News100-1615',\n",
" 'News100-3334',\n",
" 'News100-92',\n",
" 'News100-869',\n",
" 'News100-3092',\n",
" 'News100-3088',\n",
" 'News100-1173']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corp.doc_labels[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accessing documents and document tokens\n",
"\n",
"We can now access each document in this corpus via its document label:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.454699Z",
"iopub.status.busy": "2022-03-11T08:31:27.453877Z",
"iopub.status.idle": "2022-03-11T08:31:27.456920Z",
"shell.execute_reply": "2022-03-11T08:31:27.457598Z"
},
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"Document \"News100-2338\" (680 tokens, 9 token attributes, 2 document attributes)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corp['News100-2338']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By accessing the corpus in this way, we get a [Document](api.rst#tmtoolkit.corpus.Document) object. We can query a document for its contents again using the square brackets syntax. Here, we access its tokens and show only the first ten:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.464307Z",
"iopub.status.busy": "2022-03-11T08:31:27.463502Z",
"iopub.status.idle": "2022-03-11T08:31:27.466655Z",
"shell.execute_reply": "2022-03-11T08:31:27.467342Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[\"'\",\n",
" 'This',\n",
" 'Is',\n",
" 'Us',\n",
" \"'\",\n",
" 'Makes',\n",
" 'Surprising',\n",
" 'Reveal',\n",
" 'About',\n",
" 'Jack']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corp['News100-2338']['token'][:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the time, you won't need to access the `Document` objects of a corpus directly. You can rather use functions that provide a convenient interface to a corpus' contents, e.g. the [doc_tokens](api.rst#tmtoolkit.corpus.doc_tokens) function which allows to retrieve all documents' tokens along with additional token attributes like Part-of-Speech (POS) tags, token lemma, etc.\n",
"\n",
"Let's first import `doc_tokens` and then list the first ten tokens of the documents \"News100-2338\" and \"News100-3228\":"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.514038Z",
"iopub.status.busy": "2022-03-11T08:31:27.490838Z",
"iopub.status.idle": "2022-03-11T08:31:27.516094Z",
"shell.execute_reply": "2022-03-11T08:31:27.516757Z"
}
},
"outputs": [],
"source": [
"from tmtoolkit.corpus import doc_tokens\n",
"\n",
"tokens = doc_tokens(corp)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.522649Z",
"iopub.status.busy": "2022-03-11T08:31:27.521855Z",
"iopub.status.idle": "2022-03-11T08:31:27.524583Z",
"shell.execute_reply": "2022-03-11T08:31:27.524981Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[\"'\",\n",
" 'This',\n",
" 'Is',\n",
" 'Us',\n",
" \"'\",\n",
" 'Makes',\n",
" 'Surprising',\n",
" 'Reveal',\n",
" 'About',\n",
" 'Jack']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokens['News100-2338'][:10]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.530930Z",
"iopub.status.busy": "2022-03-11T08:31:27.530117Z",
"iopub.status.idle": "2022-03-11T08:31:27.533513Z",
"shell.execute_reply": "2022-03-11T08:31:27.534198Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['Neil',\n",
" 'Gorsuch',\n",
" 'facing',\n",
" \"'\",\n",
" 'rigorous',\n",
" \"'\",\n",
" 'confirmation',\n",
" 'hearing',\n",
" 'this',\n",
" 'week']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokens['News100-3228'][:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can retrieve more information than just the tokens. Let's also get the POS tags via `with_attr='pos'` and enable structuring the results according to the sentences in the document via `sentences=True`:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.557612Z",
"iopub.status.busy": "2022-03-11T08:31:27.547961Z",
"iopub.status.idle": "2022-03-11T08:31:27.646644Z",
"shell.execute_reply": "2022-03-11T08:31:27.647236Z"
}
},
"outputs": [],
"source": [
"tokens = doc_tokens(corp, sentences=True, with_attr='pos')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each document, we now have a dictionary with two entries, \"token\" and \"pos\":"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.653137Z",
"iopub.status.busy": "2022-03-11T08:31:27.652327Z",
"iopub.status.idle": "2022-03-11T08:31:27.655359Z",
"shell.execute_reply": "2022-03-11T08:31:27.656021Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['token', 'pos'])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokens['News100-2338'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Within these dictionary entries, the tokens and the POS tags are contained inside a list of sentences. So for example to get the POS tags for each token in the fourth sentence (i.e. index 3), we can write:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.660989Z",
"iopub.status.busy": "2022-03-11T08:31:27.660349Z",
"iopub.status.idle": "2022-03-11T08:31:27.664014Z",
"shell.execute_reply": "2022-03-11T08:31:27.663575Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['DET',\n",
" 'NOUN',\n",
" 'VERB',\n",
" 'ADP',\n",
" 'ADP',\n",
" 'DET',\n",
" 'ADJ',\n",
" 'PROPN',\n",
" 'VERB',\n",
" 'ADP',\n",
" 'PROPN',\n",
" 'PART',\n",
" 'PUNCT',\n",
" 'PROPN',\n",
" 'PROPN',\n",
" 'PUNCT',\n",
" 'VERB',\n",
" 'ADP',\n",
" 'VERB',\n",
" 'ADP',\n",
" ...]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# index 3 is the fourth sentence, since indices start with 0\n",
"tokens['News100-2338']['pos'][3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could for example combine the tokens and their POS tags by using `zip`. Here we do that for the first five tokens in the fourth sentence:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.670332Z",
"iopub.status.busy": "2022-03-11T08:31:27.669849Z",
"iopub.status.idle": "2022-03-11T08:31:27.675261Z",
"shell.execute_reply": "2022-03-11T08:31:27.675913Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('The', 'DET'),\n",
" ('episode', 'NOUN'),\n",
" ('started', 'VERB'),\n",
" ('off', 'ADP'),\n",
" ('with', 'ADP')]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(zip(tokens['News100-2338']['token'][3][:5],\n",
" tokens['News100-2338']['pos'][3][:5]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get an overview about the contents of a corpus, it's often more useful to get it in a tabular format. The tmtoolkit package provides a function to generate a [pandas DataFrame](https://pandas.pydata.org/) from a corpus, [tokens_table](api.rst#tmtoolkit.corpus.tokens_table).\n",
"\n",
"We'll use that now and instruct it to also return the sentence index of each token via `sentences=True`:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:27.702561Z",
"iopub.status.busy": "2022-03-11T08:31:27.698198Z",
"iopub.status.idle": "2022-03-11T08:31:28.165449Z",
"shell.execute_reply": "2022-03-11T08:31:28.164889Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>sent</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>is_punct</th>\n",
" <th>is_stop</th>\n",
" <th>lemma</th>\n",
" <th>like_num</th>\n",
" <th>pos</th>\n",
" <th>tag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>News100-1026</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Kremlin</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Kremlin</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>News100-1026</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>gives</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>give</td>\n",
" <td>False</td>\n",
" <td>VERB</td>\n",
" <td>VBZ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>News100-1026</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>no</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>no</td>\n",
" <td>False</td>\n",
" <td>DET</td>\n",
" <td>DT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>News100-1026</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>comment</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>comment</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>News100-1026</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>on</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>on</td>\n",
" <td>False</td>\n",
" <td>ADP</td>\n",
" <td>IN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" doc sent position token is_punct is_stop lemma \\\n",
"0 News100-1026 0 0 Kremlin False False Kremlin \n",
"1 News100-1026 0 1 gives False False give \n",
"2 News100-1026 0 2 no False True no \n",
"3 News100-1026 0 3 comment False False comment \n",
"4 News100-1026 0 4 on False True on \n",
"\n",
" like_num pos tag \n",
"0 False PROPN NNP \n",
"1 False VERB VBZ \n",
"2 False DET DT \n",
"3 False NOUN NN \n",
"4 False ADP IN "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import tokens_table\n",
"\n",
"toktbl = tokens_table(corp, sentences=True)\n",
"toktbl.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using subsetting, we can for example select the fourth sentence in the \"News100-2338\" document:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:28.180703Z",
"iopub.status.busy": "2022-03-11T08:31:28.180121Z",
"iopub.status.idle": "2022-03-11T08:31:28.195724Z",
"shell.execute_reply": "2022-03-11T08:31:28.196322Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>sent</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>is_punct</th>\n",
" <th>is_stop</th>\n",
" <th>lemma</th>\n",
" <th>like_num</th>\n",
" <th>pos</th>\n",
" <th>tag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>28191</th>\n",
" <td>News100-2338</td>\n",
" <td>3</td>\n",
" <td>101</td>\n",
" <td>The</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>the</td>\n",
" <td>False</td>\n",
" <td>DET</td>\n",
" <td>DT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28192</th>\n",
" <td>News100-2338</td>\n",
" <td>3</td>\n",
" <td>102</td>\n",
" <td>episode</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>episode</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28193</th>\n",
" <td>News100-2338</td>\n",
" <td>3</td>\n",
" <td>103</td>\n",
" <td>started</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>start</td>\n",
" <td>False</td>\n",
" <td>VERB</td>\n",
" <td>VBD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28194</th>\n",
" <td>News100-2338</td>\n",
" <td>3</td>\n",
" <td>104</td>\n",
" <td>off</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>off</td>\n",
" <td>False</td>\n",
" <td>ADP</td>\n",
" <td>RP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28195</th>\n",
" <td>News100-2338</td>\n",
" <td>3</td>\n",
" <td>105</td>\n",
" <td>with</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>with</td>\n",
" <td>False</td>\n",
" <td>ADP</td>\n",
" <td>IN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" doc sent position token is_punct is_stop lemma \\\n",
"28191 News100-2338 3 101 The False True the \n",
"28192 News100-2338 3 102 episode False False episode \n",
"28193 News100-2338 3 103 started False False start \n",
"28194 News100-2338 3 104 off False True off \n",
"28195 News100-2338 3 105 with False True with \n",
"\n",
" like_num pos tag \n",
"28191 False DET DT \n",
"28192 False NOUN NN \n",
"28193 False VERB VBD \n",
"28194 False ADP RP \n",
"28195 False ADP IN "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"toktbl[(toktbl.doc == 'News100-2338') & (toktbl.sent == 3)].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"We can do much more with text corpora in terms of accessing and transforming their contents. This is shown in great detail in the [chapter on text preprocessing](preprocessing.ipynb).\n",
"\n",
"Next, we proceed with [working with text corpora](text_corpora.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"metadata": {
"collapsed": false
},
"source": []
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: doc/source/index.rst
================================================
.. tmtoolkit documentation master file, created by
sphinx-quickstart on Tue Aug 27 11:30:06 2019.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
.. include:: intro.rst
.. include:: license_note.rst
.. toctree::
:maxdepth: 4
:caption: Contents:
install
getting_started
text_corpora
preprocessing
bow
topic_modeling
api
development
version_history
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: doc/source/install.rst
================================================
.. _install:
Installation
============
Requirements
------------
**tmtoolkit works with Python 3.8 or newer (tested up to Python 3.10).**
Requirements are automatically installed via *pip* as described below. Additional packages can also be installed
via *pip* for certain use cases (see :ref:`optional_packages`).
Installation instructions
-------------------------
The package *tmtoolkit* is available on `PyPI <https://pypi.org/project/tmtoolkit/>`_ and can be installed via
Python package manager *pip*. It is highly recommended to install tmtoolkit and its dependencies in a separate
`Python Virtual Environment ("venv") <https://docs.python.org/3/tutorial/venv.html>`_ and upgrade to the latest
*pip* version (you may also choose to install
`virtualenvwrapper <https://virtualenvwrapper.readthedocs.io/en/latest/>`_, which makes managing venvs a lot
easier).
Creating and activating a venv *without* virtualenvwrapper:
.. code-block:: text
python3 -m venv myenv
# activating the environment (on Windows type "myenv\Scripts\activate.bat")
source myenv/bin/activate
Alternatively, creating and activating a venv *with* virtualenvwrapper:
.. code-block:: text
mkvirtualenv myenv
# activating the environment
workon myenv
Upgrading pip (*only* do this when you've activated your venv):
.. code-block:: text
pip install -U pip
The tmtoolkit package is highly modular and tries to install as few software dependencies as possible. So in order to
install tmtoolkit, you can first choose whether you want a minimal installation or a recommended set of
packages that enable most features. For the recommended installation, you can type **one of the following**, depending
on the preferred package for topic modeling:
.. code-block:: text
# recommended installation without topic modeling
pip install -U "tmtoolkit[recommended]"
# recommended installation with "lda" for topic modeling
pip install -U "tmtoolkit[recommended,lda]"
# recommended installation with "scikit-learn" for topic modeling
pip install -U "tmtoolkit[recommended,sklearn]"
# recommended installation with "gensim" for topic modeling
pip install -U "tmtoolkit[recommended,gensim]"
# you may also select several topic modeling packages
pip install -U "tmtoolkit[recommended,lda,sklearn,gensim]"
The **minimal** installation will install only a base set of dependencies and enable only the modules for BoW
statistics, token sequence operations, topic modeling and utility functions. You can install it as follows:
.. code-block:: text
# alternative installation if you only want to install a minimum set of dependencies
pip install -U tmtoolkit
.. note:: The tmtoolkit package is about 7 MB in size because it contains some example corpora.
.. _setup:
**After that, you should initially run tmtoolkit's setup routine.** This makes sure that all required data files are
present and downloads them if necessary. You should specify a list of languages for which language models should be
downloaded and installed. The list of available language models corresponds with the models provided by
`SpaCy <https://spacy.io/usage/models#languages>`_ (except for "multi-language"). You need to specify the two-letter ISO
language code for the language models that you want to install. **Don't use spaces in the list of languages.**
E.g. in order to install models for English and German:
.. code-block:: text
python -m tmtoolkit setup en,de
To install *all* available language models, you can run:
.. code-block:: text
python -m tmtoolkit setup all
.. _optional_packages:
Optional packages
-----------------
For additional features, you can install further packages using the following installation options:
- ``pip install -U tmtoolkit[textproc_extra]`` for Unicode normalization and simplification and for stemming with *nltk*
- ``pip install -U tmtoolkit[wordclouds]`` for generating word clouds
- ``pip install -U tmtoolkit[lda]`` for topic modeling with LDA
- ``pip install -U tmtoolkit[sklearn]`` for topic modeling with scikit-learn
- ``pip install -U tmtoolkit[gensim]`` for topic modeling and additional evaluation metrics with Gensim
- ``pip install -U tmtoolkit[topic_modeling_eval_extra]`` for topic modeling evaluation metrics ``griffiths_2004`` and
``held_out_documents_wallach09`` (see further information below)
For LDA evaluation metrics ``griffiths_2004`` and ``held_out_documents_wallach09`` it is necessary to install
`gmpy2 <https://github.com/aleaxit/gmpy>`_ for multiple-precision arithmetic. This in turn requires installing some C
header libraries for GMP, MPFR and MPC. On Debian/Ubuntu systems this is done with:
.. code-block:: text
sudo apt install libgmp-dev libmpfr-dev libmpc-dev
================================================
FILE: doc/source/intro.rst
================================================
tmtoolkit: Text mining and topic modeling toolkit
=================================================
|pypi| |pypi_downloads| |rtd| |runtests| |coverage| |zenodo|
*tmtoolkit* is a set of tools for text mining and topic modeling with Python developed especially for use in the
social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation
and a clear programming interface while offering good performance on large datasets by means of vectorized
operations (via NumPy) and parallel computation (using Python's *multiprocessing* module and the
`loky <https://loky.readthedocs.io/>`_ package). tmtoolkit's text mining capabilities are built around
`SpaCy <https://spacy.io/>`_, which offers `many language models <https://spacy.io/models>`_. Currently,
the following languages are supported for text mining:
- Catalan
- Chinese
- Danish
- Dutch
- English
- French
- German
- Greek
- Italian
- Japanese
- Lithuanian
- Macedonian
- Norwegian Bokmål
- Polish
- Portuguese
- Romanian
- Russian
- Spanish
The documentation for tmtoolkit is available on `tmtoolkit.readthedocs.org <https://tmtoolkit.readthedocs.org>`_ and
the GitHub code repository is on
`github.com/WZBSocialScienceCenter/tmtoolkit <https://github.com/WZBSocialScienceCenter/tmtoolkit>`_.
Features
--------
Text preprocessing and text mining
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The tmtoolkit package offers several text preprocessing and text mining methods, including:
- `tokenization, sentence segmentation, part-of-speech (POS) tagging, named-entity recognition (NER) <text_corpora.ipynb#Configuring-the-NLP-pipeline,-parallel-processing-and-more-via-Corpus-parameters>`_ (via SpaCy)
- `lemmatization and token normalization <preprocessing.ipynb#Lemmatization-and-token-normalization>`_
- extensive `pattern matching capabilities <preprocessing.ipynb#Common-parameters-for-pattern-matching-functions>`_
(exact matching, regular expressions or "glob" patterns) to be used in many
methods of the package, e.g. for filtering on token or document level, or for
`keywords-in-context (KWIC) <preprocessing.ipynb#Keywords-in-context-(KWIC)-and-general-filtering-methods>`_
- adding and managing
`custom document and token attributes <preprocessing.ipynb#Working-with-document-and-token-attributes>`_
- accessing text corpora along with their
`document and token attributes as dataframes <preprocessing.ipynb#Accessing-tokens-and-token-attributes>`_
- calculating and `visualizing corpus summary statistics <preprocessing.ipynb#Visualizing-corpus-summary-statistics>`_
- identifying and joining `collocations <preprocessing.ipynb#Identifying-and-joining-token-collocations>`_
- `splitting and sampling corpora <text_corpora.ipynb#Corpus-functions-for-document-management>`_
- generating `n-grams <preprocessing.ipynb#Generating-n-grams>`_
- generating `sparse document-term matrices <preprocessing.ipynb#Generating-a-sparse-document-term-matrix-(DTM)>`_
Wherever possible and useful, these methods can operate in parallel to speed up computations with large datasets.
Topic modeling
^^^^^^^^^^^^^^
- `model computation in parallel <topic_modeling.ipynb#Computing-topic-models-in-parallel>`_ for different corpora
and/or parameter sets
- support for `lda <http://pythonhosted.org/lda/>`_,
`scikit-learn <http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html>`_
and `gensim <https://radimrehurek.com/gensim/>`_ topic modeling backends
- `evaluation of topic models <topic_modeling.ipynb#Evaluation-of-topic-models>`_ (e.g. in order to find an optimal number
of topics for a given dataset) using several implemented metrics:
- model coherence (`Mimno et al. 2011 <https://dl.acm.org/citation.cfm?id=2145462>`_) or with
`metrics implemented in Gensim <https://radimrehurek.com/gensim/models/coherencemodel.html>`_
- KL divergence method (`Arun et al. 2010 <http://doi.org/10.1007/978-3-642-13657-3_43>`_)
- probability of held-out documents (`Wallach et al. 2009 <https://doi.org/10.1145/1553374.1553515>`_)
- pair-wise cosine distance method (`Cao Juan et al. 2009 <http://doi.org/10.1016/j.neucom.2008.06.011>`_)
- harmonic mean method (`Griffiths, Steyvers 2004 <http://doi.org/10.1073/pnas.0307752101>`_)
- the loglikelihood or perplexity methods natively implemented in lda, sklearn or gensim
- `plotting of evaluation results <topic_modeling.ipynb#Evaluation-of-topic-models>`_
- `common statistics for topic models <topic_modeling.ipynb#Common-statistics-and-tools-for-topic-models>`_ such as
word saliency and distinctiveness (`Chuang et al. 2012 <https://dl.acm.org/citation.cfm?id=2254572>`_), topic-word
relevance (`Sievert and Shirley 2014 <https://www.aclweb.org/anthology/W14-3110>`_)
- `finding / filtering topics with pattern matching <topic_modeling.ipynb#Filtering-topics>`_
- `export estimated document-topic and topic-word distributions to Excel
<topic_modeling.ipynb#Displaying-and-exporting-topic-modeling-results>`_
- `visualize topic-word distributions and document-topic distributions <topic_modeling.ipynb#Visualizing-topic-models>`_
as word clouds or heatmaps
- model coherence (`Mimno et al. 2011 <https://dl.acm.org/citation.cfm?id=2145462>`_) for individual topics
- integrate `PyLDAVis <https://pyldavis.readthedocs.io/en/latest/>`_ to visualize results
Other features
^^^^^^^^^^^^^^
- loading and cleaning of raw text from
`text files, tabular files (CSV or Excel), ZIP files or folders <text_corpora.ipynb#Loading-text-data>`_
- `splitting and joining documents <text_corpora.ipynb#Corpus-functions-for-document-management>`_
- `common statistics and transformations for document-term matrices <bow.ipynb>`_ like word cooccurrence and *tf-idf*
Limits
------
- only languages for which `SpaCy language models <https://spacy.io/models>`_ are available are supported
- all data must reside in memory, i.e. no streaming of large data from the hard disk (which for example
`Gensim <https://radimrehurek.com/gensim/>`_ supports)
Built-in datasets
-----------------
Currently, tmtoolkit comes with the following built-in datasets, which can be loaded via
:meth:`~tmtoolkit.corpus.Corpus.from_builtin_corpus`:
- *"en-NewsArticles"*: `News Articles <https://doi.org/10.7910/DVN/GMFCTR>`_
*(Dai, Tianru, 2017, "News Articles", https://doi.org/10.7910/DVN/GMFCTR, Harvard Dataverse, V1)*
- random samples from `ParlSpeech V2 <https://doi.org/10.7910/DVN/L4OAKN>`_
*(Rauh, Christian; Schwalbach, Jan, 2020, "The ParlSpeech V2 data set: Full-text corpora of 6.3 million parliamentary speeches in the key legislative chambers of nine representative democracies", https://doi.org/10.7910/DVN/L4OAKN, Harvard Dataverse)* for different languages:
- *"de-parlspeech-v2-sample-bundestag"*
- *"en-parlspeech-v2-sample-houseofcommons"*
- *"es-parlspeech-v2-sample-congreso"*
- *"nl-parlspeech-v2-sample-tweedekamer"*
About this documentation
------------------------
This documentation guides you in several chapters from installing tmtoolkit to its specific use cases and shows some
examples with built-in corpora and other datasets. All "hands on" chapters from
`Getting started <getting_started.ipynb>`_ to `Topic modeling <topic_modeling.ipynb>`_ are generated from
`Jupyter Notebooks <https://jupyter.org/>`_. If you want to follow along using these notebooks, you can download them
from the `GitHub repository <https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/doc/source>`_.
There are also a few other examples as plain Python scripts available in the
`examples folder <https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/examples>`_ of the GitHub repository.
.. |pypi| image:: https://badge.fury.io/py/tmtoolkit.svg
:target: https://badge.fury.io/py/tmtoolkit
:alt: PyPI Version
.. |pypi_downloads| image:: https://img.shields.io/pypi/dm/tmtoolkit
:target: https://pypi.org/project/tmtoolkit/
:alt: Downloads from PyPI
.. |runtests| image:: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml/badge.svg
:target: https://github.com/WZBSocialScienceCenter/tmtoolkit/actions/workflows/runtests.yml
:alt: GitHub Actions CI Build Status
.. |coverage| image:: https://raw.githubusercontent.com/WZBSocialScienceCenter/tmtoolkit/master/coverage.svg?sanitize=true
:target: https://github.com/WZBSocialScienceCenter/tmtoolkit/tree/master/tests
:alt: Coverage status
.. |rtd| image:: https://readthedocs.org/projects/tmtoolkit/badge/?version=latest
:target: https://tmtoolkit.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
.. |zenodo| image:: https://zenodo.org/badge/109812180.svg
:target: https://zenodo.org/badge/latestdoi/109812180
:alt: Citable Zenodo DOI
================================================
FILE: doc/source/license_note.rst
================================================
License
=======
Code licensed under `Apache License 2.0 <https://www.apache.org/licenses/LICENSE-2.0>`_.
See `LICENSE <https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/LICENSE>`_ file.
================================================
FILE: doc/source/preprocessing.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text preprocessing and basic text mining\n",
"\n",
"During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and then these tokens can be transformed, filtered or annotated. The goal is to prepare the raw document texts in a way that makes it easier to perform eventual text mining and analysis methods in a later stage, e.g. by reducing noise in the dataset. The package tmtoolkit provides a rich set of tools for this purpose implemented as *corpus functions* in the [tmtoolkit.corpus](api.rst#tmtoolkit-corpus) module.\n",
"\n",
"<div class=\"alert alert-info\">\n",
"\n",
"**Reminder: Corpus functions**\n",
"\n",
"All *corpus functions* accept a [Corpus](api.rst#tmtoolkit.corpus.Corpus) object as first argument and operate on it. A corpus function may retrieve information from a corpus and/or modify the corpus object.\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: enabling logging output\n",
"\n",
"By default, tmtoolkit does not expose any internal logging messages. Sometimes, for example for diagnostic output during debugging or in order to see progress for long running operations, it's helpful to enable logging output display. For that, you can use the [enable_logging](api.rst#tmtoolkit.utils.enable_logging) function. By default, it enables logging to console for the `INFO` level."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:30.666116Z",
"iopub.status.busy": "2022-03-11T08:31:30.665273Z",
"iopub.status.idle": "2022-03-11T08:31:33.130734Z",
"shell.execute_reply": "2022-03-11T08:31:33.130266Z"
}
},
"outputs": [],
"source": [
"from tmtoolkit.utils import enable_logging\n",
"\n",
"enable_logging()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading example data\n",
"\n",
"Let's load a sample of ten documents from the built-in *NewsArticles* dataset. We'll use only a small number of documents here to have a better overview at the beginning. We can later use a larger sample. To apply sampling right at the beginning when loading the data, we pass the `sample=100` parameter to the [from_builtin_corpus](api.rst#tmtoolkit.corpus.Corpus.from_builtin_corpus) class method. We also use [print_summary](api.rst#tmtoolkit.corpus.print_summary) like shown in the previous chapter."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:33.136940Z",
"iopub.status.busy": "2022-03-11T08:31:33.136002Z",
"iopub.status.idle": "2022-03-11T08:31:44.590577Z",
"shell.execute_reply": "2022-03-11T08:31:44.591189Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-03-11 09:31:33,939:INFO:tmtoolkit:creating Corpus instance with no documents\n",
"2022-03-11 09:31:33,940:INFO:tmtoolkit:using serial processing\n",
"2022-03-11 09:31:34,666:INFO:tmtoolkit:sampling 100 documents(s) out of 3824\n",
"2022-03-11 09:31:34,668:INFO:tmtoolkit:adding text from 100 documents(s)\n",
"2022-03-11 09:31:34,669:INFO:tmtoolkit:running NLP pipeline on 100 documents\n",
"2022-03-11 09:31:44,583:INFO:tmtoolkit:generating document texts\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Corpus with 100 documents in English\n",
"> NewsArticles-1185 (1271 tokens): For more than a week,-France - has been rocked by ...\n",
"> NewsArticles-1100 (224 tokens): President Trump says he has asked the Justice Depa...\n",
"> NewsArticles-1515 (426 tokens): Trump suggests Obama was ' behind ' town hall prot...\n",
"> NewsArticles-1353 (30 tokens): Islamic State battle : Fierce gunfight outside Mos...\n",
"> NewsArticles-1472 (298 tokens): Royal Bank of Scotland sees losses widening Bai...\n",
"> NewsArticles-1407 (202 tokens): Minister reiterates Govt support for Finucane inqu...\n",
"> NewsArticles-1377 (774 tokens): Turkey - backed rebels in ' near full control ' of...\n",
"> NewsArticles-1263 (410 tokens): Russian doctors use mobile field hospital to provi...\n",
"> NewsArticles-1387 (513 tokens): Protests after Anaheim policeman drags teen , fire...\n",
"> NewsArticles-1119 (975 tokens): An amazing moment in history : Donald Trump 's pre...\n",
"(and 90 more documents)\n",
"total number of tokens: 59598 / vocabulary size: 9223\n"
]
}
],
"source": [
"import random\n",
"random.seed(20220119) # to make the sampling reproducible\n",
"\n",
"from tmtoolkit.corpus import Corpus, print_summary\n",
"\n",
"corpus_small = Corpus.from_builtin_corpus('en-NewsArticles', sample=100)\n",
"print_summary(corpus_small)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The logging information was printed on red, the information below on white came from `print_summary`. We will disable logging again using [disable_logging](api.rst#tmtoolkit.utils.disable_logging):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.595588Z",
"iopub.status.busy": "2022-03-11T08:31:44.594957Z",
"iopub.status.idle": "2022-03-11T08:31:44.597243Z",
"shell.execute_reply": "2022-03-11T08:31:44.597658Z"
}
},
"outputs": [],
"source": [
"from tmtoolkit.utils import disable_logging\n",
"\n",
"disable_logging()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are the names of the documents that were loaded:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.606796Z",
"iopub.status.busy": "2022-03-11T08:31:44.605979Z",
"iopub.status.idle": "2022-03-11T08:31:44.609384Z",
"shell.execute_reply": "2022-03-11T08:31:44.610049Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['NewsArticles-1100',\n",
" 'NewsArticles-1119',\n",
" 'NewsArticles-1185',\n",
" 'NewsArticles-1263',\n",
" 'NewsArticles-1353',\n",
" 'NewsArticles-1377',\n",
" 'NewsArticles-1387',\n",
" 'NewsArticles-1407',\n",
" 'NewsArticles-1472',\n",
" 'NewsArticles-1515',\n",
" 'NewsArticles-1519',\n",
" 'NewsArticles-1546',\n",
" 'NewsArticles-1561',\n",
" 'NewsArticles-1587',\n",
" 'NewsArticles-1589',\n",
" 'NewsArticles-1610',\n",
" 'NewsArticles-162',\n",
" 'NewsArticles-169',\n",
" 'NewsArticles-1777',\n",
" 'NewsArticles-1787',\n",
" ...]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import doc_labels\n",
"\n",
"doc_labels(corpus_small)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accessing tokens and token attributes\n",
"\n",
"We start with accessing the documents' tokens and their *token attributes* using [doc_tokens](api.rst#tmtoolkit.corpus.doc_tokens) and [tokens_table](api.rst#tmtoolkit.corpus.tokens_table). Token attributes are meta information attached to each token. These can be linguistic features, such as the Part of Speech (POS) tag, indicators for stopwords or punctuation, etc. The default attributes are a subset of [SpaCy's token attributes](https://spacy.io/api/token#attributes). You can configure which of these attributes are stored using the `spacy_token_attrs` parameter of the `Corpus` constructor. You can also add your own token attributes. This will be shown later on.\n",
"\n",
"At first we load the tokens along with their attributes via `doc_tokens`, which gives us a dictionary mapping document labels to document data. Each document data is another dictionary that contains the tokens and their attributes. We start by checking which token attributes are loaded by default in any document (here, we use \"NewsArticles-2433\"):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.636673Z",
"iopub.status.busy": "2022-03-11T08:31:44.614540Z",
"iopub.status.idle": "2022-03-11T08:31:44.758067Z",
"shell.execute_reply": "2022-03-11T08:31:44.757172Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['token', 'is_punct', 'is_stop', 'like_num', 'tag', 'pos', 'lemma'])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tmtoolkit.corpus import doc_tokens, tokens_table\n",
"\n",
"# with_attr=True adds default set of token attributes\n",
"tok = doc_tokens(corpus_small, with_attr=True)\n",
"tok['NewsArticles-2433'].keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So each document's data can be accessed like in the example above and it will contain the seven data entries listed above. The `'token'` entry gives the actual tokens of the document. Let's show the first five tokens for a document:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.768251Z",
"iopub.status.busy": "2022-03-11T08:31:44.766129Z",
"iopub.status.idle": "2022-03-11T08:31:44.773772Z",
"shell.execute_reply": "2022-03-11T08:31:44.773042Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['DOJ', ':', '2', 'Russian', 'spies']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tok['NewsArticles-2433']['token'][:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other entries are the attributes corresponding to each token. Here, we display the first five lemmata for the same document and the first five punctuation indicator values. The colon is correctly identified as punctuation character."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.780683Z",
"iopub.status.busy": "2022-03-11T08:31:44.779864Z",
"iopub.status.idle": "2022-03-11T08:31:44.783580Z",
"shell.execute_reply": "2022-03-11T08:31:44.784216Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['doj', ':', '2', 'russian', 'spy']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tok['NewsArticles-2433']['lemma'][:5]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.792224Z",
"iopub.status.busy": "2022-03-11T08:31:44.791187Z",
"iopub.status.idle": "2022-03-11T08:31:44.794865Z",
"shell.execute_reply": "2022-03-11T08:31:44.795563Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[False, True, False, False, False]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tok['NewsArticles-2433']['is_punct'][:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your NLP pipeline performs sentence recognition, you can pass the parameter `sentences=True` which will add another level to the output representing sentences. This means that for each item like `'token'`, `'lemma'`, etc. we will get a list of sentences. For example, the following will print the tokens of the 8th sentence (index 7):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:44.806125Z",
"iopub.status.busy": "2022-03-11T08:31:44.805251Z",
"iopub.status.idle": "2022-03-11T08:31:44.988802Z",
"shell.execute_reply": "2022-03-11T08:31:44.989413Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['A',\n",
" 'Justice',\n",
" 'Department',\n",
" 'official',\n",
" 'said',\n",
" 'the',\n",
" 'agency',\n",
" 'has',\n",
" 'not',\n",
" 'confirmed',\n",
" 'it',\n",
" 'is',\n",
" 'the',\n",
" 'same',\n",
" 'person',\n",
" 'and',\n",
" 'declined',\n",
" 'further',\n",
" 'comment',\n",
" 'to',\n",
" ...]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tok_sents = doc_tokens(corpus_small, sentences=True, with_attr=True)\n",
"tok_sents['NewsArticles-2433']['token'][7] # index 7 means 8th sentence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a more compact overview, it's better to use the [tokens_table](api.rst#tmtoolkit.corpus.tokens_table) function. This will generate a [pandas DataFrame](https://pandas.pydata.org/) from the documents in the corpus and it will by default include all token attributes, along with a column for the document label (`doc`) and the token position inside the document (`position`)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:45.005773Z",
"iopub.status.busy": "2022-03-11T08:31:45.001847Z",
"iopub.status.idle": "2022-03-11T08:31:45.324859Z",
"shell.execute_reply": "2022-03-11T08:31:45.325499Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>is_punct</th>\n",
" <th>is_stop</th>\n",
" <th>lemma</th>\n",
" <th>like_num</th>\n",
" <th>pos</th>\n",
" <th>tag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NewsArticles-1100</td>\n",
" <td>0</td>\n",
" <td>President</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>President</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NewsArticles-1100</td>\n",
" <td>1</td>\n",
" <td>Trump</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>Trump</td>\n",
" <td>False</td>\n",
" <td>PROPN</td>\n",
" <td>NNP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NewsArticles-1100</td>\n",
" <td>2</td>\n",
" <td>says</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>say</td>\n",
" <td>False</td>\n",
" <td>VERB</td>\n",
" <td>VBZ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NewsArticles-1100</td>\n",
" <td>3</td>\n",
" <td>he</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>he</td>\n",
" <td>False</td>\n",
" <td>PRON</td>\n",
" <td>PRP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NewsArticles-1100</td>\n",
" <td>4</td>\n",
" <td>has</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>have</td>\n",
" <td>False</td>\n",
" <td>AUX</td>\n",
" <td>VBZ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59593</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>282</td>\n",
" <td>priorities</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>priority</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NNS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59594</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>283</td>\n",
" <td>for</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>for</td>\n",
" <td>False</td>\n",
" <td>ADP</td>\n",
" <td>IN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59595</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>284</td>\n",
" <td>the</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>the</td>\n",
" <td>False</td>\n",
" <td>DET</td>\n",
" <td>DT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59596</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>285</td>\n",
" <td>nation</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>nation</td>\n",
" <td>False</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59597</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>286</td>\n",
" <td>.</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>.</td>\n",
" <td>False</td>\n",
" <td>PUNCT</td>\n",
" <td>.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>59598 rows × 9 columns</p>\n",
"</div>"
],
"text/plain": [
" doc position token is_punct is_stop lemma \\\n",
"0 NewsArticles-1100 0 President False False President \n",
"1 NewsArticles-1100 1 Trump False False Trump \n",
"2 NewsArticles-1100 2 says False False say \n",
"3 NewsArticles-1100 3 he False True he \n",
"4 NewsArticles-1100 4 has False True have \n",
"... ... ... ... ... ... ... \n",
"59593 NewsArticles-960 282 priorities False False priority \n",
"59594 NewsArticles-960 283 for False True for \n",
"59595 NewsArticles-960 284 the False True the \n",
"59596 NewsArticles-960 285 nation False False nation \n",
"59597 NewsArticles-960 286 . True False . \n",
"\n",
" like_num pos tag \n",
"0 False PROPN NNP \n",
"1 False PROPN NNP \n",
"2 False VERB VBZ \n",
"3 False PRON PRP \n",
"4 False AUX VBZ \n",
"... ... ... ... \n",
"59593 False NOUN NNS \n",
"59594 False ADP IN \n",
"59595 False DET DT \n",
"59596 False NOUN NN \n",
"59597 False PUNCT . \n",
"\n",
"[59598 rows x 9 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl = tokens_table(corpus_small)\n",
"tbl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use all sorts of filtering operations on this dataframe. See the [pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html) for details. Here, we select all tokens that were identified as \"number-like\":"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:45.330709Z",
"iopub.status.busy": "2022-03-11T08:31:45.330112Z",
"iopub.status.idle": "2022-03-11T08:31:45.350282Z",
"shell.execute_reply": "2022-03-11T08:31:45.350715Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>is_punct</th>\n",
" <th>is_stop</th>\n",
" <th>lemma</th>\n",
" <th>like_num</th>\n",
" <th>pos</th>\n",
" <th>tag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>288</th>\n",
" <td>NewsArticles-1119</td>\n",
" <td>64</td>\n",
" <td>fifteen</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>fifteen</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>320</th>\n",
" <td>NewsArticles-1119</td>\n",
" <td>96</td>\n",
" <td>one</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>one</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>328</th>\n",
" <td>NewsArticles-1119</td>\n",
" <td>104</td>\n",
" <td>four</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>four</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>759</th>\n",
" <td>NewsArticles-1119</td>\n",
" <td>535</td>\n",
" <td>100</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>100</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>787</th>\n",
" <td>NewsArticles-1119</td>\n",
" <td>563</td>\n",
" <td>four</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>four</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59253</th>\n",
" <td>NewsArticles-901</td>\n",
" <td>856</td>\n",
" <td>85</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>85</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59256</th>\n",
" <td>NewsArticles-901</td>\n",
" <td>859</td>\n",
" <td>9</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>9</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59374</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>63</td>\n",
" <td>2021</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2021</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59400</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>89</td>\n",
" <td>2010</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2010</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59413</th>\n",
" <td>NewsArticles-960</td>\n",
" <td>102</td>\n",
" <td>1,550</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>1,550</td>\n",
" <td>True</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1139 rows × 9 columns</p>\n",
"</div>"
],
"text/plain": [
" doc position token is_punct is_stop lemma \\\n",
"288 NewsArticles-1119 64 fifteen False True fifteen \n",
"320 NewsArticles-1119 96 one False True one \n",
"328 NewsArticles-1119 104 four False True four \n",
"759 NewsArticles-1119 535 100 False False 100 \n",
"787 NewsArticles-1119 563 four False True four \n",
"... ... ... ... ... ... ... \n",
"59253 NewsArticles-901 856 85 False False 85 \n",
"59256 NewsArticles-901 859 9 False False 9 \n",
"59374 NewsArticles-960 63 2021 False False 2021 \n",
"59400 NewsArticles-960 89 2010 False False 2010 \n",
"59413 NewsArticles-960 102 1,550 False False 1,550 \n",
"\n",
" like_num pos tag \n",
"288 True NUM CD \n",
"320 True NUM CD \n",
"328 True NUM CD \n",
"759 True NUM CD \n",
"787 True NUM CD \n",
"... ... ... .. \n",
"59253 True NUM CD \n",
"59256 True NUM CD \n",
"59374 True NUM CD \n",
"59400 True NUM CD \n",
"59413 True NUM CD \n",
"\n",
"[1139 rows x 9 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl[tbl.like_num]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This however only filters the table *output*. We will later see how to filter corpus documents and tokens.\n",
"\n",
"If you want to generate the table only for a subset of documents, you can use the `select` parameter and provide one or more document labels. Similar to that, you can use the `with_attr` parameter to list only a subset of the token attributes."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:45.355739Z",
"iopub.status.busy": "2022-03-11T08:31:45.355195Z",
"iopub.status.idle": "2022-03-11T08:31:45.368407Z",
"shell.execute_reply": "2022-03-11T08:31:45.369059Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>sent</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>pos</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>DOJ</td>\n",
" <td>NOUN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>:</td>\n",
" <td>PUNCT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>NUM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Russian</td>\n",
" <td>ADJ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>spies</td>\n",
" <td>NOUN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>837</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>27</td>\n",
" <td>837</td>\n",
" <td>to</td>\n",
" <td>PART</td>\n",
" </tr>\n",
" <tr>\n",
" <th>838</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>27</td>\n",
" <td>838</td>\n",
" <td>reflect</td>\n",
" <td>VERB</td>\n",
" </tr>\n",
" <tr>\n",
" <th>839</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>27</td>\n",
" <td>839</td>\n",
" <td>new</td>\n",
" <td>ADJ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>840</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>27</td>\n",
" <td>840</td>\n",
" <td>developments</td>\n",
" <td>NOUN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>841</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>27</td>\n",
" <td>841</td>\n",
" <td>.</td>\n",
" <td>PUNCT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>842 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" doc sent position token pos\n",
"0 NewsArticles-2433 0 0 DOJ NOUN\n",
"1 NewsArticles-2433 0 1 : PUNCT\n",
"2 NewsArticles-2433 0 2 2 NUM\n",
"3 NewsArticles-2433 0 3 Russian ADJ\n",
"4 NewsArticles-2433 0 4 spies NOUN\n",
".. ... ... ... ... ...\n",
"837 NewsArticles-2433 27 837 to PART\n",
"838 NewsArticles-2433 27 838 reflect VERB\n",
"839 NewsArticles-2433 27 839 new ADJ\n",
"840 NewsArticles-2433 27 840 developments NOUN\n",
"841 NewsArticles-2433 27 841 . PUNCT\n",
"\n",
"[842 rows x 5 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select a single document and only show the \"pos\" attribute (coarse POS tag)\n",
"tokens_table(corpus_small, select='NewsArticles-2433', sentences=True, with_attr='pos')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2022-03-11T08:31:45.379087Z",
"iopub.status.busy": "2022-03-11T08:31:45.378266Z",
"iopub.status.idle": "2022-03-11T08:31:45.397700Z",
"shell.execute_reply": "2022-03-11T08:31:45.397266Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>doc</th>\n",
" <th>position</th>\n",
" <th>token</th>\n",
" <th>pos</th>\n",
" <th>tag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>0</td>\n",
" <td>DOJ</td>\n",
" <td>NOUN</td>\n",
" <td>NN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>1</td>\n",
" <td>:</td>\n",
" <td>PUNCT</td>\n",
" <td>:</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>NUM</td>\n",
" <td>CD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>3</td>\n",
" <td>Russian</td>\n",
" <td>ADJ</td>\n",
" <td>JJ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NewsArticles-2433</td>\n",
" <td>4</td>\n",
" <td>spies</td>\n",
" <td>NOUN</td>\n",
" <td>NNS</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1949</th>\n",
" <td>NewsArticles-49</td>\n",
" <td>1107</td>\n",
" <td>fight</td>\n",
" <td>VERB</td>\n",
" <td>VB</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1950</th>\n",
" <td>NewsArticles-49</td>\n",
" <td>1108</td>\n",
" <td>to</td>\n",
" <td>PART</td>\n",
" <td>TO</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1951</th>\n",
" <td>NewsArticles-49</td>\n",
" <td>1109</td>\n",
" <td>defend</td>\n",
" <td>VERB</td>\n",
" <td>VB</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1952</th>\n",
" <td>NewsArticles-49</td>\n",
" <td>1110</td>\n",
" <td>it</td>\n",
" <td>PRON</td>\n",
" <td>PRP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1953</th>\n",
" <td>NewsArticles-49</td>\n",
" <td>1111</td>\n",
" <td>.</td>\n",
" <td>PUNCT</td>\n",
" <td>.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1954 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" doc position token pos tag\n",
"0 NewsArticles-2433 0 DOJ NOUN NN\n",
"1 NewsArticles-2433 1 : PUNCT :\n",
"2 NewsArticles-2433 2 2 NUM CD\n",
"3 NewsArticles-2433 3 Russian ADJ JJ\n",
"4 NewsArticles-2433 4 spies NOUN NNS\n",
"... ... ... ... ... ...\n",
"1949 NewsArticles-49 1107 fight VERB VB\n",
"1950 NewsArticles-49 1108 to PART TO\n",
"1951 NewsArticles-49 1109 defend VERB VB\n",
"1952 NewsArticles-49 1110 it PRON PRP\n",
"1953 NewsArticles-49 1111 . PUNCT .\n",
"\n",
"[1954 rows x 5 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# select two documents and only show the \"pos\" and \"tag\" attributes\n",
"# (coarse and detailed POS tags)\n",
"tokens_table(corpus_small, select=['NewsArticles-2433', 'NewsArticles-49'],\n",
" with_attr=['pos', 'tag'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"**Side note: Common corpus function parameters**\n",
" \n",
"Many corpus functions share the same parameter names and when they do, they implicate the same behavior. As already explained, all corpus functions accept a `Corpus` object as first parameter. But next to that, many corpus functions also accept a `select` parameter, which can always be used to specify a subset of the documents to which the respective function is applied. We also already got to know the `sentences` parameter that some corpus functions accept in order to also represent the sentence structure of a document in their output.\n",
" \n",
"To know which functions accept which parameter, check their documentation.\n",
"\n",
"</div>"
]
},
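{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a small illustration of these shared parameters, the sketch below passes the same `select` argument to two corpus functions used earlier (which parameters a particular function supports is listed in the [API documentation](api.rst#tmtoolkit-corpus)):\n",
  "\n",
  "```python\n",
  "# restrict both corpus functions to a single document from corpus_small;\n",
  "# doc_tokens is assumed here to accept `select` in the same way as tokens_table\n",
  "doc_tokens(corpus_small, select='NewsArticles-2433', with_attr='pos')\n",
  "tokens_table(corpus_small, select='NewsArticles-2433', with_attr=['pos', 'tag'])\n",
  "```"
 ]
},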
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Corpus vocabulary\n",
"\n",
"The corpus *vocabulary* is the set of unique tokens (usually refered to as *token types*) in a corpus. We can get that se
SYMBOL INDEX (542 symbols across 34 files)
FILE: doc/source/conf.py
function skip (line 80) | def skip(app, what, name, obj, would_skip, options):
function setup (line 85) | def setup(app):
FILE: examples/_benchmarktools.py
function add_timing (line 8) | def add_timing(label):
function print_timings (line 13) | def print_timings():
FILE: examples/bundestag18_tfidf.py
function del_special_chars (line 88) | def del_special_chars(t):
function correct_contractions (line 97) | def correct_contractions(t):
function correct_hyphenation (line 104) | def correct_hyphenation(t):
FILE: tests/_testtools.py
function strategy_2d_array (line 7) | def strategy_2d_array(dtype, minval=0, maxval=None, **kwargs):
function strategy_dtm (line 30) | def strategy_dtm():
function strategy_dtm_small (line 34) | def strategy_dtm_small():
function strategy_2d_prob_distribution (line 38) | def strategy_2d_prob_distribution():
function strategy_tokens (line 42) | def strategy_tokens(*args, **kwargs):
function strategy_lists_of_tokens (line 46) | def strategy_lists_of_tokens(*args, **kwargs):
function strategy_texts (line 50) | def strategy_texts(*args, **kwargs):
function strategy_texts_printable (line 54) | def strategy_texts_printable():
function strategy_str_str_dict (line 58) | def strategy_str_str_dict(keys_args, keys_kwargs, values_args, values_kw...
function strategy_str_str_dict_printable (line 62) | def strategy_str_str_dict_printable():
FILE: tests/test_bow.py
function test_doc_lengths (line 24) | def test_doc_lengths(dtm, matrix_type):
function test_doc_frequencies (line 45) | def test_doc_frequencies(dtm, matrix_type):
function test_doc_frequencies2 (line 77) | def test_doc_frequencies2():
function test_codoc_frequencies (line 94) | def test_codoc_frequencies(dtm, matrix_type, proportions):
function test_codoc_frequencies2 (line 135) | def test_codoc_frequencies2():
function test_term_frequencies (line 153) | def test_term_frequencies(dtm, matrix_type):
function test_tf_binary (line 192) | def test_tf_binary(dtm, matrix_type):
function test_tf_proportions (line 232) | def test_tf_proportions(dtm, matrix_type):
function test_tf_log (line 261) | def test_tf_log(dtm, matrix_type):
function test_tf_double_norm (line 294) | def test_tf_double_norm(dtm, matrix_type, K):
function test_idf (line 321) | def test_idf(dtm, matrix_type):
function test_idf_probabilistic (line 341) | def test_idf_probabilistic(dtm, matrix_type):
function test_tfidf (line 367) | def test_tfidf(dtm, matrix_type, tf_func, K, idf_func, smooth, smooth_lo...
function test_tfidf_example (line 414) | def test_tfidf_example():
function test_sorted_terms (line 443) | def test_sorted_terms(dtm, matrix_type, lo_thresh, hi_thresh, top_n, asc...
function test_sorted_terms_example (line 492) | def test_sorted_terms_example():
function test_sorted_terms_table (line 526) | def test_sorted_terms_table(dtm, matrix_type, lo_thresh, hi_thresh, top_...
function test_dtm_to_dataframe (line 557) | def test_dtm_to_dataframe(dtm, matrix_type):
function test_dtm_to_gensim_corpus_and_gensim_corpus_to_dtm (line 589) | def test_dtm_to_gensim_corpus_and_gensim_corpus_to_dtm(dtm, matrix_type):
function test_dtm_and_vocab_to_gensim_corpus_and_dict (line 610) | def test_dtm_and_vocab_to_gensim_corpus_and_dict(dtm, matrix_type, as_ge...
FILE: tests/test_corpus.py
function spacy_instance_en_sm (line 57) | def spacy_instance_en_sm():
function corpus_en (line 62) | def corpus_en():
function corpus_en_module (line 67) | def corpus_en_module():
function corpora_en_serial_and_parallel (line 72) | def corpora_en_serial_and_parallel():
function corpora_en_serial_and_parallel_module (line 79) | def corpora_en_serial_and_parallel_module():
function corpora_en_serial_and_parallel_also_w_vectors_module (line 86) | def corpora_en_serial_and_parallel_also_w_vectors_module():
function corpus_de (line 94) | def corpus_de():
function corpus_de_module (line 99) | def corpus_de_module():
function test_datadirs (line 107) | def test_datadirs():
function test_fixtures_n_docs_and_doc_labels (line 113) | def test_fixtures_n_docs_and_doc_labels(corpus_en, corpus_de):
function test_corpus_no_lang_given (line 124) | def test_corpus_no_lang_given():
function test_empty_corpus (line 129) | def test_empty_corpus():
function test_corpus_init (line 147) | def test_corpus_init():
function test_corpus_init_md_model_required (line 253) | def test_corpus_init_md_model_required():
function test_corpus_init_and_properties_hypothesis (line 272) | def test_corpus_init_and_properties_hypothesis(spacy_instance_en_sm, doc...
function test_corpus_init_otherlang_by_langcode (line 346) | def test_corpus_init_otherlang_by_langcode():
function test_corpus_setitem_delitem (line 369) | def test_corpus_setitem_delitem(corpora_en_serial_and_parallel):
function test_corpus_iter_contains (line 404) | def test_corpus_iter_contains(corpora_en_serial_and_parallel):
function test_corpus_update (line 414) | def test_corpus_update(corpora_en_serial_and_parallel):
function test_doc_tokens_hypothesis (line 453) | def test_doc_tokens_hypothesis(corpora_en_serial_and_parallel_module, **...
function test_doc_lengths (line 577) | def test_doc_lengths(corpora_en_serial_and_parallel_module, select, as_t...
function test_doc_token_lengths (line 614) | def test_doc_token_lengths(corpora_en_serial_and_parallel_module, select):
function test_doc_num_sents (line 644) | def test_doc_num_sents(corpora_en_serial_and_parallel_module, select, as...
function test_doc_sent_lengths (line 685) | def test_doc_sent_lengths(corpora_en_serial_and_parallel_module, apply_f...
function test_doc_labels (line 716) | def test_doc_labels(corpora_en_serial_and_parallel_module, sort):
function test_doc_labels_sample (line 729) | def test_doc_labels_sample(corpora_en_serial_and_parallel_module, n):
function test_doc_texts (line 745) | def test_doc_texts(corpora_en_serial_and_parallel_module, collapse, sele...
function test_doc_frequencies (line 793) | def test_doc_frequencies(corpora_en_serial_and_parallel_module, proporti...
function test_doc_vectors (line 833) | def test_doc_vectors(corpora_en_serial_and_parallel_also_w_vectors_modul...
function test_token_vectors (line 866) | def test_token_vectors(corpora_en_serial_and_parallel_also_w_vectors_mod...
function test_spacydocs (line 905) | def test_spacydocs(corpora_en_serial_and_parallel_also_w_vectors_module,...
function test_vocabulary_hypothesis (line 937) | def test_vocabulary_hypothesis(corpora_en_serial_and_parallel_module, se...
function test_vocabulary_counts (line 986) | def test_vocabulary_counts(corpora_en_serial_and_parallel_module, select...
function test_vocabulary_size (line 1039) | def test_vocabulary_size(corpora_en_serial_and_parallel_module, select, ...
function test_tokens_table_hypothesis (line 1065) | def test_tokens_table_hypothesis(corpora_en_serial_and_parallel_module, ...
function test_corpus_tokens_flattened (line 1118) | def test_corpus_tokens_flattened(corpora_en_serial_and_parallel_module, ...
function test_corpus_num_tokens (line 1173) | def test_corpus_num_tokens(corpora_en_serial_and_parallel_module, select):
function test_corpus_num_chars (line 1187) | def test_corpus_num_chars(corpora_en_serial_and_parallel_module, select):
function test_corpus_unique_chars (line 1202) | def test_corpus_unique_chars(corpora_en_serial_and_parallel_module, sele...
function test_corpus_collocations_hypothesis (line 1232) | def test_corpus_collocations_hypothesis(corpora_en_serial_and_parallel_m...
function test_corpus_summary (line 1278) | def test_corpus_summary(corpora_en_serial_and_parallel_module, max_docum...
function test_print_summary (line 1300) | def test_print_summary(capsys, corpora_en_serial_and_parallel_module):
function test_dtm (line 1312) | def test_dtm(corpora_en_serial_and_parallel_module, select, as_table, dt...
function test_ngrams_hypothesis (line 1386) | def test_ngrams_hypothesis(corpora_en_serial_and_parallel_module, n, joi...
function test_kwic_hypothesis (line 1435) | def test_kwic_hypothesis(corpora_en_serial_and_parallel_module, **args):
function test_kwic_example (line 1582) | def test_kwic_example(corpora_en_serial_and_parallel_module):
function test_kwic_table_hypothesis (line 1625) | def test_kwic_table_hypothesis(corpora_en_serial_and_parallel_module, **...
function test_save_load_corpus (line 1710) | def test_save_load_corpus(corpora_en_serial_and_parallel_module):
function test_load_corpus_from_tokens_hypothesis (line 1732) | def test_load_corpus_from_tokens_hypothesis(corpora_en_serial_and_parall...
function test_load_corpus_from_tokens_table (line 1811) | def test_load_corpus_from_tokens_table(corpora_en_serial_and_parallel, w...
function test_serialize_deserialize_corpus (line 1852) | def test_serialize_deserialize_corpus(corpora_en_serial_and_parallel_mod...
function test_corpus_add_files_and_from_files (line 1875) | def test_corpus_add_files_and_from_files(corpora_en_serial_and_parallel,...
function test_corpus_add_folder_and_from_folder (line 1943) | def test_corpus_add_folder_and_from_folder(corpora_en_serial_and_paralle...
function test_corpus_add_tabular_and_from_tabular (line 2024) | def test_corpus_add_tabular_and_from_tabular(corpora_en_serial_and_paral...
function test_corpus_add_zip_and_from_zip (line 2114) | def test_corpus_add_zip_and_from_zip(corpora_en_serial_and_parallel, inp...
function test_corpus_from_builtin_corpus (line 2150) | def test_corpus_from_builtin_corpus(max_workers, sample):
function test_set_remove_document_attr (line 2185) | def test_set_remove_document_attr(corpora_en_serial_and_parallel, attrna...
function test_set_remove_token_attr (line 2235) | def test_set_remove_token_attr(corpora_en_serial_and_parallel, attrname,...
function test_corpus_retokenize (line 2297) | def test_corpus_retokenize(corpora_en_serial_and_parallel, testcase, inp...
function test_transform_tokens_upper_lower (line 2326) | def test_transform_tokens_upper_lower(corpora_en_serial_and_parallel, te...
function test_remove_chars_or_punctuation (line 2369) | def test_remove_chars_or_punctuation(corpora_en_serial_and_parallel, tes...
function test_normalize_unicode (line 2397) | def test_normalize_unicode(corpora_en_serial_and_parallel, inplace):
function test_simplify_unicode (line 2423) | def test_simplify_unicode(corpora_en_serial_and_parallel, method, inplace):
function test_numbers_to_magnitudes (line 2452) | def test_numbers_to_magnitudes(corpora_en_serial_and_parallel, inplace):
function test_lemmatize (line 2471) | def test_lemmatize(corpora_en_serial_and_parallel, inplace):
function test_join_collocations_by_patterns (line 2494) | def test_join_collocations_by_patterns(corpora_en_serial_and_parallel, t...
function test_join_collocations_by_statistic_hypothesis (line 2572) | def test_join_collocations_by_statistic_hypothesis(corpora_en_serial_and...
function test_filter_tokens_by_mask (line 2624) | def test_filter_tokens_by_mask(corpora_en_serial_and_parallel, inverse, ...
function test_filter_tokens (line 2670) | def test_filter_tokens(corpora_en_serial_and_parallel, testtype, search_...
function test_filter_tokens_custom_attr_bug (line 2716) | def test_filter_tokens_custom_attr_bug(corpora_en_serial_and_parallel):
function test_filter_for_pos (line 2738) | def test_filter_for_pos(corpora_en_serial_and_parallel, testtype, search...
function test_filter_tokens_by_doc_frequency (line 2775) | def test_filter_tokens_by_doc_frequency(corpora_en_serial_and_parallel, ...
function test_filter_documents (line 2846) | def test_filter_documents(corpora_en_serial_and_parallel, testtype, sear...
function test_filter_documents_by_docattr (line 2924) | def test_filter_documents_by_docattr(corpora_en_serial_and_parallel, tes...
function test_filter_documents_by_length (line 3004) | def test_filter_documents_by_length(corpora_en_serial_and_parallel, test...
function test_filter_clean_tokens (line 3051) | def test_filter_clean_tokens(corpora_en_serial_and_parallel, remove_punc...
function test_filter_tokens_with_kwic (line 3105) | def test_filter_tokens_with_kwic(corpora_en_serial_and_parallel, testtyp...
function test_corpus_ngramify (line 3164) | def test_corpus_ngramify(corpora_en_serial_and_parallel, n, join_str, in...
function test_corpus_sample (line 3203) | def test_corpus_sample(corpora_en_serial_and_parallel, n, inplace):
function test_corpus_split_by_paragraph (line 3224) | def test_corpus_split_by_paragraph(corpora_en_serial_and_parallel, inpla...
function test_corpus_join_documents (line 3258) | def test_corpus_join_documents(corpora_en_serial_and_parallel, join, glu...
function test_builtin_corpora_info (line 3301) | def test_builtin_corpora_info(with_paths):
function test_corpus_workflow_example1 (line 3330) | def test_corpus_workflow_example1(corpora_en_serial_and_parallel):
function _check_corpus_inplace_modif (line 3400) | def _check_corpus_inplace_modif(corp_a, corp_b, inplace, check_attrs=Non...
function _check_corpus_docs (line 3413) | def _check_corpus_docs(corp: c.Corpus, has_sents: bool):
function _check_copies (line 3434) | def _check_copies(corp_a, corp_b, same_nlp_instance):
function _check_copies_attrs (line 3446) | def _check_copies_attrs(corp_a, corp_b, check_attrs=None, dont_check_att...
function _dataframes_equal (line 3478) | def _dataframes_equal(df1, df2, require_same_index=True):
FILE: tests/test_corpusimport.py
function test_import_corpus (line 12) | def test_import_corpus():
FILE: tests/test_tokenseq.py
function test_token_lengths (line 29) | def test_token_lengths(tokens, expected):
function test_token_lengths_hypothesis (line 35) | def test_token_lengths_hypothesis(tokens, as_array):
function test_unique_chars_hypothesis (line 47) | def test_unique_chars_hypothesis(tokens):
function test_collapse_tokens (line 62) | def test_collapse_tokens(tokens, tokens_as_array, collapse, collapse_as_...
function test_simplify_unicode_chars (line 96) | def test_simplify_unicode_chars(token, method, ascii_encoding_errors):
function test_strip_tags (line 129) | def test_strip_tags(value, expected):
function test_pmi_hypothesis (line 138) | def test_pmi_hypothesis(xy, as_prob, n_total_factor, k, normalize):
function test_simple_collocation_counts_hypothesis (line 172) | def test_simple_collocation_counts_hypothesis(xy):
function test_token_collocations (line 231) | def test_token_collocations(args, expected):
function test_token_collocations_hypothesis (line 252) | def test_token_collocations_hypothesis(sentences, threshold, min_count, ...
function test_token_match (line 325) | def test_token_match(pattern, tokens, match_type, ignore_case, glob_meth...
function test_token_match_multi_pattern (line 342) | def test_token_match_multi_pattern(pattern, tokens, match_type, ignore_c...
function test_token_match_subsequent (line 347) | def test_token_match_subsequent():
function test_token_match_subsequent_hypothesis (line 372) | def test_token_match_subsequent_hypothesis(tokens, n_patterns):
function test_token_glue_subsequent (line 399) | def test_token_glue_subsequent():
function test_token_glue_subsequent_hypothesis (line 416) | def test_token_glue_subsequent_hypothesis(tokens, n_patterns):
function test_token_ngrams_hypothesis (line 449) | def test_token_ngrams_hypothesis(tokens, n, join, join_str, ngram_contai...
function test_numbertoken_to_magnitude (line 527) | def test_numbertoken_to_magnitude(numbertoken, char, firstchar, below_on...
FILE: tests/test_topicmod__eval_tools.py
function test_split_dtm_for_cross_validation (line 15) | def test_split_dtm_for_cross_validation(dtm, matrix_type, n_folds):
FILE: tests/test_topicmod_evaluate.py
function test_metric_held_out_documents_wallach09 (line 18) | def test_metric_held_out_documents_wallach09():
function test_compute_models_parallel_lda_multi_vs_singleproc (line 100) | def test_compute_models_parallel_lda_multi_vs_singleproc():
function test_compute_models_parallel_lda_multiple_docs (line 130) | def test_compute_models_parallel_lda_multiple_docs():
function test_evaluation_all_engines_unavail_metric (line 204) | def test_evaluation_all_engines_unavail_metric():
function test_evaluation_lda_all_metrics_multi_vs_singleproc (line 213) | def test_evaluation_lda_all_metrics_multi_vs_singleproc():
function test_evaluation_gensim_all_metrics (line 279) | def test_evaluation_gensim_all_metrics():
function test_compute_models_parallel_gensim (line 313) | def test_compute_models_parallel_gensim():
function test_compute_models_parallel_gensim_multiple_docs (line 328) | def test_compute_models_parallel_gensim_multiple_docs():
function test_evaluation_sklearn_all_metrics (line 390) | def test_evaluation_sklearn_all_metrics():
function test_compute_models_parallel_sklearn (line 430) | def test_compute_models_parallel_sklearn():
function test_compute_models_parallel_sklearn_multiple_docs (line 445) | def test_compute_models_parallel_sklearn_multiple_docs():
function test_results_by_parameter_single_validation (line 509) | def test_results_by_parameter_single_validation(n_param_sets, n_params, ...
FILE: tests/test_topicmod_model_io.py
function test_save_load_ldamodel_pickle (line 16) | def test_save_load_ldamodel_pickle():
function test_ldamodel_top_topic_words (line 45) | def test_ldamodel_top_topic_words(topic_word, top_n):
function test_ldamodel_top_word_topics (line 67) | def test_ldamodel_top_word_topics(topic_word, top_n):
function test_ldamodel_top_doc_topics (line 88) | def test_ldamodel_top_doc_topics(doc_topic, top_n):
function test_ldamodel_top_topic_docs (line 109) | def test_ldamodel_top_topic_docs(doc_topic, top_n):
function test_ldamodel_full_topic_words (line 128) | def test_ldamodel_full_topic_words(topic_word):
function test_ldamodel_full_doc_topics (line 143) | def test_ldamodel_full_doc_topics(doc_topic):
function test_save_ldamodel_summary_to_excel (line 163) | def test_save_ldamodel_summary_to_excel(n_docs, n_topics, size_vocab, to...
FILE: tests/test_topicmod_model_stats.py
function test_top_n_from_distribution (line 17) | def test_top_n_from_distribution(n, distrib):
function test_top_words_for_topics (line 40) | def test_top_words_for_topics(topic_word_distrib, vocab, top_n):
function test_top_words_for_topics2 (line 66) | def test_top_words_for_topics2():
function test_get_marginal_topic_distrib (line 103) | def test_get_marginal_topic_distrib(dtm, n_topics):
function test_get_marginal_word_distrib (line 127) | def test_get_marginal_word_distrib(dtm, n_topics):
function test_get_word_distinctiveness (line 152) | def test_get_word_distinctiveness(dtm, n_topics):
function test_get_word_saliency (line 177) | def test_get_word_saliency(dtm, n_topics):
function test_get_most_or_least_salient_words (line 201) | def test_get_most_or_least_salient_words(dtm, n_topics, n_salient_words):
function test_get_most_or_least_distinct_words (line 237) | def test_get_most_or_least_distinct_words(dtm, n_topics, n_distinct_words):
function test_get_topic_word_relevance (line 273) | def test_get_topic_word_relevance(dtm, n_topics, lambda_):
function test_get_most_or_least_relevant_words_for_topic (line 299) | def test_get_most_or_least_relevant_words_for_topic(dtm, n_topics, lambd...
function test_generate_topic_labels_from_top_words (line 336) | def test_generate_topic_labels_from_top_words(dtm, n_topics, lambda_):
function test_filter_topics (line 377) | def test_filter_topics():
function test_exclude_topics (line 461) | def test_exclude_topics(exclude, pass_topic_word, renormalize, return_ne...
FILE: tests/test_topicmod_visualize.py
function test_generate_wordclouds_for_topic_words (line 15) | def test_generate_wordclouds_for_topic_words():
function test_generate_wordclouds_for_document_topics (line 44) | def test_generate_wordclouds_for_document_topics():
function test_write_wordclouds_to_folder (line 75) | def test_write_wordclouds_to_folder(tmpdir):
function test_plot_doc_topic_heatmap (line 104) | def test_plot_doc_topic_heatmap(doc_topic, make_topic_labels):
function test_plot_topic_word_heatmap (line 125) | def test_plot_topic_word_heatmap(topic_word):
FILE: tests/test_utils.py
function test_enable_disable_logging (line 31) | def test_enable_disable_logging(caplog, level, fmt):
function test_pickle_unpickle (line 80) | def test_pickle_unpickle():
function test_path_split (line 91) | def test_path_split():
function test_read_text_file (line 106) | def test_read_text_file():
function test_linebreaks_win2unix (line 119) | def test_linebreaks_win2unix(text):
function test_empty_chararray (line 126) | def test_empty_chararray():
function test_as_chararray (line 136) | def test_as_chararray(x, as_numpy_array):
function test_dict2df (line 154) | def test_dict2df(data, key_name, value_name, sort, asc):
function test_applychain (line 197) | def test_applychain(expected, funcs, initial_arg):
function test_flatten_list (line 212) | def test_flatten_list(l):
function test_mat2d_window_from_indices (line 225) | def test_mat2d_window_from_indices(mat, n_row_indices, n_col_indices, co...
function test_merge_dicts (line 272) | def test_merge_dicts(dicts, sort_keys, safe):
function test_merge_sets (line 300) | def test_merge_sets(sets, safe):
function test_sample_dict (line 317) | def test_sample_dict(d, n):
function test_greedy_partitioning (line 333) | def test_greedy_partitioning(elems_dict, k):
function test_combine_sparse_matrices_columnwise (line 355) | def test_combine_sparse_matrices_columnwise():
function test_split_func_args (line 467) | def test_split_func_args(testfn, testargs, expargs1, expargs2):
FILE: tmtoolkit/__main__.py
function _setup (line 44) | def _setup(args):
function _help (line 114) | def _help(args):
FILE: tmtoolkit/bow/bow_stats.py
function doc_lengths (line 13) | def doc_lengths(dtm):
function doc_frequencies (line 31) | def doc_frequencies(dtm, min_val=1, proportions=0):
function word_cooccurrence (line 58) | def word_cooccurrence(dtm, min_val=1, proportions=0):
function codoc_frequencies (line 66) | def codoc_frequencies(dtm, min_val=1, proportions=0):
function term_frequencies (line 102) | def term_frequencies(dtm, proportions=0):
function tf_binary (line 133) | def tf_binary(dtm):
function tf_proportions (line 147) | def tf_proportions(dtm):
function tf_log (line 173) | def tf_log(dtm, log_fn=np.log1p):
function tf_double_norm (line 196) | def tf_double_norm(dtm, K=0.5):
function idf (line 219) | def idf(dtm, smooth_log=1, smooth_df=1):
function idf_probabilistic (line 248) | def idf_probabilistic(dtm, smooth=1):
function tfidf (line 274) | def tfidf(dtm, tf_func=tf_proportions, idf_func=idf, **kwargs):
function sorted_terms (line 314) | def sorted_terms(mat, vocab, lo_thresh=0, hi_tresh=None, top_n=None, asc...
function sorted_terms_table (line 419) | def sorted_terms_table(mat, vocab, doc_labels, lo_thresh=0, hi_tresh=Non...
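The bow_stats functions listed above operate directly on a (dense or sparse) document-term matrix. A minimal usage sketch, not taken from the repository; the toy DTM, vocabulary and document labels are invented for illustration:

    import numpy as np
    from tmtoolkit.bow.bow_stats import doc_frequencies, tfidf, sorted_terms_table

    vocab = np.array(['bird', 'cat', 'dog'])
    doc_labels = ['doc1', 'doc2']
    # rows are documents, columns are vocabulary terms
    dtm = np.array([[0, 1, 2],
                    [3, 1, 0]])

    print(doc_frequencies(dtm))          # number of documents each term occurs in
    tfidf_mat = tfidf(dtm)               # defaults per the signature above: tf_proportions weighted by idf
    print(sorted_terms_table(tfidf_mat, vocab, doc_labels, top_n=2))   # top-2 tf-idf terms per document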
FILE: tmtoolkit/bow/dtm.py
function create_sparse_dtm (line 15) | def create_sparse_dtm(vocab, docs, n_unique_tokens, vocab_is_sorted=Fals...
function dtm_to_dataframe (line 84) | def dtm_to_dataframe(dtm, doc_labels, vocab):
function dtm_to_gensim_corpus (line 112) | def dtm_to_gensim_corpus(dtm):
function gensim_corpus_to_dtm (line 141) | def gensim_corpus_to_dtm(corpus):
function dtm_and_vocab_to_gensim_corpus_and_dict (line 157) | def dtm_and_vocab_to_gensim_corpus_and_dict(dtm, vocab, as_gensim_dictio...
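The dtm module converts a document-term matrix into other representations. A small sketch reusing the toy dtm, doc_labels and vocab from the previous example; dtm_to_gensim_corpus additionally requires the optional Gensim dependency:

    from tmtoolkit.bow.dtm import dtm_to_dataframe, dtm_to_gensim_corpus

    df = dtm_to_dataframe(dtm, doc_labels, vocab)   # DataFrame with documents as rows, terms as columns
    gensim_corpus = dtm_to_gensim_corpus(dtm)       # bag-of-words corpus usable with Gensim models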
FILE: tmtoolkit/corpus/_common.py
function simplified_pos (line 81) | def simplified_pos(pos: str, tagset: str = 'ud', default: str = '') -> str:
FILE: tmtoolkit/corpus/_corpus.py
class Corpus (line 33) | class Corpus:
method __init__ (line 87) | def __init__(self, docs: Optional[Union[Dict[str, str], Sequence[Docum...
method __str__ (line 300) | def __str__(self) -> str:
method __repr__ (line 304) | def __repr__(self) -> str:
method __len__ (line 314) | def __len__(self) -> int:
method __getitem__ (line 322) | def __getitem__(self, k: Union[str, int, slice]) -> Union[Document, Li...
method __setitem__ (line 341) | def __setitem__(self, doc_label: str, doc: Union[str, Doc, Document]):
method __delitem__ (line 375) | def __delitem__(self, doc_label):
method __iter__ (line 393) | def __iter__(self) -> Iterator[str]:
method __contains__ (line 397) | def __contains__(self, doc_label) -> bool:
method __copy__ (line 406) | def __copy__(self) -> Corpus:
method __deepcopy__ (line 414) | def __deepcopy__(self, memodict=None) -> Corpus:
method items (line 422) | def items(self) -> ItemsView[str, Document]:
method keys (line 430) | def keys(self) -> KeysView[str]:
method values (line 438) | def values(self) -> ValuesView[Document]:
method get (line 446) | def get(self, *args) -> Document:
method update (line 454) | def update(self, new_docs: Union[Dict[str, Union[str, Doc, Document]],...
method uses_unigrams (line 491) | def uses_unigrams(self) -> bool:
method spacy_token_attrs (line 496) | def spacy_token_attrs(self) -> Tuple[str, ...]:
method token_attrs (line 503) | def token_attrs(self) -> Tuple[str, ...]:
method custom_token_attrs_defaults (line 510) | def custom_token_attrs_defaults(self) -> Dict[str, Any]:
method doc_attrs (line 515) | def doc_attrs(self) -> Tuple[str, ...]:
method doc_attrs_defaults (line 520) | def doc_attrs_defaults(self) -> Dict[str, Any]:
method ngrams (line 525) | def ngrams(self) -> int:
method ngrams_join_str (line 530) | def ngrams_join_str(self) -> str:
method language (line 535) | def language(self) -> str:
method language_model (line 543) | def language_model(self) -> str:
method has_sents (line 551) | def has_sents(self) -> bool:
method doc_labels (line 556) | def doc_labels(self) -> List[str]:
method n_docs (line 561) | def n_docs(self) -> int:
method workers_docs (line 566) | def workers_docs(self) -> List[List[str]]:
method max_workers (line 575) | def max_workers(self):
method max_workers (line 580) | def max_workers(self, max_workers):
method from_files (line 622) | def from_files(cls, files: Union[str, Collection[str], Dict[str, str]]...
method from_folder (line 635) | def from_folder(cls, folder: str, **kwargs) -> Corpus:
method from_tabular (line 648) | def from_tabular(cls, files: Union[str, Collection[str]], **kwargs) ->...
method from_zip (line 663) | def from_zip(cls, zipfile: str, **kwargs) -> Corpus:
method from_builtin_corpus (line 677) | def from_builtin_corpus(cls, corpus_label, **kwargs) -> Corpus:
method _nlppipe (line 701) | def _nlppipe(self, docs: ValuesView[str]) -> Union[Iterator[Doc], Gene...
method _init_bimaps (line 712) | def _init_bimaps(self):
method _init_docs (line 720) | def _init_docs(self, docs: Dict[str, str]):
method _init_document (line 750) | def _init_document(self, spacydoc: Doc, label: str):
method _update_bimaps (line 779) | def _update_bimaps(self, which_docs: Union[str, Optional[Collection[st...
method _update_workers_docs (line 844) | def _update_workers_docs(self, based_on_docs=None):
method _serialize (line 859) | def _serialize(self, deepcopy_attrs: bool, store_nlp_instance_pointer:...
method _deserialize (line 921) | def _deserialize(cls, data: Dict[str, Any]) -> Corpus:
method _construct_from_func (line 960) | def _construct_from_func(cls, add_fn: Callable, *args, **kwargs) -> Co...
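A Corpus wraps a set of documents processed with a spaCy pipeline and behaves like a dict of Document objects. A minimal construction sketch, assuming the small English spaCy model (en_core_web_sm) is installed; the texts are invented:

    from tmtoolkit.corpus import Corpus

    corp = Corpus({'doc1': 'The dogs chased the cat.',
                   'doc2': 'A bird sang in the garden.'},
                  language='en')

    print(len(corp))          # number of documents
    print(corp.doc_labels)    # ['doc1', 'doc2']
    print(corp['doc1'])       # Document object for 'doc1'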
FILE: tmtoolkit/corpus/_corpusfuncs.py
class ParallelTask (line 61) | class ParallelTask:
function _paralleltask (line 72) | def _paralleltask(corpus: Corpus, tokens: Dict[str, Any], force_serialpr...
function parallelexec (line 81) | def parallelexec(collect_fn: Callable) -> Callable[[CorpusFunc], Callable]:
function corpus_func_inplace_opt (line 144) | def corpus_func_inplace_opt(fn: Callable) -> Callable:
function tabular_result_option (line 190) | def tabular_result_option(key: str, value: str) -> Callable:
function corpus_func_update_bimaps (line 232) | def corpus_func_update_bimaps(which_attrs: Union[str, Optional[Collectio...
function doc_tokens (line 269) | def doc_tokens(docs: Corpus,
function doc_lengths (line 438) | def doc_lengths(docs: Corpus, select: Optional[Union[str, Collection[str...
function doc_token_lengths (line 458) | def doc_token_lengths(docs: Corpus, select: Optional[Union[str, Collecti...
function doc_num_sents (line 486) | def doc_num_sents(docs: Corpus, select: Optional[Union[str, Collection[s...
function doc_sent_lengths (line 517) | def doc_sent_lengths(docs: Corpus, select: Optional[Union[str, Collectio...
function doc_labels (line 544) | def doc_labels(docs: Corpus, sort: bool = True) -> List[str]:
function doc_labels_sample (line 558) | def doc_labels_sample(docs: Corpus, n: int) -> Set[str]:
function doc_texts (line 572) | def doc_texts(docs: Corpus, select: Optional[Union[str, Collection[str]]...
function doc_frequencies (line 611) | def doc_frequencies(docs: Corpus, select: Optional[Union[str, Collection...
function doc_vectors (line 671) | def doc_vectors(docs: Union[Corpus, Dict[str, Doc]], select: Optional[Un...
function token_vectors (line 693) | def token_vectors(docs: Union[Corpus, Dict[str, Doc]], select: Optional[...
function spacydocs (line 719) | def spacydocs(docs: Corpus, select: Optional[Union[str, Collection[str]]...
function vocabulary (line 758) | def vocabulary(docs: Corpus, select: Optional[Union[str, Collection[str]...
function vocabulary_counts (line 799) | def vocabulary_counts(docs: Corpus, select: Optional[Union[str, Collecti...
function vocabulary_size (line 849) | def vocabulary_size(docs: Union[Corpus, Dict[str, List[str]]], select: O...
function tokens_table (line 863) | def tokens_table(docs: Corpus,
function corpus_tokens_flattened (line 982) | def corpus_tokens_flattened(docs: Corpus, select: Optional[Union[str, Co...
function corpus_num_tokens (line 1022) | def corpus_num_tokens(docs: Corpus, select: Optional[Union[str, Collecti...
function corpus_num_chars (line 1033) | def corpus_num_chars(docs: Corpus, select: Optional[Union[str, Collectio...
function corpus_unique_chars (line 1044) | def corpus_unique_chars(docs: Corpus, select: Optional[Union[str, Collec...
function corpus_collocations (line 1055) | def corpus_collocations(docs: Corpus,
function corpus_summary (line 1139) | def corpus_summary(docs: Corpus,
function print_summary (line 1196) | def print_summary(docs: Corpus,
function dtm (line 1214) | def dtm(docs: Corpus, select: Optional[Union[str, Collection[str]]] = No...
function ngrams (line 1288) | def ngrams(docs: Corpus, n: int, select: Optional[Union[str, Collection[...
function kwic (line 1316) | def kwic(docs: Corpus, search_tokens: Any, context_size: Union[int, Tupl...
function kwic_table (line 1409) | def kwic_table(docs: Corpus, search_tokens: Any, context_size: Union[int...
function corpus_add_files (line 1489) | def corpus_add_files(docs: Corpus, files: Union[str, Collection[str], Di...
function corpus_add_folder (line 1535) | def corpus_add_folder(docs: Corpus, folder: str, valid_extensions: Colle...
function corpus_add_tabular (line 1614) | def corpus_add_tabular(docs: Corpus, files: Union[str, Collection[str]],
function corpus_add_zip (line 1659) | def corpus_add_zip(docs: Corpus, zipfile: str, valid_extensions: Collect...
function save_corpus_to_picklefile (line 1760) | def save_corpus_to_picklefile(docs: Corpus, picklefile: str) -> None:
function load_corpus_from_picklefile (line 1774) | def load_corpus_from_picklefile(picklefile: str) -> Corpus:
function load_corpus_from_tokens (line 1790) | def load_corpus_from_tokens(tokens: Dict[str, Any],
function load_corpus_from_tokens_table (line 1831) | def load_corpus_from_tokens_table(tokens: pd.DataFrame,
function serialize_corpus (line 1883) | def serialize_corpus(docs: Corpus, deepcopy_attrs: bool = True) -> Dict[...
function deserialize_corpus (line 1896) | def deserialize_corpus(serialized_corpus_data: dict) -> Corpus:
function set_document_attr (line 1912) | def set_document_attr(docs: Corpus, /, attrname: str, data: Dict[str, An...
function remove_document_attr (line 1940) | def remove_document_attr(docs: Corpus, /, attrname: str, inplace: bool =...
function set_token_attr (line 1964) | def set_token_attr(docs: Corpus, /, attrname: str, data: Dict[str, Any],...
function remove_token_attr (line 2027) | def remove_token_attr(docs: Corpus, /, attrname: str, inplace: bool = Tr...
function corpus_retokenize (line 2055) | def corpus_retokenize(docs: Corpus, collapse: Optional[str] = ' ', inpla...
function transform_tokens (line 2098) | def transform_tokens(docs: Corpus, /, func: Callable, select: Optional[U...
function to_lowercase (line 2148) | def to_lowercase(docs: Corpus, /, select: Optional[Union[str, Collection...
function to_uppercase (line 2161) | def to_uppercase(docs: Corpus, /, select: Optional[Union[str, Collection...
function remove_chars (line 2174) | def remove_chars(docs: Corpus, /, chars: Iterable[str], select: Optional...
function remove_punctuation (line 2189) | def remove_punctuation(docs: Corpus, /, select: Optional[Union[str, Coll...
function normalize_unicode (line 2204) | def normalize_unicode(docs: Corpus, /, select: Optional[Union[str, Colle...
function simplify_unicode (line 2222) | def simplify_unicode(docs: Corpus, /, select: Optional[Union[str, Collec...
function numbers_to_magnitudes (line 2248) | def numbers_to_magnitudes(docs: Corpus, /, select: Optional[Union[str, C...
function lemmatize (line 2294) | def lemmatize(docs: Corpus, /, select: Optional[Union[str, Collection[st...
function join_collocations_by_patterns (line 2324) | def join_collocations_by_patterns(docs: Corpus, /, patterns: Sequence[str],
function join_collocations_by_statistic (line 2414) | def join_collocations_by_statistic(docs: Corpus, /, threshold: float,
function filter_tokens_by_mask (line 2522) | def filter_tokens_by_mask(docs: Corpus, /, mask: Dict[str, Union[List[bo...
function remove_tokens_by_mask (line 2570) | def remove_tokens_by_mask(docs: Corpus, /, mask: Dict[str, Union[List[bo...
function filter_tokens (line 2587) | def filter_tokens(docs: Corpus, /, search_tokens: Any, by_attr: Optional...
function remove_tokens (line 2634) | def remove_tokens(docs: Corpus, /, search_tokens: Any, by_attr: Optional...
function filter_for_pos (line 2664) | def filter_for_pos(docs: Corpus, /, search_pos: Union[str, Collection[st...
function filter_tokens_by_doc_frequency (line 2704) | def filter_tokens_by_doc_frequency(docs: Corpus, /, which: str, df_thres...
function remove_common_tokens (line 2752) | def remove_common_tokens(docs: Corpus, /, df_threshold: Union[int, float...
function remove_uncommon_tokens (line 2769) | def remove_uncommon_tokens(docs: Corpus, /, df_threshold: Union[int, flo...
function filter_documents_by_mask (line 2788) | def filter_documents_by_mask(docs: Corpus, /, mask: Dict[str, bool], inv...
function remove_documents_by_mask (line 2819) | def remove_documents_by_mask(docs: Corpus, /, mask: Dict[str, bool], inp...
function find_documents (line 2835) | def find_documents(docs: Corpus, /, search_tokens: Any, by_attr: Optiona...
function filter_documents (line 2886) | def filter_documents(docs: Corpus, /, search_tokens: Any, by_attr: Optio...
function remove_documents (line 2935) | def remove_documents(docs: Corpus, /, search_tokens: Any, by_attr: Optio...
function filter_documents_by_docattr (line 2967) | def filter_documents_by_docattr(docs: Corpus, /, search_tokens: Any, by_...
function remove_documents_by_docattr (line 3005) | def remove_documents_by_docattr(docs: Corpus, /, search_tokens: Any, by_...
function filter_documents_by_label (line 3033) | def filter_documents_by_label(docs: Corpus, /, search_tokens: Any, match...
function remove_documents_by_label (line 3063) | def remove_documents_by_label(docs: Corpus, /, search_tokens: Any, match...
function filter_documents_by_length (line 3091) | def filter_documents_by_length(docs: Corpus, /, relation: str, threshold...
function remove_documents_by_length (line 3116) | def remove_documents_by_length(docs: Corpus, /, relation: str, threshold...
function filter_clean_tokens (line 3132) | def filter_clean_tokens(docs: Corpus, /,
function filter_tokens_with_kwic (line 3273) | def filter_tokens_with_kwic(docs: Corpus, /, search_tokens: Any,
function corpus_ngramify (line 3330) | def corpus_ngramify(docs: Corpus, /, n: int, join_str: str = ' ', inplac...
function corpus_sample (line 3347) | def corpus_sample(docs: Corpus, /, n: int, inplace: bool = True) -> Opti...
function corpus_split_by_paragraph (line 3370) | def corpus_split_by_paragraph(docs: Corpus, /, paragraph_linebreaks: int...
function corpus_split_by_token (line 3400) | def corpus_split_by_token(docs: Corpus, /, split: str, new_doc_label_fmt...
function corpus_join_documents (line 3477) | def corpus_join_documents(docs: Corpus, /, join: Dict[str, Union[str, Li...
function builtin_corpora_info (line 3610) | def builtin_corpora_info(with_paths: bool = False) -> Union[List[str], D...
function _filter_documents (line 3637) | def _filter_documents(chunk, search_tokens, match_type, ignore_case, glo...
function _build_kwic_parallel (line 3664) | def _build_kwic_parallel(docs, search_tokens, context_size, by_attr, mat...
function _finalize_kwic_results (line 3745) | def _finalize_kwic_results(kwic_results, only_non_empty, glue, as_tables...
function _create_embed_tokens_for_collocations (line 3807) | def _create_embed_tokens_for_collocations(docs: Corpus, embed_tokens_min...
function _apply_collocations (line 3839) | def _apply_collocations(tokenmat: np.ndarray,
function _comparison_operator_from_str (line 3884) | def _comparison_operator_from_str(which: str, common_alias=False, equal=...
function _match_against (line 3910) | def _match_against(docs: Union[Corpus, Dict[str, Document]], by_attr: st...
function _check_filter_args (line 3918) | def _check_filter_args(**kwargs):
function _token_pattern_matches (line 3927) | def _token_pattern_matches(tokens: Dict[str, List[Any]], search_tokens: ...
function _load_text_from_files (line 3951) | def _load_text_from_files(files: Collection[str],
function _load_text_from_tabular_files (line 4006) | def _load_text_from_tabular_files(files: Union[str, Collection[str]],
function _spacydocs_for_vectors (line 4088) | def _spacydocs_for_vectors(docs, select, collapse):
function _single_str_to_set (line 4107) | def _single_str_to_set(select: Optional[Union[str, Collection[str]]], ch...
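The functions in _corpusfuncs.py form the functional API around Corpus objects; most transformation functions modify the corpus in place by default. A short preprocessing sketch, not from the repository, continuing with the corp object from the Corpus example above:

    from tmtoolkit.corpus import (doc_tokens, lemmatize, to_lowercase,
                                  filter_clean_tokens, vocabulary, print_summary, dtm)

    lemmatize(corp)                      # replace tokens by their lemmata
    to_lowercase(corp)
    filter_clean_tokens(corp)            # remove punctuation and stopwords by default

    print_summary(corp)                  # overview of documents and tokens
    print(doc_tokens(corp)['doc1'])      # token list of a single document
    print(vocabulary(corp))              # vocabulary of the processed corpus
    mat = dtm(corp)                      # sparse document-term matrix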
FILE: tmtoolkit/corpus/_document.py
class Document (line 23) | class Document:
method __init__ (line 36) | def __init__(self, bimaps: Optional[Dict[str, bidict]], label: str, ha...
method __len__ (line 85) | def __len__(self) -> int:
method __repr__ (line 93) | def __repr__(self) -> str:
method __str__ (line 102) | def __str__(self) -> str:
method __getitem__ (line 110) | def __getitem__(self, attr: str) -> list:
method __setitem__ (line 122) | def __setitem__(self, attr: str, values: Union[Sequence, np.ndarray]):
method __delitem__ (line 158) | def __delitem__(self, attr: str):
method __copy__ (line 172) | def __copy__(self) -> Document:
method label (line 181) | def label(self) -> str:
method has_sents (line 186) | def has_sents(self) -> bool:
method token_attrs (line 195) | def token_attrs(self) -> List[str]:
method _serialize (line 203) | def _serialize(self, store_bimaps_pointer: bool) -> Dict[str, Any]:
method _deserialize (line 220) | def _deserialize(cls, data: Dict[str, Any], **kwargs) -> Document:
function document_token_attr (line 246) | def document_token_attr(d: Document,
function document_from_attrs (line 394) | def document_from_attrs(bimaps: Dict[str, bidict],
function _chop_along_sentences (line 533) | def _chop_along_sentences(tok: Union[list, np.ndarray],
FILE: tmtoolkit/corpus/_nltk_extras.py
function stem (line 14) | def stem(docs: Corpus, /, language: Optional[str] = None,
FILE: tmtoolkit/corpus/visualize.py
function plot_doc_lengths_hist (line 27) | def plot_doc_lengths_hist(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function plot_vocab_counts_hist (line 66) | def plot_vocab_counts_hist(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function plot_doc_frequencies_hist (line 106) | def plot_doc_frequencies_hist(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function plot_num_sents_hist (line 151) | def plot_num_sents_hist(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function plot_sent_lengths_hist (line 189) | def plot_sent_lengths_hist(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function plot_token_lengths_hist (line 227) | def plot_token_lengths_hist(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function plot_num_sents_vs_sent_length (line 265) | def plot_num_sents_vs_sent_length(fig: plt.Figure, ax: plt.Axes, docs: C...
function plot_ranked_vocab_counts (line 339) | def plot_ranked_vocab_counts(fig: plt.Figure, ax: plt.Axes, docs: Corpus,
function _add_axis_scale_info (line 443) | def _add_axis_scale_info(axislbl: str, log: bool):
function _plot_hist (line 450) | def _plot_hist(fig: plt.Figure, ax: plt.Axes, x: np.ndarray,
FILE: tmtoolkit/tokenseq.py
function numbertoken_to_magnitude (line 35) | def numbertoken_to_magnitude(numbertoken: str, char: str = '0', firstcha...
function simplify_unicode_chars (line 90) | def simplify_unicode_chars(token: str, method: str = 'icu', ascii_encodi...
function strip_tags (line 126) | def strip_tags(value: str) -> str:
function unique_chars (line 151) | def unique_chars(tokens: Iterable[str]) -> Set[str]:
function token_lengths (line 164) | def token_lengths(tokens: Union[Iterable[str], np.ndarray]) -> List[int]:
function collapse_tokens (line 174) | def collapse_tokens(tokens: Union[Iterable[str], np.ndarray], collapse: ...
function pmi (line 186) | def pmi(x: np.ndarray, y: np.ndarray, xy: np.ndarray, n_total: Optional[...
function simple_collocation_counts (line 234) | def simple_collocation_counts(x: Optional[np.ndarray], y: Optional[np.nd...
function token_collocations (line 248) | def token_collocations(sentences: List[List[StrOrInt]], threshold: Optio...
function token_match (line 357) | def token_match(pattern: Any, tokens: Union[List[str], np.ndarray],
function token_match_multi_pattern (line 422) | def token_match_multi_pattern(search_tokens: Any, tokens: Union[List[str...
function token_match_subsequent (line 452) | def token_match_subsequent(patterns: Sequence, tokens: Union[list, np.nd...
function token_join_subsequent (line 524) | def token_join_subsequent(tokens: Union[List[str], np.ndarray], matches:...
function token_ngrams (line 622) | def token_ngrams(tokens: Sequence, n: int, join: bool = True, join_str: ...
function index_windows_around_matches (line 693) | def index_windows_around_matches(matches: np.ndarray, left: int, right: ...
class _MLStripper (line 757) | class _MLStripper(HTMLParser):
method __init__ (line 761) | def __init__(self):
method handle_data (line 766) | def handle_data(self, d):
method get_data (line 769) | def get_data(self):
function _strip_once (line 773) | def _strip_once(value):
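tmtoolkit.tokenseq contains token-sequence helpers that work on plain lists or arrays of tokens, independently of a Corpus. A small sketch with invented tokens:

    from tmtoolkit.tokenseq import token_match, token_ngrams

    tokens = ['The', 'quick', 'brown', 'fox']

    print(token_match('qu*', tokens, match_type='glob'))        # boolean mask; True only for 'quick'
    print(token_ngrams(tokens, n=2, join=True, join_str=' '))   # ['The quick', 'quick brown', 'brown fox']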
FILE: tmtoolkit/topicmod/_eval_tools.py
function split_dtm_for_cross_validation (line 11) | def split_dtm_for_cross_validation(dtm, n_folds, shuffle_docs=True):
class FakedGensimDict (line 59) | class FakedGensimDict:
method __init__ (line 63) | def __init__(self, data):
method from_vocab (line 71) | def from_vocab(vocab):
method __iter__ (line 74) | def __iter__(self):
method keys (line 78) | def keys(self):
FILE: tmtoolkit/topicmod/evaluate.py
function metric_held_out_documents_wallach09 (line 25) | def metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train,...
function metric_cao_juan_2009 (line 140) | def metric_cao_juan_2009(topic_word_distrib):
function metric_arun_2010 (line 158) | def metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths):
function metric_griffiths_2004 (line 196) | def metric_griffiths_2004(logliks):
function metric_coherence_mimno_2011 (line 226) | def metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, eps=1...
function metric_coherence_gensim (line 302) | def metric_coherence_gensim(measure, topic_word_distrib=None, gensim_mod...
function results_by_parameter (line 427) | def results_by_parameter(res, param, sort_by=None, sort_desc=False):
FILE: tmtoolkit/topicmod/model_io.py
function ldamodel_top_topic_words (line 20) | def ldamodel_top_topic_words(topic_word_distrib, vocab, top_n=10, val_fm...
function ldamodel_top_word_topics (line 55) | def ldamodel_top_word_topics(topic_word_distrib, vocab, top_n=10, val_fm...
function ldamodel_top_doc_topics (line 91) | def ldamodel_top_doc_topics(doc_topic_distrib, doc_labels, top_n=3, val_...
function ldamodel_top_topic_docs (line 127) | def ldamodel_top_topic_docs(doc_topic_distrib, doc_labels, top_n=3, val_...
function ldamodel_full_topic_words (line 164) | def ldamodel_full_topic_words(topic_word_distrib, vocab, colname_rowinde...
function ldamodel_full_doc_topics (line 191) | def ldamodel_full_doc_topics(doc_topic_distrib, doc_labels, colname_rowi...
function print_ldamodel_distribution (line 219) | def print_ldamodel_distribution(distrib, row_labels, val_labels, top_n=10):
function print_ldamodel_topic_words (line 245) | def print_ldamodel_topic_words(topic_word_distrib, vocab, top_n=10, row_...
function print_ldamodel_doc_topics (line 262) | def print_ldamodel_doc_topics(doc_topic_distrib, doc_labels, top_n=3, va...
function save_ldamodel_summary_to_excel (line 280) | def save_ldamodel_summary_to_excel(excel_file, topic_word_distrib, doc_t...
function save_ldamodel_to_pickle (line 382) | def save_ldamodel_to_pickle(picklefile, model, vocab, doc_labels, dtm=No...
function load_ldamodel_from_pickle (line 398) | def load_ldamodel_from_pickle(picklefile, **kwargs):
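The model_io functions turn raw topic model outputs (topic-word and document-topic distributions) into labelled tables. A minimal sketch with an invented two-topic model:

    import numpy as np
    from tmtoolkit.topicmod.model_io import ldamodel_top_topic_words

    vocab = np.array(['economy', 'sports', 'music'])
    topic_word = np.array([[0.6, 0.3, 0.1],     # rows: topics
                           [0.1, 0.2, 0.7]])    # columns: vocabulary terms

    print(ldamodel_top_topic_words(topic_word, vocab, top_n=2))   # top-2 words per topic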
FILE: tmtoolkit/topicmod/model_stats.py
function marginal_topic_distrib (line 24) | def marginal_topic_distrib(doc_topic_distrib, doc_lengths):
function marginal_word_distrib (line 39) | def marginal_word_distrib(topic_word_distrib, p_t):
function most_probable_words (line 52) | def most_probable_words(vocab, topic_word_distrib, doc_topic_distrib, do...
function least_probable_words (line 71) | def least_probable_words(vocab, topic_word_distrib, doc_topic_distrib, d...
function _words_by_marginal_word_prob (line 90) | def _words_by_marginal_word_prob(vocab, topic_word_distrib, doc_topic_di...
function _words_by_score (line 100) | def _words_by_score(words, score, least_to_most, n=None):
function word_saliency (line 126) | def word_saliency(topic_word_distrib, doc_topic_distrib, doc_lengths):
function _words_by_salience_score (line 142) | def _words_by_salience_score(vocab, topic_word_distrib, doc_topic_distri...
function most_salient_words (line 148) | def most_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc...
function least_salient_words (line 166) | def least_salient_words(vocab, topic_word_distrib, doc_topic_distrib, do...
function word_distinctiveness (line 187) | def word_distinctiveness(topic_word_distrib, p_t):
function _words_by_distinctiveness_score (line 205) | def _words_by_distinctiveness_score(vocab, topic_word_distrib, doc_topic...
function most_distinct_words (line 214) | def most_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, do...
function least_distinct_words (line 232) | def least_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, d...
function topic_word_relevance (line 254) | def topic_word_relevance(topic_word_distrib, doc_topic_distrib, doc_leng...
function _check_relevant_words_for_topic_args (line 279) | def _check_relevant_words_for_topic_args(vocab, rel_mat, topic):
function most_relevant_words_for_topic (line 290) | def most_relevant_words_for_topic(vocab, rel_mat, topic, n=None):
function least_relevant_words_for_topic (line 307) | def least_relevant_words_for_topic(vocab, rel_mat, topic, n=None):
function generate_topic_labels_from_top_words (line 327) | def generate_topic_labels_from_top_words(topic_word_distrib, doc_topic_d...
function top_n_from_distribution (line 384) | def top_n_from_distribution(distrib, top_n=10, row_labels=None, col_labe...
function top_words_for_topics (line 454) | def top_words_for_topics(topic_word_distrib, top_n=None, vocab=None, ret...
function _join_value_and_label_dfs (line 514) | def _join_value_and_label_dfs(vals, labels, top_n, val_fmt=None, row_lab...
function filter_topics (line 555) | def filter_topics(search_pattern, vocab, topic_word_distrib, top_n=None,...
function exclude_topics (line 647) | def exclude_topics(excl_topic_indices, doc_topic_distrib, topic_word_dis...
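model_stats implements summary statistics such as the marginal topic distribution, i.e. the document-topic distribution averaged over documents and weighted by document length. A small worked sketch with made-up numbers:

    import numpy as np
    from tmtoolkit.topicmod.model_stats import marginal_topic_distrib

    doc_topic = np.array([[0.9, 0.1],       # rows: documents, columns: topics
                          [0.2, 0.8]])
    doc_lengths = np.array([100, 300])

    # length-weighted average: (100*0.9 + 300*0.2) / 400 = 0.375 and (100*0.1 + 300*0.8) / 400 = 0.625
    print(marginal_topic_distrib(doc_topic, doc_lengths))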
FILE: tmtoolkit/topicmod/parallel.py
class MultiprocModelsRunner (line 27) | class MultiprocModelsRunner:
method __init__ (line 32) | def __init__(self, worker_class, data, varying_parameters=None, consta...
method __del__ (line 78) | def __del__(self):
method shutdown_workers (line 82) | def shutdown_workers(self):
method run (line 99) | def run(self):
method _setup_workers (line 173) | def _setup_workers(self, worker_class):
method _new_worker (line 187) | def _new_worker(self, worker_class, i, task_queue, results_queue, data):
method _prepare_data (line 192) | def _prepare_data(data):
class MultiprocModelsWorkerABC (line 226) | class MultiprocModelsWorkerABC(mp.Process):
method __init__ (line 233) | def __init__(self, worker_id, tasks_queue, results_queue, data,
method run (line 270) | def run(self):
method fit_model (line 290) | def fit_model(self, data, params):
method send_results (line 300) | def send_results(self, doc, params, results):
class MultiprocEvaluationRunner (line 314) | class MultiprocEvaluationRunner(MultiprocModelsRunner):
method __init__ (line 319) | def __init__(self, worker_class, available_metrics, data, varying_para...
method _new_worker (line 369) | def _new_worker(self, worker_class, i, task_queue, results_queue, data):
class MultiprocEvaluationWorkerABC (line 375) | class MultiprocEvaluationWorkerABC(MultiprocModelsWorkerABC):
method __init__ (line 380) | def __init__(self, worker_id,
function _merge_params (line 414) | def _merge_params(varying_parameters, constant_parameters):
FILE: tmtoolkit/topicmod/tm_gensim.py
class MultiprocModelsWorkerGensim (line 44) | class MultiprocModelsWorkerGensim(MultiprocModelsWorkerABC):
method fit_model (line 51) | def fit_model(self, data, params, return_data=False):
class MultiprocEvaluationWorkerGensim (line 77) | class MultiprocEvaluationWorkerGensim(MultiprocEvaluationWorkerABC, Mult...
method fit_model (line 82) | def fit_model(self, data, params, return_data=False):
function compute_models_parallel (line 154) | def compute_models_parallel(data, varying_parameters=None, constant_para...
function evaluate_topic_models (line 182) | def evaluate_topic_models(data, varying_parameters, constant_parameters=...
function _get_model_perplexity (line 225) | def _get_model_perplexity(model, eval_corpus):
FILE: tmtoolkit/topicmod/tm_lda.py
class MultiprocModelsWorkerLDA (line 57) | class MultiprocModelsWorkerLDA(MultiprocModelsWorkerABC):
method fit_model (line 64) | def fit_model(self, data, params):
class MultiprocEvaluationWorkerLDA (line 72) | class MultiprocEvaluationWorkerLDA(MultiprocEvaluationWorkerABC, Multipr...
method fit_model (line 77) | def fit_model(self, data, params):
function compute_models_parallel (line 179) | def compute_models_parallel(data, varying_parameters=None, constant_para...
function evaluate_topic_models (line 207) | def evaluate_topic_models(data, varying_parameters, constant_parameters=...
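tm_lda (and its tm_gensim / tm_sklearn counterparts, which expose the same two entry points) fits and evaluates topic models across a parameter grid in parallel. A hedged sketch, assuming the optional lda package is installed and that dtm is a document-term matrix such as the one produced in the corpus example above; the parameter grid is chosen for illustration and the metric names correspond to the metric_* functions listed under evaluate.py:

    from tmtoolkit.topicmod.tm_lda import evaluate_topic_models
    from tmtoolkit.topicmod.evaluate import results_by_parameter

    eval_results = evaluate_topic_models(
        dtm,                                                   # document-term matrix
        varying_parameters=[{'n_topics': k} for k in range(5, 31, 5)],
        constant_parameters={'n_iter': 1000},
        metric=['cao_juan_2009', 'arun_2010'],
    )
    results_by_n_topics = results_by_parameter(eval_results, 'n_topics')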
FILE: tmtoolkit/topicmod/tm_sklearn.py
class MultiprocModelsWorkerSklearn (line 63) | class MultiprocModelsWorkerSklearn(MultiprocModelsWorkerABC):
method fit_model (line 70) | def fit_model(self, data, params, return_data=False):
class MultiprocEvaluationWorkerSklearn (line 88) | class MultiprocEvaluationWorkerSklearn(MultiprocEvaluationWorkerABC, Mul...
method fit_model (line 93) | def fit_model(self, data, params, return_data=False):
function compute_models_parallel (line 182) | def compute_models_parallel(data, varying_parameters=None, constant_para...
function evaluate_topic_models (line 211) | def evaluate_topic_models(data, varying_parameters, constant_parameters=...
function _get_normalized_topic_word_distrib (line 254) | def _get_normalized_topic_word_distrib(lda_instance):
FILE: tmtoolkit/topicmod/visualize.py
function _wordcloud_color_func_black (line 26) | def _wordcloud_color_func_black(word, font_size, position, orientation, ...
function write_wordclouds_to_folder (line 40) | def write_wordclouds_to_folder(wordclouds, folder, file_name_fmt='{label...
function generate_wordclouds_for_topic_words (line 61) | def generate_wordclouds_for_topic_words(topic_word_distrib, vocab, top_n...
function generate_wordclouds_for_document_topics (line 85) | def generate_wordclouds_for_document_topics(doc_topic_distrib, doc_label...
function generate_wordclouds_from_distribution (line 110) | def generate_wordclouds_from_distribution(distrib, row_labels, val_label...
function generate_wordcloud_from_probabilities_and_words (line 152) | def generate_wordcloud_from_probabilities_and_words(prob, words, return_...
function generate_wordcloud_from_weights (line 180) | def generate_wordcloud_from_weights(weights, return_image=True, wordclou...
function plot_topic_word_ranked_prob (line 215) | def plot_topic_word_ranked_prob(fig, ax, topic_word_distrib, n,
function plot_doc_topic_ranked_prob (line 244) | def plot_doc_topic_ranked_prob(fig, ax, doc_topic_distrib, n,
function plot_prob_distrib_ranked_prob (line 273) | def plot_prob_distrib_ranked_prob(fig, ax, data, x_limit, log_scale=True...
function plot_doc_topic_heatmap (line 372) | def plot_doc_topic_heatmap(fig, ax, doc_topic_distrib, doc_labels, topic...
function plot_topic_word_heatmap (line 453) | def plot_topic_word_heatmap(fig, ax, topic_word_distrib, vocab, topic_la...
function plot_heatmap (line 533) | def plot_heatmap(fig, ax, data,
function plot_eval_results (line 627) | def plot_eval_results(eval_results, metric=None, param=None,
function parameters_for_ldavis (line 844) | def parameters_for_ldavis(topic_word_distrib, doc_topic_distrib, dtm, vo...
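The topicmod.visualize functions draw onto existing matplotlib figure and axes objects. A small sketch, assuming matplotlib is available, that reuses the invented doc_topic matrix from the model_stats example above:

    import matplotlib.pyplot as plt
    from tmtoolkit.topicmod.visualize import plot_doc_topic_heatmap

    fig, ax = plt.subplots()
    plot_doc_topic_heatmap(fig, ax, doc_topic, ['doc1', 'doc2'])   # doc_labels passed positionally
    plt.show()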
FILE: tmtoolkit/utils.py
function enable_logging (line 30) | def enable_logging(level: int = logging.INFO, fmt: str = '%(asctime)s:%(...
function set_logging_level (line 67) | def set_logging_level(level: int) -> None:
function disable_logging (line 81) | def disable_logging() -> None:
function pickle_data (line 95) | def pickle_data(data: Any, picklefile: str, **kwargs) -> None:
function unpickle_file (line 111) | def unpickle_file(picklefile: str, **kwargs) -> Any:
function path_split (line 129) | def path_split(path: str, base: Optional[List[str]] = None) -> List[str]:
function read_text_file (line 157) | def read_text_file(fpath: str, encoding: str, read_size: int = -1, force...
function linebreaks_win2unix (line 176) | def linebreaks_win2unix(text: str) -> str:
function empty_chararray (line 192) | def empty_chararray() -> np.ndarray:
function as_chararray (line 201) | def as_chararray(x: Union[np.ndarray, Sequence]) -> np.ndarray:
function mat2d_window_from_indices (line 222) | def mat2d_window_from_indices(mat: np.ndarray,
function combine_sparse_matrices_columnwise (line 261) | def combine_sparse_matrices_columnwise(matrices: Sequence,
function dict2df (line 385) | def dict2df(data: dict, key_name: str = 'key', value_name: str = 'value'...
function applychain (line 417) | def applychain(funcs: Iterable[Callable], initial_arg: Any) -> Any:
function flatten_list (line 435) | def flatten_list(l: Iterable[Iterable]) -> list:
function _merge_updatable (line 452) | def _merge_updatable(containers: Sequence, init_fn: Callable, safe: bool...
function merge_dicts (line 463) | def merge_dicts(dicts: Sequence[dict], sort_keys: bool = False, safe: bo...
function merge_sets (line 480) | def merge_sets(sets: Sequence[set], safe: bool = False) -> set:
function sample_dict (line 491) | def sample_dict(d: dict, n: int) -> dict:
function greedy_partitioning (line 502) | def greedy_partitioning(elems_dict: Dict[str, Union[int, float]], k: int...
function argsort (line 548) | def argsort(seq: Sequence) -> List[int]:
function split_func_args (line 558) | def split_func_args(fn: Callable, args: Dict[str, Any]) -> Tuple[Dict[st...
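utils.py collects small general-purpose helpers used throughout the package. A few self-contained calls; the pickle path is an illustrative placeholder:

    from tmtoolkit.utils import flatten_list, merge_dicts, pickle_data, unpickle_file

    print(flatten_list([[1, 2], [3]]))               # [1, 2, 3]
    print(merge_dicts([{'a': 1}, {'b': 2}]))         # {'a': 1, 'b': 2}

    pickle_data({'some': 'data'}, 'data.pickle')     # pickle an object to a file ...
    print(unpickle_file('data.pickle'))              # ... and load it back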
Condensed preview — 99 files, each showing path, character count, and a content snippet (full structured content: 8,521K chars).
[
{
"path": ".github/workflows/runtests.yml",
"chars": 1908,
"preview": "# GitHub actions workflow for testing tmtoolkit\n# Runs tests on Ubuntu, MacOS and Windows with Python versions 3.8, 3.9 "
},
{
"path": ".github/workflows/stale.yml",
"chars": 706,
"preview": "name: Close inactive issues\non:\n schedule:\n - cron: \"23 3 * * *\"\n\njobs:\n close-issues:\n runs-on: ubuntu-latest\n "
},
{
"path": ".gitignore",
"chars": 307,
"preview": ".cache/\n.idea/\n**/__pycache__\n*.pyc\n.hypothesis\nbuild/\ndist/\n*.egg-info/\n.~lock.*\nexamples/data/*.pickle\n!examples/data/"
},
{
"path": ".readthedocs.yaml",
"chars": 512,
"preview": "# .readthedocs.yml\n# Read the Docs configuration file\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html fo"
},
{
"path": "AUTHORS.md",
"chars": 345,
"preview": "# Authors\n\n## Maintainer / main developer\n\n[Markus Konrad](https://github.com/internaut) @ [WZB](https://github.com/WZBS"
},
{
"path": "LICENSE",
"chars": 10173,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "MANIFEST.in",
"chars": 192,
"preview": "include AUTHORS.md\ninclude conftest.py\ninclude LICENSE\ninclude README.rst\ninclude requirements.txt\ninclude requirements_"
},
{
"path": "Makefile",
"chars": 460,
"preview": "run_tests:\n\tPYTHONPATH=. pytest -l tests/\n\ncov_tests:\n\tPYTHONPATH=. pytest --cov-report html:.covreport --cov=tmtoolkit "
},
{
"path": "README.rst",
"chars": 9960,
"preview": "**This repository is archived. Further development of tmtoolkit has moved to https://github.com/internaut/tmtoolkit.**\n\n"
},
{
"path": "conftest.py",
"chars": 562,
"preview": "\"\"\"\nConfiguration for tests with pytest\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nfrom hypothesis impor"
},
{
"path": "doc/Makefile",
"chars": 791,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
},
{
"path": "doc/source/api.rst",
"chars": 3898,
"preview": ".. _api:\n\nAPI\n===\n\ntmtoolkit.bow\n-------------\n\ntmtoolkit.bow.bow_stats\n^^^^^^^^^^^^^^^^^^^^^^^\n\n.. automodule:: tmtoolk"
},
{
"path": "doc/source/bow.ipynb",
"chars": 68154,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Working with the Bag-of-Words rep"
},
{
"path": "doc/source/conf.py",
"chars": 2527,
"preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
},
{
"path": "doc/source/data/corpus_example/sample1.txt",
"chars": 128,
"preview": "This is the first example file. ☺ We showcase NER by just randomly listing famous people like Missy Elliott or George Ha"
},
{
"path": "doc/source/data/corpus_example/sample2.txt",
"chars": 142,
"preview": "Here comes the second example (with HTML <i>tags</i> & entities).\n\nThis one contains three lines of plain text which"
},
{
"path": "doc/source/data/corpus_example/sample3.txt",
"chars": 142,
"preview": "And here we go with the third and final example file.\nAnother line of text.\n\n§2.\nThis is the second paragraph.\n\nThe thir"
},
{
"path": "doc/source/data/tm_wordclouds/.gitignore",
"chars": 72,
"preview": "# Ignore everything in this directory\n*\n# Except this file\n!.gitignore\n\n"
},
{
"path": "doc/source/development.rst",
"chars": 17068,
"preview": ".. _development:\n\nDevelopment\n===========\n\nThis part of the documentation serves as developer documentation, i.e. a help"
},
{
"path": "doc/source/getting_started.ipynb",
"chars": 23489,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Getting started\\n\",\n \"\\n\",\n "
},
{
"path": "doc/source/index.rst",
"chars": 549,
"preview": ".. tmtoolkit documentation master file, created by\n sphinx-quickstart on Tue Aug 27 11:30:06 2019.\n You can adapt th"
},
{
"path": "doc/source/install.rst",
"chars": 4802,
"preview": ".. _install:\n\nInstallation\n============\n\nRequirements\n------------\n\n**tmtoolkit works with Python 3.8 or newer (tested u"
},
{
"path": "doc/source/intro.rst",
"chars": 8859,
"preview": "tmtoolkit: Text mining and topic modeling toolkit\n=================================================\n\n|pypi| |pypi_downlo"
},
{
"path": "doc/source/license_note.rst",
"chars": 202,
"preview": "License\n=======\n\nCode licensed under `Apache License 2.0 <https://www.apache.org/licenses/LICENSE-2.0>`_.\nSee `LICENSE <"
},
{
"path": "doc/source/preprocessing.ipynb",
"chars": 400459,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Text preprocessing and basic text"
},
{
"path": "doc/source/text_corpora.ipynb",
"chars": 69652,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Working with text corpora\\n\",\n "
},
{
"path": "doc/source/topic_modeling.ipynb",
"chars": 387959,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Topic modeling\\n\",\n \"\\n\",\n "
},
{
"path": "doc/source/version_history.rst",
"chars": 8869,
"preview": ".. _changes:\n\nVersion history\n===============\n\n0.11.2 - 2022-03-11\n-------------------\n\n- updated `Arun et al. 2010 <htt"
},
{
"path": "examples/README.md",
"chars": 367,
"preview": "# Examples\n\nThis folder contains very few examples for *tmtoolkit*. The majority of examples is available as Jupyter Not"
},
{
"path": "examples/__init__.py",
"chars": 67,
"preview": "\"\"\"\ntmtoolkit – examples\n\nMarkus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n"
},
{
"path": "examples/_benchmarktools.py",
"chars": 509,
"preview": "from datetime import datetime\n\n\ntimings = []\ntiming_labels = []\n\n\ndef add_timing(label):\n timings.append(datetime.tod"
},
{
"path": "examples/benchmark_en_newsarticles.py",
"chars": 1660,
"preview": "\"\"\"\nBenchmarking script that loads and processes English language test corpus with Corpus in parallel.\n\nThis examples re"
},
{
"path": "examples/bundestag18_tfidf.py",
"chars": 8638,
"preview": "\"\"\"\nExample script that loads and processes the proceedings of the 18th German Bundestag and generates a tf-idf matrix.\n"
},
{
"path": "examples/gensim_evaluation.py",
"chars": 2683,
"preview": "\"\"\"\nAn example for topic modeling evaluation with gensim.\n\nPlease note that this is just an example for showing how to p"
},
{
"path": "examples/minimal_tfidf.py",
"chars": 977,
"preview": "\"\"\"\nA minimal example to showcase a few features of tmtoolkit.\n\nMarkus Konrad <markus.konrad@wzb.eu>\nFeb. 2022\n\"\"\"\n\nfrom"
},
{
"path": "examples/topicmod_ap_nips_eval.py",
"chars": 2382,
"preview": "\"\"\"\nTopic model evaluation for AP and NIPS datasets (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words).\n\nThis example"
},
{
"path": "examples/topicmod_lda.py",
"chars": 5671,
"preview": "\"\"\"\nAn example for topic modeling with LDA with focus on the new plotting functions in `tmtoolkit.corpus.visualize` and\n"
},
{
"path": "requirements.txt",
"chars": 220,
"preview": "# requirements.txt\n#\n# installs dependencies from ./setup.py, and the package itself,\n# in editable mode for development"
},
{
"path": "requirements_doc.txt",
"chars": 140,
"preview": "# requirements_doc.txt\n#\n# installs doc dependencies from ./setup.py, and the package itself,\n# in editable mode for dev"
},
{
"path": "scripts/fulldata/.gitignore",
"chars": 85,
"preview": "# Ignore everything in this directory\n*\n# Except these files\n!.gitignore\n!README.md\n\n"
},
{
"path": "scripts/fulldata/README.md",
"chars": 582,
"preview": "This folder stores the full datasets from which the sample datasets for the built-in corpora in tmtoolkit are generated:"
},
{
"path": "scripts/nips_data.py",
"chars": 1356,
"preview": "\"\"\"\nConvert NIPS data from http://archive.ics.uci.edu/ml/datasets/Bag+of+Words to sparse DTM format stored as pickle fil"
},
{
"path": "scripts/prepare_corpora.R",
"chars": 1039,
"preview": "set.seed(20200511)\nSAMPLE_N <- 1000\nOUTPUT_PATH <- '../tmtoolkit/data/'\nFILE_PREFIX <- 'parlspeech-v2-sample-'\n\nsample_r"
},
{
"path": "scripts/tmp/.gitignore",
"chars": 72,
"preview": "# Ignore everything in this directory\n*\n# Except this file\n!.gitignore\n\n"
},
{
"path": "setup.py",
"chars": 3148,
"preview": "\"\"\"\ntmtoolkit setuptools based setup module\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nimport os\nfrom co"
},
{
"path": "tests/__init__.py",
"chars": 74,
"preview": "\"\"\"\ntmtoolkit – automated tests\n\nMarkus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n"
},
{
"path": "tests/_testtextdata.py",
"chars": 48978,
"preview": "\"\"\"\nTest corpora for different languages.\n\"\"\"\n\nimport random\n\nrandom.seed(20200203)\n\ntextdata_sm = {\n 'en': {\n "
},
{
"path": "tests/_testtools.py",
"chars": 1761,
"preview": "import string\n\nfrom hypothesis import strategies as st\nfrom hypothesis.extra.numpy import arrays, array_shapes\n\n\ndef str"
},
{
"path": "tests/data/.gitignore",
"chars": 22,
"preview": "test_pickle_unpickle*\n"
},
{
"path": "tests/data/100NewsArticles.csv",
"chars": 374135,
"preview": "article_id,publish_date,article_source_link,title,subtitle,text,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"
},
{
"path": "tests/data/bt18_speeches_sample.csv",
"chars": 5587071,
"preview": ",speaker_fp,text\n10942,christian-flisek,\"Herr Präsident! Meine Damen und Herren! Kolleginnen und Kollegen! Nach dem ehem"
},
{
"path": "tests/data/gutenberg/kafka_verwandlung.txt",
"chars": 142694,
"preview": "\r\n\r\n\r\n\r\n\r\n\r\n DIE VERWANDLUNG\r\n\r\n VON\r\n\r\n "
},
{
"path": "tests/data/gutenberg/werther/goethe_werther1.txt",
"chars": 110996,
"preview": "Die Leiden des jungen Werther von Johann Wolfgang von Goethe\r\n\r\n\r\nHamburger Ausgabe, Band 6\r\n\r\n\r\n\r\n\r\nErstes Buch\r\n\r\n\r\nAm"
},
{
"path": "tests/data/gutenberg/werther/goethe_werther2.txt",
"chars": 134675,
"preview": "Die Leiden des jungen Werther\r\nvon Johann Wolfgang von Goethe.\r\n\r\n\r\n\r\nHamburger Ausgabe, Band 6\r\n\r\n\r\n\r\n\r\nZweites Buch\r\n\r"
},
{
"path": "tests/test_bow.py",
"chars": 18857,
"preview": "import numpy as np\nimport pandas as pd\nimport pytest\nfrom hypothesis import settings, given, strategies as st\nfrom scipy"
},
{
"path": "tests/test_corpus.py",
"chars": 153571,
"preview": "\"\"\"\nTests for tmtoolkit.corpus module.\n\nPlease see the special notes under \"tests setup\".\n\n.. codeauthor:: Markus Konrad"
},
{
"path": "tests/test_corpusimport.py",
"chars": 661,
"preview": "\"\"\"\nTests for importing optional tmtoolkit.corpus module.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nfro"
},
{
"path": "tests/test_tokenseq.py",
"chars": 20489,
"preview": "\"\"\"\nTests for tmtoolkit.tokenseq module.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nimport string\nfrom c"
},
{
"path": "tests/test_topicmod__eval_tools.py",
"chars": 1355,
"preview": "import pytest\nfrom scipy.sparse import coo_matrix, issparse\nfrom hypothesis import given, strategies as st\n\nfrom ._testt"
},
{
"path": "tests/test_topicmod_evaluate.py",
"chars": 22402,
"preview": "import random\n\nimport numpy as np\nimport pytest\nfrom hypothesis import given, strategies as st\n\ntry:\n import gensim\n "
},
{
"path": "tests/test_topicmod_model_io.py",
"chars": 8085,
"preview": "import os.path\nimport tempfile\nfrom collections import OrderedDict\n\nimport pytest\nfrom hypothesis import given, strategi"
},
{
"path": "tests/test_topicmod_model_stats.py",
"chars": 20458,
"preview": "import os.path\nimport random\nimport string\n\nimport numpy as np\nimport pytest\nfrom hypothesis import settings, given, str"
},
{
"path": "tests/test_topicmod_visualize.py",
"chars": 6625,
"preview": "import os\n\nimport pytest\nfrom hypothesis import given, strategies as st, settings\n\nimport numpy as np\nimport matplotlib."
},
{
"path": "tests/test_utils.py",
"chars": 15197,
"preview": "import logging\nimport math\nimport os.path\nimport string\nfrom datetime import date\n\nimport pytest\nimport hypothesis.strat"
},
{
"path": "tmtoolkit/__init__.py",
"chars": 563,
"preview": "\"\"\"\ntmtoolkit – Text Mining and Topic Modeling Toolkit for Python\n\nMarkus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nfrom import"
},
{
"path": "tmtoolkit/__main__.py",
"chars": 4494,
"preview": "\"\"\"\ntmtoolkit – Text Mining and Topic Modeling Toolkit for Python\n\nCLI module\n\nMarkus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n"
},
{
"path": "tmtoolkit/bow/__init__.py",
"chars": 227,
"preview": "\"\"\"\nBag-of-Words (BoW) sub-package with modules for generating document-term-matrices (DTMs) and some common statistics "
},
{
"path": "tmtoolkit/bow/bow_stats.py",
"chars": 17866,
"preview": "\"\"\"\nCommon statistics from bag-of-words (BoW) matrices.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nimpor"
},
{
"path": "tmtoolkit/bow/dtm.py",
"chars": 7147,
"preview": "\"\"\"\nFunctions for creating a document-term matrix (DTM) and some compatibility functions for Gensim.\n\n.. codeauthor:: Ma"
},
{
"path": "tmtoolkit/corpus/__init__.py",
"chars": 2881,
"preview": "\"\"\"\nModule for processing text as token sequences in labelled documents. A set of documents is represented as *corpus*\nu"
},
{
"path": "tmtoolkit/corpus/_common.py",
"chars": 4516,
"preview": "\"\"\"\nInternal module with common functions and constants for text processing in the :mod:`tmtoolkit.corpus` module.\n\n.. c"
},
{
"path": "tmtoolkit/corpus/_corpus.py",
"chars": 45863,
"preview": "\"\"\"\nInternal module that implements :class:`Corpus` class representing a set of texts as token sequences in labelled\ndoc"
},
{
"path": "tmtoolkit/corpus/_corpusfuncs.py",
"chars": 208640,
"preview": "\"\"\"\nInternal module that implements functions that operate on :class:`~tmtoolkit.corpus.Corpus` objects.\n\nThe source is "
},
{
"path": "tmtoolkit/corpus/_document.py",
"chars": 24025,
"preview": "\"\"\"\nInternal module that implements :class:`Document` class representing a text document as token sequence.\n\n.. codeauth"
},
{
"path": "tmtoolkit/corpus/_nltk_extras.py",
"chars": 1599,
"preview": "\"\"\"\nInternal module with some additional functions that are only available when the `NLTK <https://www.nltk.org/>`_ pack"
},
{
"path": "tmtoolkit/corpus/visualize.py",
"chars": 18562,
"preview": "\"\"\"\nFunctions to visualize corpus summary statistics.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nimport "
},
{
"path": "tmtoolkit/tokenseq.py",
"chars": 34716,
"preview": "\"\"\"\nModule for functions that work with text represented as *token sequences*, e.g. ``[\"A\", \"test\", \"document\", \".\"]``\na"
},
{
"path": "tmtoolkit/topicmod/__init__.py",
"chars": 890,
"preview": "\"\"\"\nTopic modeling sub-package with modules for model evaluation, model I/O, model statistics, parallel computation and\n"
},
{
"path": "tmtoolkit/topicmod/_common.py",
"chars": 243,
"preview": "\"\"\"\nCommon constants and functions for topic modeling sub-package.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>"
},
{
"path": "tmtoolkit/topicmod/_eval_tools.py",
"chars": 2394,
"preview": "\"\"\"\nCommon utility functions for LDA model evaluation.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nimport"
},
{
"path": "tmtoolkit/topicmod/evaluate.py",
"chars": 21447,
"preview": "\"\"\"\nMetrics for topic model evaluation.\n\nIn order to run model evaluations in parallel use one of the modules :mod:`~tmt"
},
{
"path": "tmtoolkit/topicmod/model_io.py",
"chars": 24387,
"preview": "\"\"\"\nFunctions for printing/exporting topic model results.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\nimpo"
},
{
"path": "tmtoolkit/topicmod/model_stats.py",
"chars": 32452,
"preview": "\"\"\"\nCommon statistics and tools for topic models.\n\n.. [SievertShirley2014] Sievert, C., & Shirley, K. (2014, June). LDAv"
},
{
"path": "tmtoolkit/topicmod/parallel.py",
"chars": 18512,
"preview": "\"\"\"\nBase classes for parallel model fitting and evaluation. See the specific functions and classes in\n:mod:`~tmtoolkit.t"
},
{
"path": "tmtoolkit/topicmod/tm_gensim.py",
"chars": 10509,
"preview": "\"\"\"\nParallel model computation and evaluation using the `Gensim package <https://radimrehurek.com/gensim/>`_.\n\nAvailable"
},
{
"path": "tmtoolkit/topicmod/tm_lda.py",
"chars": 12505,
"preview": "\"\"\"\nParallel model computation and evaluation using the `lda package <https://github.com/lda-project/lda>`_.\n\nAvailable "
},
{
"path": "tmtoolkit/topicmod/tm_sklearn.py",
"chars": 12189,
"preview": "\"\"\"\nParallel model computation and evaluation using the `scikit-learn package <https://scikit-learn.org/>`_.\n\nAvailable "
},
{
"path": "tmtoolkit/topicmod/visualize.py",
"chars": 39508,
"preview": "\"\"\"\nFunctions to visualize topic models and topic model evaluation results.\n\n.. codeauthor:: Markus Konrad <markus.konra"
},
{
"path": "tmtoolkit/types.py",
"chars": 276,
"preview": "\"\"\"\nModule with common types used in type annotations throughout this project.\n\n.. codeauthor:: Markus Konrad <markus.ko"
},
{
"path": "tmtoolkit/utils.py",
"chars": 20224,
"preview": "\"\"\"\nMisc. utility functions.\n\n.. codeauthor:: Markus Konrad <markus.konrad@wzb.eu>\n\"\"\"\n\nimport codecs\nimport logging\nimp"
},
{
"path": "tox.ini",
"chars": 1170,
"preview": "# tox (https://tox.readthedocs.io/) is a tool for running tests\n# in multiple virtualenvs. This configuration file will "
}
]
// ... and 8 more files not listed in this condensed index
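The condensed index above is a plain JSON array of records with "path", "chars" and "preview" fields and can therefore be processed programmatically. The following is a minimal sketch under two assumptions: the array has been saved to a file named tmtoolkit_index.json (a hypothetical name), and the trailing "// ..." note has been stripped, since such a comment is not valid JSON.

import json

# Load the condensed file index (file name is a placeholder).
with open('tmtoolkit_index.json', encoding='utf-8') as f:
    entries = json.load(f)

# Each entry looks like {"path": ..., "chars": ..., "preview": ...}.
total_chars = sum(e['chars'] for e in entries)
print(f'{len(entries)} files indexed, {total_chars:,} characters in total')

# Report the ten largest files by character count.
for e in sorted(entries, key=lambda e: e['chars'], reverse=True)[:10]:
    print(f"{e['chars']:>9,}  {e['path']}")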