Repository: artefactory/NLPretext
Branch: main
Commit: 0d2cc4fe9e5d
Files: 72
Total size: 410.9 KB
Directory structure:
gitextract_i1k0jy7m/
├── .dockerignore
├── .editorconfig
├── .github/
│   ├── .stale.yml
│   ├── CODEOWNERS
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   ├── config.yml
│   │   ├── feature_request.md
│   │   └── question.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── dependabot.yml
│   ├── release-drafter.yml
│   └── workflows/
│       ├── cd.yml
│       ├── ci.yml
│       ├── greetings.yml
│       └── release-drafter.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── README.md
├── SECURITY.md
├── datasets/
│   └── external/
│       ├── get_language_dataset.sh
│       └── get_stanfordtweets.sh
├── docker/
│   ├── Dockerfile
│   └── README.md
├── docs/
│   ├── Makefile
│   ├── make.bat
│   ├── scripts/
│   │   └── buildsite.sh
│   └── source/
│       ├── _templates/
│       │   ├── module.rst_t
│       │   ├── package.rst_t
│       │   └── versions.html
│       ├── conf.py
│       ├── index.rst
│       └── tutorials/
│           ├── basic_notebook.ipynb
│           └── index.rst
├── nlpretext/
│   ├── __init__.py
│   ├── _config/
│   │   ├── __init__.py
│   │   ├── config.py
│   │   ├── constants.py
│   │   └── stopwords.py
│   ├── _utils/
│   │   ├── __init__.py
│   │   ├── daskloader.py
│   │   ├── file_loader.py
│   │   ├── pandasloader.py
│   │   ├── phone_number.py
│   │   └── stopwords.py
│   ├── augmentation/
│   │   ├── __init__.py
│   │   └── text_augmentation.py
│   ├── basic/
│   │   ├── __init__.py
│   │   └── preprocess.py
│   ├── cli/
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   └── preprocess.py
│   ├── preprocessor.py
│   ├── py.typed
│   ├── social/
│   │   ├── __init__.py
│   │   └── preprocess.py
│   ├── textloader.py
│   └── token/
│       ├── __init__.py
│       ├── preprocess.py
│       └── tokenizer.py
├── pyproject.toml
├── references/
│   └── .gitkeep
└── tests/
    ├── __init__.py
    ├── test_data_augmentation.py
    ├── test_file_loader.py
    ├── test_phone_number.py
    ├── test_preprocessor.py
    ├── test_textloader.py
    └── test_tokenizer.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .dockerignore
================================================
# Git
.git
.gitignore
.github
# Docker
.dockerignore
docker/
# IDE
.idea
.vscode
# Byte-compiled / optimized / DLL files
__pycache__/
**/__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.py[cod]
*$py.class
.pytest_cache/
.mypy_cache/
# poetry
.venv
# C extensions
*.so
# Virtual environment
.venv
venv
.DS_Store
.AppleDouble
.LSOverride
._*
================================================
FILE: .editorconfig
================================================
# Check http://editorconfig.org for more information
# This is the main config file for this project:
root = true
[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
[*.{py,pyi}]
indent_style = space
indent_size = 4
[Makefile]
indent_style = tab
[*.md]
trim_trailing_whitespace = false
[*.{diff,patch}]
trim_trailing_whitespace = false
================================================
FILE: .github/.stale.yml
================================================
# Number of days of inactivity before an issue becomes stale
daysUntilStale: 60
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 7
# Issues with these labels will never be considered stale
exemptLabels:
  - pinned
  - security
# Label to use when marking an issue as stale
staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
  recent activity. It will be closed if no further activity occurs. Thank you
  for your contributions.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false
================================================
FILE: .github/CODEOWNERS
================================================
# https://help.github.com/en/articles/about-code-owners
* @julesbertrand @amaleelhamri @hugovasselin @Guillaume6606
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: 🐛 Bug report
about: If something isn't working 🔧
title: ''
labels: bug
assignees:
---
## 🐛 Bug Report
<!-- A clear and concise description of what the bug is. -->
## 🔬 How To Reproduce
Steps to reproduce the behavior:
1. ...
### Code sample
<!-- If applicable, attach a minimal code sample to reproduce the described issue. -->
### Environment
* OS: [e.g. Linux / Windows / macOS]
* Python version, get it with:
```bash
python --version
```
### Screenshots
<!-- If applicable, add screenshots to help explain your problem. -->
## 📈 Expected behavior
<!-- A clear and concise description of what you expected to happen. -->
## 📎 Additional context
<!-- Add any other context about the problem here. -->
================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
# Configuration: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository
blank_issues_enabled: false
================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: 🚀 Feature request
about: Suggest an idea for this project 🏖
title: ''
labels: enhancement
assignees:
---
## 🚀 Feature Request
<!-- A clear and concise description of the feature proposal. -->
## 🔈 Motivation
<!-- Please describe the motivation for this proposal. -->
## 🛰 Alternatives
<!-- A clear and concise description of any alternative solutions or features you've considered. -->
## 📎 Additional context
<!-- Add any other context or screenshots about the feature request here. -->
================================================
FILE: .github/ISSUE_TEMPLATE/question.md
================================================
---
name: ❓ Question
about: Ask a question about this project 🎓
title: ''
labels: question
assignees:
---
## Checklist
<!-- Mark with an `x` all the checkboxes that apply (like `[x]`) -->
- [ ] I've searched the project's [`issues`](https://github.com/artefactory/NLPretext/issues?q=is%3Aissue).
## ❓ Question
<!-- What is your question -->
How can I [...]?
Is it possible to [...]?
## 📎 Additional context
<!-- Add any other context or screenshots about the question here. -->
================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
## Description
<!-- Add a more detailed description of the changes if needed. -->
## Related Issue
<!-- If your PR refers to a related issue, link it here. -->
## Type of Change
<!-- Mark with an `x` all the checkboxes that apply (like `[x]`) -->
- [ ] 📚 Examples / docs / tutorials / dependencies update
- [ ] 🔧 Bug fix (non-breaking change which fixes an issue)
- [ ] 🥂 Improvement (non-breaking change which improves an existing feature)
- [ ] 🚀 New feature (non-breaking change which adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing functionality to change)
- [ ] 🔐 Security fix
## Checklist
<!-- Mark with an `x` all the checkboxes that apply (like `[x]`) -->
- [ ] I've read the [`CODE_OF_CONDUCT.md`](https://github.com/artefactory/NLPretext/blob/main/CODE_OF_CONDUCT.md) document.
- [ ] I've read the [`CONTRIBUTING.md`](https://github.com/artefactory/NLPretext/blob/main/CONTRIBUTING.md) guide.
- [ ] I've updated the code style using `make format-code`.
- [ ] I've written tests for all new methods and classes that I created.
- [ ] I've written docstrings in Numpydoc format (see `CONTRIBUTING.md`) for all the methods and classes that I created.
================================================
FILE: .github/dependabot.yml
================================================
# Configuration: https://dependabot.com/docs/config-file/
# Docs: https://docs.github.com/en/github/administering-a-repository/keeping-your-dependencies-updated-automatically
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
      time: "09:00"
    allow:
      - dependency-type: "all"
    ignore:
      - dependency-name: "*"
        update-types: ["version-update:semver-patch"]
    labels:
      - draft
      - dependencies
      - python
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
      time: "09:00"
    allow:
      - dependency-type: "all"
    labels:
      - draft
      - dependencies
      - github_actions
  - package-ecosystem: "docker"
    directory: "/docker/"
    schedule:
      interval: "weekly"
      day: "monday"
      time: "09:00"
    allow:
      - dependency-type: "all"
    labels:
      - draft
      - dependencies
      - docker
================================================
FILE: .github/release-drafter.yml
================================================
# Release drafter configuration https://github.com/release-drafter/release-drafter#configuration
# Emojis were chosen to match the https://gitmoji.carloscuesta.me/
name-template: "$NEXT_PATCH_VERSION"
tag-template: "$NEXT_PATCH_VERSION"
categories:
  - title: ":rocket: Features"
    labels: [enhancement, feature]
  - title: ":wrench: Fixes & Refactoring"
    labels: [bug, refactoring, bugfix, fix]
  - title: ":package: Build System & CI/CD"
    labels: [build, ci, testing]
  - title: ":boom: Breaking Changes"
    labels: [breaking]
  - title: ":pencil: Documentation"
    labels: [documentation]
  - title: ":arrow_up: Dependencies updates"
    labels: [dependencies]
template: |
  ## What’s Changed
  $CHANGES
  ## :busts_in_silhouette: List of contributors
  $CONTRIBUTORS
================================================
FILE: .github/workflows/cd.yml
================================================
name: Continuous Deployment
on:
  release:
    types: [published]
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to Github Container Registry
        uses: docker/login-action@v3
        with:
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
          registry: ghcr.io
      - name: Set tag name
        id: tag
        run: echo "tag_name=${GITHUB_REF//\//-}" >> $GITHUB_OUTPUT
        env:
          GITHUB_REF: ${{ github.ref }}
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ./docker/Dockerfile
          push: true
          tags: |
            ghcr.io/artefactory/nlpretext:${{ steps.tag.outputs.tag_name }}
            ghcr.io/artefactory/nlpretext:latest
          cache-from: type=registry,ref=ghcr.io/artefactory/nlpretext:latest
          cache-to: type=inline
      - name: Scan image
        uses: anchore/scan-action@v3
        id: scan
        with:
          image: "ghcr.io/artefactory/nlpretext:${{ steps.tag.outputs.tag_name }}"
          output-format: table
      - name: upload Anchore scan SARIF report
        if: success() || failure()
        uses: github/codeql-action/upload-sarif@v1
        with:
          sarif_file: ${{ steps.scan.outputs.sarif }}
  documentation_and_package:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install poetry and pandoc
        run: |
          sudo apt-get install pandoc
          make download-poetry
      - name: Set up cache
        uses: actions/cache@v3.3.2
        with:
          path: ~/.cache/pypoetry/virtualenvs
          key: venv-${{ matrix.python-version }}-${{ hashFiles('pyproject.toml') }}-${{ hashFiles('poetry.lock') }}
      - name: Set Poetry Path
        run: |
          echo "$HOME/.poetry/bin" >> $GITHUB_PATH
      - name: Install dependencies
        run: |
          poetry install -E torch -E dask
      - name: Publish to PyPI
        env:
          PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
        run: |
          poetry config pypi-token.pypi $PYPI_TOKEN
          poetry publish --build
      - name: Run build script for Sphinx pages
        run: |
          poetry run git config --global user.name "Github-Pages Bot"
          poetry run git config --global user.email "github-pages@artefactory.com"
          poetry run sh docs/scripts/buildsite.sh
        shell: bash
================================================
FILE: .github/workflows/ci.yml
================================================
# GNU Lesser General Public License v3.0 only
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
name: Continuous Integration
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - '*'
jobs:
  ci:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10"]
    if: ${{ !contains(github.event.pull_request.labels.*.name, 'draft') }}
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Install poetry
        run: make download-poetry
      - name: Set up pip cache
        uses: actions/cache@v3.3.2
        with:
          path: ~/.cache/pypoetry/virtualenvs
          key: venv-${{ matrix.python-version }}-${{ hashFiles('pyproject.toml') }}-${{ hashFiles('poetry.lock') }}
      - name: Set up mypy cache
        uses: actions/cache@v3.2.4
        with:
          path: ${{ github.workspace }}/.mypy_cache
          key: mypy-${{ matrix.python-version }}
      - name: Set Poetry Path
        run: |
          echo "$HOME/.poetry/bin" >> $GITHUB_PATH
      - name: Install dependencies
        run: |
          poetry run pip install --upgrade pip
          poetry install -E torch -E dask
      - name: Run safety checks
        run: |
          STRICT=1 make check-safety
      - name: Lint and format
        run: |
          make format-code
      - name: Run tests
        run: |
          make test
================================================
FILE: .github/workflows/greetings.yml
================================================
name: Greetings
on:
  pull_request:
    types:
      - opened
      - reopened
      - edited
      - labeled
      - unlabeled
      - synchronize
  issues:
jobs:
  greeting:
    runs-on: ubuntu-latest
    if: ${{ !contains(github.head_ref, 'dependabot/') }}
    steps:
      - uses: actions/first-interaction@v1
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          pr-message: 'Hello @${{ github.actor }}, thank you for submitting a PR! We will respond as soon as possible.'
          issue-message: |
            Hello @${{ github.actor }}, thank you for your interest in our work!
            If this is a bug report, please provide screenshots and **minimum viable code to reproduce your issue**, otherwise we cannot help you.
================================================
FILE: .github/workflows/release-drafter.yml
================================================
name: Release Drafter
on:
  push:
    # branches to consider in the event; optional, defaults to all
    branches:
      - main
jobs:
  update_release_draft:
    runs-on: ubuntu-latest
    steps:
      # Drafts your next Release notes as Pull Requests are merged into "main"
      - uses: release-drafter/release-drafter@v5.22.0
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
================================================
FILE: .gitignore
================================================
# Created by https://www.gitignore.io/api/osx,python,pycharm,windows,visualstudio,visualstudiocode
# Edit at https://www.gitignore.io/?templates=osx,python,pycharm,windows,visualstudio,visualstudiocode
### OSX ###
# General
.DS_Store
.AppleDouble
.LSOverride
# Icon must end with two \r
Icon
# Thumbnails
._*
# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
### PyCharm ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff
.idea/
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr
# CMake
cmake-build-*/
# Mongo Explorer plugin
.idea/**/mongoSettings.xml
# File-based project format
*.iws
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
# Editor-based Rest Client
.idea/httpRequests
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser
### PyCharm Patch ###
# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721
# *.iml
# modules.xml
# .idea/misc.xml
# *.ipr
# Sonarlint plugin
.idea/**/sonarlint/
# SonarQube Plugin
.idea/**/sonarIssues.xml
# Markdown Navigator plugin
.idea/**/markdown-navigator.xml
.idea/**/markdown-navigator/
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
.ruff_cache/
# Translations
*.mo
*.pot
# Scrapy stuff:
.scrapy
# Django stuff:
*.log
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# pyenv
.python-version
# poetry
.venv
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# Plugins
.secrets.baseline
### VisualStudioCode ###
.vscode/*
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
### VisualStudioCode Patch ###
# Ignore all local history of files
.history
### Windows ###
# Windows thumbnail cache files
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
# Dump file
*.stackdump
# Folder config file
[Dd]esktop.ini
# Recycle Bin used on file shares
$RECYCLE.BIN/
# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp
# Windows shortcuts
*.lnk
### VisualStudio ###
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##
## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore
# User-specific files
*.rsuser
*.suo
*.user
*.userosscache
*.sln.docstates
# User-specific files (MonoDevelop/Xamarin Studio)
*.userprefs
# Mono auto generated files
mono_crash.*
# Build results
[Dd]ebug/
[Dd]ebugPublic/
[Rr]elease/
[Rr]eleases/
x64/
x86/
[Aa][Rr][Mm]/
[Aa][Rr][Mm]64/
bld/
[Bb]in/
[Oo]bj/
[Ll]og/
# Visual Studio 2015/2017 cache/options directory
.vs/
# Uncomment if you have tasks that create the project's static files in wwwroot
#wwwroot/
# Visual Studio 2017 auto generated files
Generated\ Files/
# MSTest test Results
[Tt]est[Rr]esult*/
[Bb]uild[Ll]og.*
# NUnit
*.VisualState.xml
TestResult.xml
nunit-*.xml
# Build Results of an ATL Project
[Dd]ebugPS/
[Rr]eleasePS/
dlldata.c
# Benchmark Results
BenchmarkDotNet.Artifacts/
# .NET Core
project.lock.json
project.fragment.lock.json
artifacts/
# StyleCop
StyleCopReport.xml
# Files built by Visual Studio
*_i.c
*_p.c
*_h.h
*.ilk
*.obj
*.iobj
*.pch
*.pdb
*.ipdb
*.pgc
*.pgd
*.rsp
*.sbr
*.tlb
*.tli
*.tlh
*.tmp
*.tmp_proj
*_wpftmp.csproj
*.vspscc
*.vssscc
.builds
*.pidb
*.svclog
*.scc
# Chutzpah Test files
_Chutzpah*
# Visual C++ cache files
ipch/
*.aps
*.ncb
*.opendb
*.opensdf
*.sdf
*.cachefile
*.VC.db
*.VC.VC.opendb
# Visual Studio profiler
*.psess
*.vsp
*.vspx
*.sap
# Visual Studio Trace Files
*.e2e
# TFS 2012 Local Workspace
$tf/
# Guidance Automation Toolkit
*.gpState
# ReSharper is a .NET coding add-in
_ReSharper*/
*.[Rr]e[Ss]harper
*.DotSettings.user
# JustCode is a .NET coding add-in
.JustCode
# TeamCity is a build add-in
_TeamCity*
# DotCover is a Code Coverage Tool
*.dotCover
# AxoCover is a Code Coverage Tool
.axoCover/*
!.axoCover/settings.json
# Visual Studio code coverage results
*.coverage
*.coveragexml
# NCrunch
_NCrunch_*
.*crunch*.local.xml
nCrunchTemp_*
# MightyMoose
*.mm.*
AutoTest.Net/
# Web workbench (sass)
.sass-cache/
# Installshield output folder
[Ee]xpress/
# DocProject is a documentation generator add-in
DocProject/buildhelp/
DocProject/Help/*.HxT
DocProject/Help/*.HxC
DocProject/Help/*.hhc
DocProject/Help/*.hhk
DocProject/Help/*.hhp
DocProject/Help/Html2
DocProject/Help/html
# Click-Once directory
publish/
# Publish Web Output
*.[Pp]ublish.xml
*.azurePubxml
# Note: Comment the next line if you want to checkin your web deploy settings,
# but database connection strings (with potential passwords) will be unencrypted
*.pubxml
*.publishproj
# Microsoft Azure Web App publish settings. Comment the next line if you want to
# checkin your Azure Web App publish settings, but sensitive information contained
# in these scripts will be unencrypted
PublishScripts/
# NuGet Packages
*.nupkg
# NuGet Symbol Packages
*.snupkg
# The packages folder can be ignored because of Package Restore
**/[Pp]ackages/*
# except build/, which is used as an MSBuild target.
!**/[Pp]ackages/build/
# Uncomment if necessary however generally it will be regenerated when needed
#!**/[Pp]ackages/repositories.config
# NuGet v3's project.json files produces more ignorable files
*.nuget.props
*.nuget.targets
# Microsoft Azure Build Output
csx/
*.build.csdef
# Microsoft Azure Emulator
ecf/
rcf/
# Windows Store app package directories and files
AppPackages/
BundleArtifacts/
Package.StoreAssociation.xml
_pkginfo.txt
*.appx
*.appxbundle
*.appxupload
# Visual Studio cache files
# files ending in .cache can be ignored
*.[Cc]ache
# but keep track of directories ending in .cache
!?*.[Cc]ache/
# Others
ClientBin/
~$*
*~
*.dbmdl
*.dbproj.schemaview
*.jfm
*.pfx
*.publishsettings
orleans.codegen.cs
# Including strong name files can present a security risk
# (https://github.com/github/gitignore/pull/2483#issue-259490424)
#*.snk
# Since there are multiple workflows, uncomment next line to ignore bower_components
# (https://github.com/github/gitignore/pull/1529#issuecomment-104372622)
#bower_components/
# RIA/Silverlight projects
Generated_Code/
# Backup & report files from converting an old project file
# to a newer Visual Studio version. Backup files are not needed,
# because we have git ;-)
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
UpgradeLog*.htm
ServiceFabricBackup/
*.rptproj.bak
# SQL Server files
*.mdf
*.ldf
*.ndf
# Business Intelligence projects
*.rdl.data
*.bim.layout
*.bim_*.settings
*.rptproj.rsuser
*- [Bb]ackup.rdl
*- [Bb]ackup ([0-9]).rdl
*- [Bb]ackup ([0-9][0-9]).rdl
# Microsoft Fakes
FakesAssemblies/
# GhostDoc plugin setting file
*.GhostDoc.xml
# Node.js Tools for Visual Studio
.ntvs_analysis.dat
node_modules/
# Visual Studio 6 build log
*.plg
# Visual Studio 6 workspace options file
*.opt
# Visual Studio 6 auto-generated workspace file (contains which files were open etc.)
*.vbw
# Visual Studio LightSwitch build output
**/*.HTMLClient/GeneratedArtifacts
**/*.DesktopClient/GeneratedArtifacts
**/*.DesktopClient/ModelManifest.xml
**/*.Server/GeneratedArtifacts
**/*.Server/ModelManifest.xml
_Pvt_Extensions
# Paket dependency manager
.paket/paket.exe
paket-files/
# FAKE - F# Make
.fake/
# CodeRush personal settings
.cr/personal
# Python Tools for Visual Studio (PTVS)
*.pyc
# Cake - Uncomment if you are using it
# tools/**
# !tools/packages.config
# Tabs Studio
*.tss
# Telerik's JustMock configuration file
*.jmconfig
# BizTalk build output
*.btp.cs
*.btm.cs
*.odx.cs
*.xsd.cs
# OpenCover UI analysis results
OpenCover/
# Azure Stream Analytics local run output
ASALocalRun/
# MSBuild Binary and Structured Log
*.binlog
# NVidia Nsight GPU debugger configuration file
*.nvuser
# MFractors (Xamarin productivity tool) working folder
.mfractor/
# Local History for Visual Studio
.localhistory/
# BeatPulse healthcheck temp database
healthchecksdb
# Backup folder for Package Reference Convert tool in Visual Studio 2017
MigrationBackup/
# DotEnv configuration
.env
# Database
*.db
*.rdb
# Pycharm
.idea
venv/
# VS Code
.vscode/
# Spyder
.spyproject/
# Jupyter NB Checkpoints
.ipynb_checkpoints/
# exclude data from source control by default
data/
# vim
*.swp
*.swo
================================================
FILE: .pre-commit-config.yaml
================================================
default_language_version:
  python: python3.10
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-json
      - id: check-added-large-files
  - repo: local
    hooks:
      - id: isort
        name: isort
        entry: poetry run isort --settings-path pyproject.toml
        types: [python]
        language: system
        stages: [commit, push]
      - id: pyupgrade
        name: pyupgrade
        entry: poetry run pyupgrade --py38-plus
        types: [python]
        language: system
        stages: [commit, push]
      - id: black
        name: black
        entry: poetry run black --config pyproject.toml
        types: [python]
        language: system
        stages: [commit, push]
      - id: ruff
        name: ruff
        entry: poetry run ruff check --config pyproject.toml
        types: [python]
        language: system
        stages: [commit, push]
      - id: mypy
        name: mypy
        entry: poetry run mypy
        require_serial: true
        types: [python]
        language: system
        stages: [push]
      - id: gitleaks
        name: gitleaks
        entry: make gitleaks
        require_serial: true
        types: [file]
        language: system
        pass_filenames: false
        stages: [push]
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at rafaelle.aygalenq@artefact.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
================================================
FILE: CONTRIBUTING.md
================================================
NLPretext
==============================
# How to contribute
## Dependencies
We use `poetry` to manage the [dependencies](https://github.com/python-poetry/poetry).
If you don't have `poetry` installed, run the command below.
```bash
make download-poetry; export PATH="$HOME/.local/bin:$PATH"
```
To install dependencies and prepare the [`pre-commit`](https://pre-commit.com/) hooks, run the `install` target:
```bash
make install
```
To activate your `virtualenv` run `poetry shell`.
## Codestyle
After you run `make install` you can execute the automatic code formatting.
```bash
make format-code
```
### Checks
Many checks are configured for this project. The `make check-style` command runs black diffs, darglint docstring style checks, and mypy.
The `make check-safety` command checks your code for security issues.
You can also set the `STRICT=1` flag to make a check strict.
### Before submitting
Before submitting your code please do the following steps:
1. Add any changes you want
1. Add tests for the new changes
1. Edit documentation if you have changed something significant
1. Run `make format-code` to format your changes.
1. Run `STRICT=1 make check-style` to ensure that types and docs are correct
1. Run `STRICT=1 make check-safety` to ensure that your code has no known security issues
## Other help
You can contribute by spreading the word about this library.
It would also be a huge contribution to write
a short article on how you are using this project.
You can also share your best practices with us.
# Docstring format
We chose **Numpydoc** among the several docstring [standards](https://stackoverflow.com/questions/3898572/what-is-the-standard-python-docstring-format):
```
"""
My numpydoc description of a kind
of a very exhaustive numpydoc format docstring.
Parameters
----------
first : array_like
the 1st param name `first`
second :
the 2nd param
third : {'value', 'other'}, optional
the 3rd param, by default 'value'
Returns
-------
string
a value in a string
Raises
------
KeyError
when a key error occurs
OtherError
when another error occurs
"""
```
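As an illustration, here is the format applied to a small, runnable function. `truncate` is a hypothetical example written for this guide, not a function from NLPretext.

```python
def truncate(text, max_length=10, suffix="..."):
    """
    Shorten a text to a maximum number of characters.

    Parameters
    ----------
    text : str
        the text to shorten
    max_length : int, optional
        the maximum length of the result, by default 10
    suffix : str, optional
        the marker appended when the text is cut, by default "..."

    Returns
    -------
    str
        the text unchanged if it is short enough, else a cut version
        ending with `suffix`

    Raises
    ------
    ValueError
        when `max_length` is smaller than the length of `suffix`
    """
    if max_length < len(suffix):
        raise ValueError("max_length must be at least len(suffix)")
    if len(text) <= max_length:
        return text
    # Keep room for the suffix so the result never exceeds max_length.
    return text[: max_length - len(suffix)] + suffix


print(truncate("preprocessing", max_length=8))
```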
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: Makefile
================================================
SHELL := /usr/bin/env bash
IMAGE := nlpretext
VERSION := latest
NO_CHECK_FLAG = || true
ifeq ($(STRICT), 1)
POETRY_COMMAND_FLAG =
PIP_COMMAND_FLAG =
SAFETY_COMMAND_FLAG =
BANDIT_COMMAND_FLAG =
SECRETS_COMMAND_FLAG =
BLACK_COMMAND_FLAG =
DARGLINT_COMMAND_FLAG =
ISORT_COMMAND_FLAG =
MYPY_COMMAND_FLAG =
else
POETRY_COMMAND_FLAG = $(NO_CHECK_FLAG)
PIP_COMMAND_FLAG = $(NO_CHECK_FLAG)
SAFETY_COMMAND_FLAG = $(NO_CHECK_FLAG)
BANDIT_COMMAND_FLAG = $(NO_CHECK_FLAG)
SECRETS_COMMAND_FLAG = $(NO_CHECK_FLAG)
BLACK_COMMAND_FLAG = $(NO_CHECK_FLAG)
DARGLINT_COMMAND_FLAG = $(NO_CHECK_FLAG)
ISORT_COMMAND_FLAG = $(NO_CHECK_FLAG)
MYPY_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(POETRY_STRICT), 1)
POETRY_COMMAND_FLAG =
else ifeq ($(POETRY_STRICT), 0)
POETRY_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(PIP_STRICT), 1)
PIP_COMMAND_FLAG =
else ifeq ($(PIP_STRICT), 0)
PIP_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(SAFETY_STRICT), 1)
SAFETY_COMMAND_FLAG =
else ifeq ($(SAFETY_STRICT), 0)
SAFETY_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(BANDIT_STRICT), 1)
BANDIT_COMMAND_FLAG =
else ifeq ($(BANDIT_STRICT), 0)
BANDIT_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(SECRETS_STRICT), 1)
SECRETS_COMMAND_FLAG =
else ifeq ($(SECRETS_STRICT), 0)
SECRETS_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(BLACK_STRICT), 1)
BLACK_COMMAND_FLAG =
else ifeq ($(BLACK_STRICT), 0)
BLACK_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(DARGLINT_STRICT), 1)
DARGLINT_COMMAND_FLAG =
else ifeq ($(DARGLINT_STRICT), 0)
DARGLINT_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(ISORT_STRICT), 1)
ISORT_COMMAND_FLAG =
else ifeq ($(ISORT_STRICT), 0)
ISORT_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
ifeq ($(MYPY_STRICT), 1)
MYPY_COMMAND_FLAG =
else ifeq ($(MYPY_STRICT), 0)
MYPY_COMMAND_FLAG = $(NO_CHECK_FLAG)
endif
.PHONY: download-poetry
download-poetry:
curl -sSL https://install.python-poetry.org | python3 -
.PHONY: install
install:
poetry env use python3.10
poetry lock -n
poetry install -n
ifneq ($(NO_PRE_COMMIT), 1)
poetry run pre-commit install -t pre-commit -t pre-push
endif
.PHONY: check-safety
check-safety:
poetry check$(POETRY_COMMAND_FLAG) && \
poetry run pip check$(PIP_COMMAND_FLAG) && \
poetry run safety check --full-report$(SAFETY_COMMAND_FLAG) && \
poetry run bandit -r nlpretext/$(BANDIT_COMMAND_FLAG)
.PHONY: gitleaks
gitleaks:
commits="$$(git rev-list --ancestry-path $$(git rev-parse $$(git branch -r --sort=committerdate | tail -1))..$$(git rev-parse HEAD))"; \
if [ "$${commits}" != "" ]; then docker run --rm -v $$(pwd):/code/ zricethezav/gitleaks --path=/code/ -v --commits=$$(echo $${commits} | paste -s -d, -)$(SECRETS_COMMAND_FLAG); fi;
.PHONY: format-code
format-code:
poetry run pre-commit run --all
.PHONY: test
test:
poetry run pytest
.PHONY: lint
lint: check-safety format-code test
# Example: make docker VERSION=latest
# Example: make docker IMAGE=some_name VERSION=1.0.4
.PHONY: docker
docker:
@echo Building docker $(IMAGE):$(VERSION) ...
docker build \
-t $(IMAGE):$(VERSION) . \
-f ./docker/Dockerfile
# Example: make clean_docker VERSION=latest
# Example: make clean_docker IMAGE=some_name VERSION=1.0.4
.PHONY: clean_docker
clean_docker:
@echo Removing docker $(IMAGE):$(VERSION) ...
docker rmi -f $(IMAGE):$(VERSION)
.PHONY: clean_build
clean_build:
rm -rf build/
.PHONY: clean
clean: clean_build clean_docker
================================================
FILE: README.md
================================================
# NLPretext
<p align="center">
<img src="/references/logo_nlpretext.png" />
</p>
<div align="center">
[](https://github.com/artefactory/NLPretext/actions/workflows/ci.yml?query=branch%3Amain)
[](https://github.com/artefactory/NLPretext/actions/workflows/cd.yml?query=event%3Arelease)
[](#supported-python-versions)
[](https://github.com/artefactory/NLPretext/pulls?utf8=%E2%9C%93&q=is%3Apr%20author%3Aapp%2Fdependabot)
[](https://github.com/psf/black)
[](https://github.com/PyCQA/bandit)
[](https://github.com/artefactory/NLPretext/blob/main/.pre-commit-config.yaml)
[](https://github.com/artefactory/NLPretext/releases)
[](https://github.com/artefactory/NLPretext/tree/main/docs)
[](https://github.com/artefactory/NLPretext/blob/main/LICENSE)
All the goto functions you need to handle NLP use-cases, integrated in NLPretext
</div>
# TL;DR
> *Working on an NLP project and tired of always looking for the same silly preprocessing functions on the web?* :tired_face:
> *Need to efficiently extract email addresses from a document? Hashtags from tweets? Remove accents from a French post?* :disappointed_relieved:
**NLPretext has got you covered!** :rocket:
NLPretext packages in a **single** library all the text **preprocessing** functions you need to **ease** your NLP project.
:mag: Quickly explore our preprocessing pipelines and individual functions reference below.
* [Default preprocessing pipeline](#default_pipeline)
* [Custom preprocessing pipeline](#custom_pipeline)
* [Replacing phone numbers](#replace_phone_numbers)
* [Removing hashtags](#remove_hashtags)
* [Extracting emojis](#extract_emojis)
* [Data augmentation](#data_augmentation)
Cannot find what you were looking for? Feel free to open an [issue](https://github.com/artefactory/nlpretext/issues).
# Installation
### Supported Python Versions
- Main version supported : `3.8`
- Other supported versions : `3.9`, `3.10`
We strongly advise you to do the remaining steps in a virtual environment.
To install this library from PyPI, run the following command:
```bash
pip install nlpretext
```
or with `Poetry`
```bash
poetry add nlpretext
```
# Usage
## Default pipeline <a name="default_pipeline"></a>
Need to preprocess your text data but have no clue which functions to use and in which order? The default preprocessing pipeline has got you covered:
```python
from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life!!! I recommend"
```
## Create your custom pipeline <a name="custom_pipeline"></a>
If you know exactly which functions to apply to your data, you can also create your own custom pipeline. Here's an example:
```python
from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters,
remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"
```
Take a look at all the functions that are available [here](https://github.com/artefactory/NLPretext/tree/master/nlpretext) in the ```preprocess.py``` scripts in the different folders: basic, social, token.
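The `pipe`/`run` pattern shown above amounts to storing functions with their keyword arguments and applying them in order. The sketch below is a self-contained stand-in for that idea, not NLPretext's implementation; `MiniPipeline`, `strip_mentions` and `collapse_spaces` are hypothetical names.

```python
import re


class MiniPipeline:
    """Toy stand-in for the pipe/run pattern (not the NLPretext class)."""

    def __init__(self):
        self._steps = []

    def pipe(self, func, args=None):
        # Store the function with its optional keyword arguments.
        self._steps.append((func, args or {}))

    def run(self, text):
        # Apply every stored step in the order it was added.
        for func, kwargs in self._steps:
            text = func(text, **kwargs)
        return text


def strip_mentions(text, replace_with=""):
    # Naive pattern: an "@" followed by word characters.
    return re.sub(r"@\w+", replace_with, text)


def collapse_spaces(text):
    # Squeeze runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


pipeline = MiniPipeline()
pipeline.pipe(str.lower)
pipeline.pipe(strip_mentions, args={"replace_with": ""})
pipeline.pipe(collapse_spaces)

print(pipeline.run("Best dinner   @latourdargent !!!"))
```

Order matters in such a pipeline: here whitespace is collapsed last so that the gap left by the removed mention is cleaned up too.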
## Load text data
Pre-processing text data is useful only if you have loaded data to process! Importing text data as strings in your code can be really simple if you have short texts contained in a local .txt file, but it can quickly become difficult if you want to load many texts stored in multiple formats and split across multiple files. Fortunately, you can use NLPretext's TextLoader class to easily import text data.
Although it is not mandatory, TextLoader works best with dask, so make sure the library is installed if you want the best performance.
```python
from nlpretext.textloader import TextLoader
files_path = "local_folder/texts/text.txt"
text_loader = TextLoader(use_dask=True)
text_dataframe = text_loader.read_text(files_path)
print(text_dataframe.text.values.tolist())
# ["I just got the best dinner in my life!!!", "I recommend", "It was awesome"]
```
The file path can be provided as a string or a list of strings, with or without wildcards. Imports from cloud providers are also supported, provided your machine is authenticated on the project.
```python
text_loader = TextLoader(text_column="name_of_text_column_in_your_data")
local_file_path = "local_folder/texts/text.csv" # File from local folder
local_corpus_path = ["local_folder/texts/text_1.csv", "local_folder/texts/text_2.csv", "local_folder/texts/text_3.csv"] # Multiple files from local folder
gcs_file_path = "gs://my-bucket/texts/text.json" # File from GCS
s3_file_path = "s3://my-bucket/texts/text.json" # File from S3
hdfs_file_path = "hdfs://folder/texts/text.txt" # File from HDFS
azure_file_path = "az://my-bucket/texts/text.parquet" # File from Azure
gcs_corpus_path = "gs://my-bucket/texts/text_*.json" # Multiple files from GCS with wildcard
text_dataframe_1 = text_loader.read_text(local_file_path)
text_dataframe_2 = text_loader.read_text(local_corpus_path)
text_dataframe_3 = text_loader.read_text(gcs_file_path)
text_dataframe_4 = text_loader.read_text(s3_file_path)
text_dataframe_5 = text_loader.read_text(hdfs_file_path)
text_dataframe_6 = text_loader.read_text(azure_file_path)
text_dataframe_7 = text_loader.read_text(gcs_corpus_path)
```
You can also specify a Preprocessor if you want your data to be directly pre-processed when loaded.
```python
from nlpretext import Preprocessor
from nlpretext.textloader import TextLoader
text_loader = TextLoader(text_column="text_col")
preprocessor = Preprocessor()
local_file_path = "local_folder/texts/text.csv" # File from local folder
raw_text_dataframe = text_loader.read_text(local_file_path)
preprocessed_text_dataframe = text_loader.read_text(local_file_path, preprocessor=preprocessor)
print(raw_text_dataframe.text_col.values.tolist())
# ["These texts are not preprocessed", "This is bad ## "]
print(preprocessed_text_dataframe.text_col.values.tolist())
# ["These texts are not preprocessed", "This is bad"]
```
## Individual Functions
### Replacing emails <a name="replace_emails"></a>
```python
from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to obama@whitehouse.gov"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"
```
### Replacing phone numbers <a name="replace_phone_numbers"></a>
```python
from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"
```
### Removing Hashtags <a name="remove_hashtags"></a>
```python
from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"
```
### Extracting emojis <a name="extract_emojis"></a>
```python
from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']
```
## Data augmentation <a name="data_augmentation"></a>
The augmentation module helps you **generate new texts** from your examples by modifying some of their words while **keeping associated entities unchanged**, if any, for **NER tasks**. If you want words other than entities to remain unchanged, list them in the `stopwords` argument. The modifications depend on the chosen method; the module currently supports **substitution with synonyms** using Wordnet or BERT, through the [`nlpaug`](https://github.com/makcedward/nlpaug) library.
```python
from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
print(example)
# "I need to buy a small black pocketbook please."
```
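To make the entity-preservation idea concrete, the sketch below swaps words for synonyms while skipping any word that overlaps an entity span or appears in a stop list. It is a simplified illustration with a hypothetical hard-coded synonym table (`SYNONYMS`) and helper (`substitute_synonyms`); it does not use Wordnet, BERT or `nlpaug`, and it is not how `augment_text` is implemented.

```python
import re

# Hypothetical, hard-coded synonym table used only for this illustration.
SYNONYMS = {"want": "need", "handbag": "pocketbook", "black": "dark"}


def substitute_synonyms(text, entities=(), stopwords=()):
    """Replace words by synonyms, leaving entities and stopwords untouched."""
    protected = [(e["startCharIndex"], e["endCharIndex"]) for e in entities]

    def replace(match):
        word = match.group(0)
        start, end = match.span()
        # Skip words overlapping a protected entity span.
        if any(start < p_end and end > p_start for p_start, p_end in protected):
            return word
        # Skip words the caller asked to keep as they are.
        if word.lower() in stopwords:
            return word
        return SYNONYMS.get(word.lower(), word)

    return re.sub(r"\w+", replace, text)


example = "I want to buy a small black handbag please."
entities = [{"entity": "Color", "word": "black", "startCharIndex": 22, "endCharIndex": 27}]
print(substitute_synonyms(example, entities=entities))
```

Note that this sketch does not update the entity offsets after a substitution changes the text length, which a real NER augmentation has to handle.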
# 📈 Releases
You can see the list of available releases on the [GitHub Releases](https://github.com/artefactory/NLPretext/releases) page.
We follow [Semantic Versions](https://semver.org/) specification.
We use [`Release Drafter`](https://github.com/marketplace/actions/release-drafter). As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you’re ready. With the categories option, you can categorize pull requests in release notes using labels.
For Pull Requests, these labels are configured, by default:
| **Label** | **Title in Releases** |
| :-----------------------------------: | :---------------------: |
| `enhancement`, `feature` | 🚀 Features |
| `bug`, `refactoring`, `bugfix`, `fix` | 🔧 Fixes & Refactoring |
| `build`, `ci`, `testing` | 📦 Build System & CI/CD |
| `breaking` | 💥 Breaking Changes |
| `documentation` | 📝 Documentation |
| `dependencies` | ⬆️ Dependencies updates |
GitHub creates the `bug`, `enhancement`, and `documentation` labels automatically. Dependabot creates the `dependencies` label. Create the remaining labels on the Issues tab of the GitHub repository, when needed.
## 🛡 License
[](https://github.com/artefactory/NLPretext/blob/main/LICENSE)
This project is licensed under the terms of the `Apache Software License 2.0` license. See [LICENSE](https://github.com/artefactory/NLPretext/blob/main/LICENSE) for more details.
## 📃 Citation
```
@misc{nlpretext,
author = {artefactory},
title = {All the goto functions you need to handle NLP use-cases, integrated in NLPretext},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/artefactory/NLPretext}}
}
```
# Project Organization
------------
.
├── .github/workflows <- Where the CI and CD live
├── datasets/external <- Bash scripts to download external datasets
├── docker <- All you need to build a Docker image from that package
├── docs <- Sphinx HTML documentation
├── nlpretext <- Main Package. This is where the code lives
│ ├── preprocessor.py <- Main preprocessing script
│ ├── textloader.py <- Main loading script
│ ├── augmentation <- Text augmentation script
│ ├── basic <- Basic text preprocessing
│ ├── cli <- Command lines that can be used
│ ├── social <- Social text preprocessing
│ ├── token <- Token text preprocessing
│ ├── textloader <- File loading
│ ├── _config <- Where the configuration and constants live
│ └── _utils <- Where preprocessing utils scripts live
├── references <- assets
├── tests <- Where the tests live
├── .gitignore
├── .pre-commit-config.yaml <- Pre-commit configuration
├── CODE_OF_CONDUCT.md <- Code of conduct guidelines
├── CONTRIBUTING.md <- Contribution guidelines
├── LICENSE
├── Makefile
├── pyproject.toml <- Package build configuration
├── README.md <- The top-level README for developers using this project.
└── SECURITY.md
# Credits
- [textacy](https://github.com/chartbeat-labs/textacy) for the following basic preprocessing functions:
- `fix_bad_unicode`
- `normalize_whitespace`
- `unpack_english_contractions`
- `replace_urls`
- `replace_emails`
- `replace_numbers`
- `replace_currency_symbols`
- `remove_punct`
- `remove_accents`
- `replace_phone_numbers` *(with some modifications of our own)*
================================================
FILE: SECURITY.md
================================================
# Security
## 🔐 Reporting Security Issues
> Do not open issues that might have security implications!
> It is critical that security related issues are reported privately so we have time to address them before they become public knowledge.
Vulnerabilities can be reported by emailing core members:
- artefactory [jules.bertrand@artefact.com](mailto:jules.bertrand@artefact.com)
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
- Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
- Full paths of source file(s) related to the manifestation of the issue
- The location of the affected source code (tag/branch/commit or direct URL)
- Any special configuration required to reproduce the issue
- Environment (e.g. Linux / Windows / macOS)
- Step-by-step instructions to reproduce the issue
- Proof-of-concept or exploit code (if possible)
- Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
## Preferred Languages
We prefer all communications to be in English.
================================================
FILE: datasets/external/get_language_dataset.sh
================================================
# GNU Lesser General Public License v3.0 only
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#!/bin/bash
wget -O wili.zip "https://zenodo.org/record/841984/files/wili-2018.zip?download=1"
mkdir -p wili && cp wili.zip wili && cd wili && unzip wili.zip && cd ..
================================================
FILE: datasets/external/get_stanfordtweets.sh
================================================
# GNU Lesser General Public License v3.0 only
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#!/bin/bash
wget -O trainingandtestdata.zip http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
mkdir -p tweets_sentiment && cp trainingandtestdata.zip tweets_sentiment && cd tweets_sentiment && unzip trainingandtestdata.zip
================================================
FILE: docker/Dockerfile
================================================
FROM python:3.10-slim-buster
ENV LANG=C.UTF-8 \
LC_ALL=C.UTF-8
RUN apt-get update && \
apt-get install -y --no-install-recommends \
curl coreutils \
&& rm -rf /var/lib/apt/lists/*
# Install Poetry
ENV POETRY_VERSION=1.5.1
RUN pip install --upgrade pip
RUN python3 -m pip install "poetry==$POETRY_VERSION"
WORKDIR /home/workspace
COPY pyproject.toml ./
RUN poetry config virtualenvs.create false \
&& poetry lock \
&& poetry install --no-root --no-dev --no-interaction
COPY . /home/docker_user/workspace/
ENTRYPOINT ["poetry", "run", "nlpretext"]
================================================
FILE: docker/README.md
================================================
# Docker for nlpretext
## Installation
To build the Docker image, run:
```bash
make docker
```
which is equivalent to:
```bash
make docker VERSION=latest
```
You could also provide name and version for the image itself.
Default name is `IMAGE := nlpretext`.
Default version is `VERSION := latest`.
```bash
make docker IMAGE=some_name VERSION=1.0.4
```
## Usage
```bash
docker run -it --rm \
-v $(pwd):/workspace \
nlpretext bash
```
## How to clean up
To remove the Docker image, run `make clean_docker` with `VERSION`:
```bash
make clean_docker VERSION=1.0.4
```
As with installation, you can also choose the image name:
```bash
make clean_docker IMAGE=some_name VERSION=latest
```
If you want to clean everything, including `build`, run `make clean`.
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= poetry run sphinx-build
SPHINXAPIBUILD ?= poetry run sphinx-apidoc
SPHINXMULTIVERSION ?= poetry run sphinx-multiversion
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
.PHONY: help Makefile
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
multiversion:
@$(SPHINXMULTIVERSION) $(SOURCEDIR) $(BUILDDIR)/html
apidoc:
@$(SPHINXAPIBUILD) -f -o source/apidoc/ ../nlpretext/ --implicit-namespaces -M -t source/_templates
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set BUILDDIR=build
REM conf.py lives in the "source" directory (see SOURCEDIR in docs/Makefile)
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
if NOT "%PAPER%" == "" (
set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
)
if "%1" == "" goto help
if "%1" == "help" (
:help
echo.Please use `make ^<target^>` where ^<target^> is one of
echo. html to make standalone HTML files
echo. dirhtml to make HTML files named index.html in directories
echo. singlehtml to make a single large HTML file
echo. pickle to make pickle files
echo. json to make JSON files
echo. htmlhelp to make HTML files and a HTML help project
echo. qthelp to make HTML files and a qthelp project
echo. devhelp to make HTML files and a Devhelp project
echo. epub to make an epub
echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
echo. text to make text files
echo. man to make manual pages
echo. changes to make an overview over all changed/added/deprecated items
echo. linkcheck to check all external links for integrity
echo. doctest to run all doctests embedded in the documentation if enabled
goto end
)
if "%1" == "clean" (
for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
del /q /s %BUILDDIR%\*
goto end
)
if "%1" == "html" (
%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/html.
goto end
)
if "%1" == "dirhtml" (
%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
goto end
)
if "%1" == "singlehtml" (
%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
goto end
)
if "%1" == "pickle" (
%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
echo.
echo.Build finished; now you can process the pickle files.
goto end
)
if "%1" == "json" (
%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
echo.
echo.Build finished; now you can process the JSON files.
goto end
)
if "%1" == "htmlhelp" (
%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
echo.
echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
goto end
)
if "%1" == "qthelp" (
%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
echo.
echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
echo.^> qcollectiongenerator %BUILDDIR%\qthelp\nlpretext.qhcp
echo.To view the help file:
echo.^> assistant -collectionFile %BUILDDIR%\qthelp\nlpretext.qhc
goto end
)
if "%1" == "devhelp" (
%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
echo.
echo.Build finished.
goto end
)
if "%1" == "epub" (
%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
echo.
echo.Build finished. The epub file is in %BUILDDIR%/epub.
goto end
)
if "%1" == "latex" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
echo.
echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "text" (
%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
echo.
echo.Build finished. The text files are in %BUILDDIR%/text.
goto end
)
if "%1" == "man" (
%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
echo.
echo.Build finished. The manual pages are in %BUILDDIR%/man.
goto end
)
if "%1" == "changes" (
%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
echo.
echo.The overview file is in %BUILDDIR%/changes.
goto end
)
if "%1" == "linkcheck" (
%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
echo.
echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
goto end
)
if "%1" == "doctest" (
%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
echo.
echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
goto end
)
:end
================================================
FILE: docs/scripts/buildsite.sh
================================================
#!/bin/bash
export SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)
##############
# BUILD DOCS #
##############
# Python Sphinx, configured with source/conf.py
# See https://www.sphinx-doc.org/
cd docs/
current_tag=$(git symbolic-ref -q --short HEAD || git describe --tags --exact-match)
current_tag_message=$(git cat-file -p $(git rev-parse $(git tag -l | tail -n1)) | tail -n +6)
make clean
make apidoc
git add .
git commit -m "Commit needed for multiversioning"
git pull --tags
git tag -a latest -m "Latest version of the package"
make multiversion
#######################
# Update GitHub Pages #
#######################
docroot=`mktemp -d`
cp -r build/html/* ${docroot}
cd ..
git branch -d gh-pages
git checkout --orphan gh-pages
git rm --cached -r .
git clean -fdx
# Adds .nojekyll file to the root to signal to GitHub that
# directories that start with an underscore (_) can remain
touch .nojekyll
# Add index.html
cat > index.html <<EOF
<!DOCTYPE html>
<html>
<head>
<title>Redirecting to the latest release</title>
<meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=./latest/index.html">
<link rel="canonical" href="./latest/index.html">
</head>
</html>
EOF
# Add README
cat > README.md <<EOF
# README for the GitHub Pages Branch
This branch is simply a cache for the website and is not intended to be viewed on github.com.
EOF
# Copy the resulting html pages built from Sphinx to the gh-pages branch
cp -r ${docroot}/* .
git add .
# Make a commit with changes and any new files
msg="Updating Docs for commit ${GITHUB_SHA} made on `date -d"@${SOURCE_DATE_EPOCH}" --iso-8601=seconds` from ${GITHUB_REF} by ${GITHUB_ACTOR}"
git commit -m "${msg}"
# overwrite the contents of the gh-pages branch on our github.com repo
git push origin gh-pages --force
# exit cleanly
exit 0
================================================
FILE: docs/source/_templates/module.rst_t
================================================
{%- if show_headings %}
{{- [basename] | join(' ') | e | heading }}
{% endif -%}
.. automodule:: {{ qualname }}
{%- for option in automodule_options %}
:{{ option }}:
{%- endfor %}
================================================
FILE: docs/source/_templates/package.rst_t
================================================
{%- macro automodule(modname, options) -%}
.. automodule:: {{ modname }}
{%- for option in options %}
:{{ option }}:
{%- endfor %}
{%- endmacro %}
{%- macro toctree(docnames) -%}
.. toctree::
:maxdepth: {{ maxdepth }}
{% for docname in docnames %}
{{ docname }}
{%- endfor %}
{%- endmacro %}
{%- if is_namespace %}
{{- ["**", pkgname, "**"] | join("") | heading }}
{% else %}
{% set pkg_list = pkgname.split('.') %}
{{- ["**", pkg_list[-1], "**"] | join("") | heading }}
{% endif %}
{%- if modulefirst and not is_namespace %}
{{ automodule(pkgname, automodule_options) }}
{% endif %}
{%- if subpackages %}
{{ toctree(subpackages) }}
{% endif %}
{%- if submodules %}
{% if separatemodules %}
{{ toctree(submodules) }}
{% else %}
{%- for submodule in submodules %}
{% if show_headings %}
{% set submodule_list = submodule.split('.') %}
{{- [submodule_list[-1]] | join(" ") | e | heading(2) }}
{% endif %}
{{ automodule(submodule, automodule_options) }}
{% endfor %}
{%- endif %}
{%- endif %}
{%- if not modulefirst and not is_namespace %}
{{ automodule(pkgname, automodule_options) }}
{% endif %}
================================================
FILE: docs/source/_templates/versions.html
================================================
{%- if current_version %}
<div class="rst-versions" data-toggle="rst-versions" role="note" aria-label="versions">
<span class="rst-current-version" data-toggle="rst-current-version">
<span class="fa fa-book"> Other Versions</span>
v: {{ current_version.name }}
<span class="fa fa-caret-down"></span>
</span>
<div class="rst-other-versions">
{%- if versions.tags %}
<dl>
<dt>Tags</dt>
{%- for item in versions.tags %}
<dd><a href="{{ item.url }}">{{ item.name }}</a></dd>
{%- endfor %}
</dl>
{%- endif %}
{%- if versions.branches %}
<dl>
<dt>Branches</dt>
{%- for item in versions.branches %}
<dd><a href="{{ item.url }}">{{ item.name }}</a></dd>
{%- endfor %}
</dl>
{%- endif %}
</div>
</div>
{%- endif %}
================================================
FILE: docs/source/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath(".."))
# -- Project information -----------------------------------------------------
project = "nlpretext"
author = "artefactory"
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"sphinx.ext.mathjax",
"sphinx.ext.napoleon",
"sphinx.ext.todo",
"sphinx.ext.viewcode",
"recommonmark",
"nbsphinx",
"sphinx_multiversion",
"sphinx_autodoc_typehints",
"sphinx_rtd_theme",
]
source_suffix = {
".rst": "restructuredtext",
".txt": "restructuredtext",
".md": "markdown",
}
source_parsers = {".md": "recommonmark.parser.CommonMarkParser"}
nbsphinx_execute = "never"
github_url = "https://github.com/artefactory/NLPretext"
smv_prefer_remote_refs = False
smv_remote_whitelist = None
smv_prebuild_command = (
"poetry run sphinx-apidoc -f -o source/apidoc/ "
"../nlpretext/ "
"--implicit-namespaces -M -t source/_templates"
)
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# Autodoc parameters
always_document_param_types = True
add_module_names = False
autodoc_member_order = "bysource"
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = "sphinx_rtd_theme"
github_url = "https://www.github.com/artefactory/NLPretext"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# Font packages
"fontpkg": "\\usepackage{amsmath, amsfonts, amssymb, amsthm}"
}
================================================
FILE: docs/source/index.rst
================================================
=========
NLPretext
=========
Welcome to NLPretext's documentation!
========================================
NLPretext is a meta-library that helps you get started with text preprocessing for your NLP use case.

Installation
============

This package has been tested on Python ``3.8``, ``3.9`` and ``3.10``. It does not support Python **2.7**, which reached end of life on January 1, 2020.

To install this library from PyPI, run::

    pip install nlpretext

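
The tutorial notebook replaces email addresses with ``replace_emails`` from ``nlpretext.basic.preprocess``. As a rough illustration of what such a helper does, here is a minimal standalone sketch that reuses the ``EMAIL_REGEX`` pattern from ``nlpretext/_config/constants.py`` with only the standard library. The function name ``replace_emails_sketch`` is hypothetical and not part of the package, and the packaged implementation may differ.

```python
import re

# Pattern copied from EMAIL_REGEX in nlpretext/_config/constants.py.
EMAIL_REGEX = re.compile(
    r"(?:^|(?<=[^\w@.)]))([\w+-](\.(?!\.))?)*?[\w+-]@(?:\w-?)*?\w+(\.([a-z]{2,})){1,3}(?:$|(?=\b))",
    flags=re.IGNORECASE | re.UNICODE,
)


def replace_emails_sketch(text: str, replace_with: str = "*EMAIL*") -> str:
    """Replace every email address found in ``text`` with ``replace_with``.

    Hypothetical helper for illustration only; it is not the library's API.
    """
    # A callable replacement keeps ``replace_with`` literal, so backslashes
    # in it are not interpreted as regex group references.
    return EMAIL_REGEX.sub(lambda _match: replace_with, text)


example = "I have forwarded this email to obama@whitehouse.gov"
print(replace_emails_sketch(example))
```

With the library installed, the equivalent call shown in the tutorial is ``replace_emails(example, replace_with="*EMAIL*")``.
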
.. toctree::
:maxdepth: 4
:caption: Tutorials:
./tutorials/index
.. toctree::
:maxdepth: 2
:caption: API Reference:
./apidoc/modules
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/source/tutorials/basic_notebook.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to use the package in a notebook\n",
"\n",
"<div align=\"center\">\n",
"\n",
"<div style=\"width: 25%; min-width: 150px; padding: 20px\">\n",
"\n",
"\n",
"\n",
"</div>\n",
"\n",
"### *nlpretext*\n",
"\n",
"</div>\n",
"\n",
"## Installing from the main branch\n",
"\n",
"To install the library from the main branch, you can run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"%pip install git+ssh://git@github.com/artefactory/NLPretext.git@main"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing from a specific release\n",
"\n",
"To install the library from a specific release, you can run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"%pip install git+ssh://git@github.com/artefactory/NLPretext.git@v1.0.5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the package\n",
"\n",
"You can now import and run anything from the package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from nlpretext.basic.preprocess import replace_emails\n",
"\n",
"example = \"I have forwarded this email to obama@whitehouse.gov\"\n",
"example = replace_emails(example, replace_with=\"*EMAIL*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"print(example)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
================================================
FILE: docs/source/tutorials/index.rst
================================================
Tutorials
=========
.. toctree::
:maxdepth: 4
:glob:
basic_notebook
================================================
FILE: nlpretext/__init__.py
================================================
# GNU Lesser General Public License v3.0 only
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
# mypy: disable-error-code="attr-defined"
# mypy: disable-error-code="assignment"
"""All the go-to functions you need to handle NLP use cases, integrated in NLPretext."""
from importlib.metadata import PackageNotFoundError, version
from nlpretext.preprocessor import Preprocessor
try:
__version__ = version(__name__)
except PackageNotFoundError: # pragma: no cover
__version__ = "unknown"
__all__ = ["Preprocessor"]
================================================
FILE: nlpretext/_config/__init__.py
================================================
# GNU Lesser General Public License v3.0 only
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
================================================
FILE: nlpretext/_config/config.py
================================================
# GNU Lesser General Public License v3.0 only
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#!/usr/local/bin/python3
from typing import List, Optional
import os
import phonenumbers as _phonenumbers
ROOT_FOLDER = os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))
# Country config
COUNTRY_MAPPING_ISO = {
"af": "Afghanistan",
"ax": "Åland Islands",
"al": "Albania",
"dz": "Algeria",
"as": "American Samoa",
"ad": "Andorra",
"ao": "Angola",
"ai": "Anguilla",
"aq": "Antarctica",
"ag": "Antigua and Barbuda",
"ar": "Argentina",
"am": "Armenia",
"aw": "Aruba",
"au": "Australia",
"at": "Austria",
"az": "Azerbaijan",
"bs": "Bahamas",
"bh": "Bahrain",
"bd": "Bangladesh",
"bb": "Barbados",
"by": "Belarus",
"be": "Belgium",
"bz": "Belize",
"bj": "Benin",
"bm": "Bermuda",
"bt": "Bhutan",
"bo": "Bolivia (Plurinational State of)",
"bq": "Bonaire, Sint Eustatius and Saba",
"ba": "Bosnia and Herzegovina",
"bw": "Botswana",
"bv": "Bouvet Island",
"br": "Brazil",
"io": "British Indian Ocean Territory",
"bn": "Brunei Darussalam",
"bg": "Bulgaria",
"bf": "Burkina Faso",
"bi": "Burundi",
"cv": "Cabo Verde",
"kh": "Cambodia",
"cm": "Cameroon",
"ca": "Canada",
"ky": "Cayman Islands",
"cf": "Central African Republic",
"td": "Chad",
"cl": "Chile",
"cn": "China",
"cx": "Christmas Island",
"cc": "Cocos (Keeling) Islands",
"co": "Colombia",
"km": "Comoros",
"cg": "Congo",
"cd": "Congo, Democratic Republic of the",
"ck": "Cook Islands",
"cr": "Costa Rica",
"ci": "Côte d'Ivoire",
"hr": "Croatia",
"cu": "Cuba",
"cw": "Curaçao",
"cy": "Cyprus",
"cz": "Czechia",
"dk": "Denmark",
"dj": "Djibouti",
"dm": "Dominica",
"do": "Dominican Republic",
"ec": "Ecuador",
"eg": "Egypt",
"sv": "El Salvador",
"gq": "Equatorial Guinea",
"er": "Eritrea",
"ee": "Estonia",
"sz": "Eswatini",
"et": "Ethiopia",
"fk": "Falkland Islands (Malvinas)",
"fo": "Faroe Islands",
"fj": "Fiji",
"fi": "Finland",
"fr": "France",
"gf": "French Guiana",
"pf": "French Polynesia",
"tf": "French Southern Territories",
"ga": "Gabon",
"gm": "Gambia",
"ge": "Georgia",
"de": "Germany",
"gh": "Ghana",
"gi": "Gibraltar",
"gr": "Greece",
"gl": "Greenland",
"gd": "Grenada",
"gp": "Guadeloupe",
"gu": "Guam",
"gt": "Guatemala",
"gg": "Guernsey",
"gn": "Guinea",
"gw": "Guinea-Bissau",
"gy": "Guyana",
"ht": "Haiti",
"hm": "Heard Island and McDonald Islands",
"va": "Holy See",
"hn": "Honduras",
"hk": "Hong Kong",
"hu": "Hungary",
"is": "Iceland",
"in": "India",
"id": "Indonesia",
"ir": "Iran (Islamic Republic of)",
"iq": "Iraq",
"ie": "Ireland",
"im": "Isle of Man",
"il": "Israel",
"it": "Italy",
"jm": "Jamaica",
"jp": "Japan",
"je": "Jersey",
"jo": "Jordan",
"kz": "Kazakhstan",
"ke": "Kenya",
"ki": "Kiribati",
"kp": "Korea (Democratic People's Republic of)",
"kr": "Korea, Republic of",
"kw": "Kuwait",
"kg": "Kyrgyzstan",
"la": "Lao People's Democratic Republic",
"lv": "Latvia",
"lb": "Lebanon",
"ls": "Lesotho",
"lr": "Liberia",
"ly": "Libya",
"li": "Liechtenstein",
"lt": "Lithuania",
"lu": "Luxembourg",
"mo": "Macao",
"mg": "Madagascar",
"mw": "Malawi",
"my": "Malaysia",
"mv": "Maldives",
"ml": "Mali",
"mt": "Malta",
"mh": "Marshall Islands",
"mq": "Martinique",
"mr": "Mauritania",
"mu": "Mauritius",
"yt": "Mayotte",
"mx": "Mexico",
"fm": "Micronesia (Federated States of)",
"md": "Moldova, Republic of",
"mc": "Monaco",
"mn": "Mongolia",
"me": "Montenegro",
"ms": "Montserrat",
"ma": "Morocco",
"mz": "Mozambique",
"mm": "Myanmar",
"na": "Namibia",
"nr": "Nauru",
"np": "Nepal",
"nl": "Netherlands",
"nc": "New Caledonia",
"nz": "New Zealand",
"ni": "Nicaragua",
"ne": "Niger",
"ng": "Nigeria",
"nu": "Niue",
"nf": "Norfolk Island",
"mk": "North Macedonia",
"mp": "Northern Mariana Islands",
"no": "Norway",
"om": "Oman",
"pk": "Pakistan",
"pw": "Palau",
"ps": "Palestine, State of",
"pa": "Panama",
"pg": "Papua New Guinea",
"py": "Paraguay",
"pe": "Peru",
"ph": "Philippines",
"pn": "Pitcairn",
"pl": "Poland",
"pt": "Portugal",
"pr": "Puerto Rico",
"qa": "Qatar",
"re": "Réunion",
"ro": "Romania",
"ru": "Russian Federation",
"rw": "Rwanda",
"bl": "Saint Barthélemy",
"sh": "Saint Helena, Ascension and Tristan da Cunha",
"kn": "Saint Kitts and Nevis",
"lc": "Saint Lucia",
"mf": "Saint Martin (French part)",
"pm": "Saint Pierre and Miquelon",
"vc": "Saint Vincent and the Grenadines",
"ws": "Samoa",
"sm": "San Marino",
"st": "Sao Tome and Principe",
"sa": "Saudi Arabia",
"sn": "Senegal",
"rs": "Serbia",
"sc": "Seychelles",
"sl": "Sierra Leone",
"sg": "Singapore",
"sx": "Sint Maarten (Dutch part)",
"sk": "Slovakia",
"si": "Slovenia",
"sb": "Solomon Islands",
"so": "Somalia",
"za": "South Africa",
"gs": "South Georgia and the South Sandwich Islands",
"ss": "South Sudan",
"es": "Spain",
"lk": "Sri Lanka",
"sd": "Sudan",
"sr": "Suriname",
"sj": "Svalbard and Jan Mayen",
"se": "Sweden",
"ch": "Switzerland",
"sy": "Syrian Arab Republic",
"tw": "Taiwan, Province of China",
"tj": "Tajikistan",
"tz": "Tanzania, United Republic of",
"th": "Thailand",
"tl": "Timor-Leste",
"tg": "Togo",
"tk": "Tokelau",
"to": "Tonga",
"tt": "Trinidad and Tobago",
"tn": "Tunisia",
"tr": "Turkey",
"tm": "Turkmenistan",
"tc": "Turks and Caicos Islands",
"tv": "Tuvalu",
"ug": "Uganda",
"ua": "Ukraine",
"ae": "United Arab Emirates",
"gb": "United Kingdom of Great Britain and Northern Ireland",
"us": "United States of America",
"um": "United States Minor Outlying Islands",
"uy": "Uruguay",
"uz": "Uzbekistan",
"vu": "Vanuatu",
"ve": "Venezuela (Bolivarian Republic of)",
"vn": "Viet Nam",
"vg": "Virgin Islands (British)",
"vi": "Virgin Islands (U.S.)",
"wf": "Wallis and Futuna",
"eh": "Western Sahara",
"ye": "Yemen",
"zm": "Zambia",
"zw": "Zimbabwe",
}
# Phone numbers config
SUPPORTED_COUNTRY: List[Optional[str]] = [
None,
"US",
"AG",
"AI",
"AS",
"BB",
"BM",
"BS",
"CA",
"DM",
"GD",
"GU",
"JM",
"KN",
"KY",
"LC",
"MP",
"MS",
"PR",
"SX",
"TC",
"TT",
"VC",
"VG",
"VI",
"RU",
"KZ",
"EG",
"ZA",
"GR",
"NL",
"BE",
"FR",
"ES",
"HU",
"IT",
"VA",
"RO",
"CH",
"AT",
"GB",
"GG",
"IM",
"JE",
"DK",
"SE",
"NO",
"SJ",
"PL",
"DE",
"PE",
"MX",
"CU",
"AR",
"BR",
"CL",
"CO",
"VE",
"MY",
"AU",
"CC",
"CX",
"ID",
"PH",
"NZ",
"SG",
"TH",
"JP",
"KR",
"VN",
"CN",
"TR",
"IN",
"PK",
"AF",
"LK",
"MM",
"IR",
"SS",
"MA",
"EH",
"DZ",
"TN",
"LY",
"GM",
"SN",
"MR",
"ML",
"GN",
"CI",
"BF",
"NE",
"TG",
"BJ",
"MU",
"LR",
"SL",
"GH",
"NG",
"TD",
"CF",
"CM",
"CV",
"ST",
"GQ",
"GA",
"CG",
"CD",
"AO",
"GW",
"IO",
"AC",
"SC",
"SD",
"RW",
"ET",
"SO",
"DJ",
"KE",
"TZ",
"UG",
"BI",
"MZ",
"ZM",
"MG",
"RE",
"YT",
"ZW",
"NA",
"MW",
"LS",
"BW",
"SZ",
"KM",
"SH",
"TA",
"ER",
"AW",
"FO",
"GL",
"GI",
"PT",
"LU",
"IE",
"IS",
"AL",
"MT",
"CY",
"FI",
"AX",
"BG",
"LT",
"LV",
"EE",
"MD",
"AM",
"BY",
"AD",
"MC",
"SM",
"UA",
"RS",
"ME",
"XK",
"HR",
"SI",
"BA",
"MK",
"CZ",
"SK",
"LI",
"FK",
"BZ",
"GT",
"SV",
"HN",
"NI",
"CR",
"PA",
"PM",
"HT",
"GP",
"BL",
"MF",
"BO",
"GY",
"EC",
"GF",
"PY",
"MQ",
"SR",
"UY",
"CW",
"BQ",
"TL",
"NF",
"BN",
"NR",
"PG",
"TO",
"SB",
"VU",
"FJ",
"PW",
"WF",
"CK",
"NU",
"WS",
"KI",
"NC",
"TV",
"PF",
"TK",
"FM",
"MH",
"KP",
"HK",
"MO",
"KH",
"LA",
"BD",
"TW",
"MV",
"LB",
"JO",
"SY",
"IQ",
"KW",
"SA",
"YE",
"OM",
"PS",
"AE",
"IL",
"BH",
"QA",
"BT",
"MN",
"NP",
"TJ",
"TM",
"AZ",
"GE",
"KG",
"UZ",
"DO",
]
FORMAT_NUMBERS = {
"E164": _phonenumbers.PhoneNumberFormat.E164,
"INTERNATIONAL": _phonenumbers.PhoneNumberFormat.INTERNATIONAL,
"NATIONAL": _phonenumbers.PhoneNumberFormat.NATIONAL,
"RFC3966": _phonenumbers.PhoneNumberFormat.RFC3966,
}
================================================
FILE: nlpretext/_config/constants.py
================================================
# Copyright (C) 2020 Artefact
# licence-information@artefact.com
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
# mypy: disable-error-code="attr-defined"
"""
Collection of regular expressions and other (small, generally useful) constants.
Credits to textacy for some of them: https://github.com/chartbeat-labs/textacy.
"""
import re
import sys
import unicodedata
import regex
NUMERIC_NE_TYPES = {
"ORDINAL",
"CARDINAL",
"MONEY",
"QUANTITY",
"PERCENT",
"TIME",
"DATE",
}
SUBJ_DEPS = {"agent", "csubj", "csubjpass", "expl", "nsubj", "nsubjpass"}
OBJ_DEPS = {"attr", "dobj", "dative", "oprd"}
AUX_DEPS = {"aux", "auxpass", "neg"}
REPORTING_VERBS = {
"according",
"accuse",
"acknowledge",
"add",
"admit",
"agree",
"allege",
"announce",
"argue",
"ask",
"assert",
"believe",
"blame",
"charge",
"cite",
"claim",
"complain",
"concede",
"conclude",
"confirm",
"contend",
"criticize",
"declare",
"decline",
"deny",
"describe",
"disagree",
"disclose",
"estimate",
"explain",
"fear",
"hope",
"insist",
"maintain",
"mention",
"note",
"observe",
"order",
"predict",
"promise",
"recall",
"recommend",
"reply",
"report",
"say",
"state",
"stress",
"suggest",
"tell",
"testify",
"think",
"urge",
"warn",
"worry",
"write",
}
CURRENCIES = {
"$": "USD",
"zł": "PLN",
"£": "GBP",
"¥": "JPY",
"฿": "THB",
"₡": "CRC",
"₦": "NGN",
"₩": "KRW",
"₪": "ILS",
"₫": "VND",
"€": "EUR",
"₱": "PHP",
"₲": "PYG",
"₴": "UAH",
"₹": "INR",
}
POS_REGEX_PATTERNS = {
"en": {
"NP": r"<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+",
"PP": r"<ADP> <DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN> <PART>?)+",
"VP": r"<AUX>* <ADV>* <VERB>",
}
}
PUNCT_TRANSLATE_UNICODE = dict.fromkeys(
(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith("P")),
" ",
)
ACRONYM_REGEX = re.compile(
r"(?:^|(?<=\W))(?:(?:(?:(?:[A-Z]\.?)+[a-z0-9&/-]?)+(?:[A-Z][s.]?|[0-9]s?))|(?:[0-9](?:\-?[A-Z])+))(?:$|(?=\W))",
flags=re.UNICODE,
)
EMAIL_REGEX = re.compile(
r"(?:^|(?<=[^\w@.)]))([\w+-](\.(?!\.))?)*?[\w+-]@(?:\w-?)*?\w+(\.([a-z]{2,})){1,3}(?:$|(?=\b))",
flags=re.IGNORECASE | re.UNICODE,
)
PHONE_REGEX = re.compile(
r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{3}\)?[ .-]?)?(\d{3}[ .-]?\d{4})(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))" # noqa: E501
)
NUMBERS_REGEX = re.compile(
r"(?:^|(?<=[^\w,.]))[+–-]?(([1-9]\d{0,2}(,\d{3})+(\.\d*)?)|([1-9]\d{0,2}([ .]\d{3})+(,\d*)?)|"
# trailing guard: end of string or a word boundary (an empty alternative would always match)
r"(\d*?[.,]\d+)|\d+)(?:$|(?=\b))"
)
CURRENCY_REGEX = re.compile("({})+".format("|".join(re.escape(c) for c in CURRENCIES)))
LINEBREAK_REGEX = re.compile(r"((\r\n)|[\n\v])+")
NONBREAKING_SPACE_REGEX = re.compile(r"(?!\n)\s+")
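URL_REGEX = re.compile(
# leading guard: start of string or not preceded by a word character, "/" or "."
# (same guard as SHORT_URL_REGEX; an empty alternative would always match)
r"(?:^|(?<![\w/.]))"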
# protocol identifier
# r"(?:(?:https?|ftp)://)" <-- alt?
r"(?:(?:https?://|mailto:|ftp://|www\d{0,3}\.))"
# user:pass authentication
r"(?:\S+(?::\S*)?@)?" r"(?:"
# IP address exclusion
# private & local networks
r"(?!(?:10|127)(?:\.\d{1,3}){3})"
r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
# IP address dotted notation octets
# excludes loopback network 0.0.0.0
# excludes reserved space >= 224.0.0.0
# excludes network & broadcast addresses
# (first & last IP address of each class)
r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
r"|"
# host name
r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
# domain name
r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
# TLD identifier
r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))" r")"
# port number
r"(?::\d{2,5})?"
# resource path
r"(?:/\S*)?" r"(?:$|(?![\w?!+&/]))",
flags=re.UNICODE | re.IGNORECASE,
) # source: https://gist.github.com/dperini/729294
SHORT_URL_REGEX = re.compile(
r"(?:^|(?<![\w/.]))"
# optional scheme
r"(?:(?:https?://)?)"
# domain
r"(?:\w-?)*?\w+(?:\.[a-z]{2,12}){1,3}" r"/"
# hash
r"[^\s.,?!'\"|+]{2,12}" r"(?:$|(?![\w?!+&/]))",
flags=re.IGNORECASE,
)
# regexes for cleaning up crufty terms
DANGLING_PARENS_TERM_RE = re.compile(
r"(?:\s|^)(\()\s{1,2}(.*?)\s{1,2}(\))(?:\s|$)", flags=re.UNICODE
)
LEAD_TAIL_CRUFT_TERM_RE = re.compile(r"^([^\w(-] ?)+|([^\w).!?] ?)+$", flags=re.UNICODE)
LEAD_HYPHEN_TERM_RE = re.compile(r"^-([^\W\d_])", flags=re.UNICODE)
NEG_DIGIT_TERM_RE = re.compile(r"(-) (\d)", flags=re.UNICODE)
WEIRD_HYPHEN_SPACE_TERM_RE = re.compile(r"(?<=[^\W\d]) (-[^\W\d])", flags=re.UNICODE)
WEIRD_APOSTR_SPACE_TERM_RE = re.compile(r"([^\W\d]+) ('[a-z]{1,2}\b)", flags=re.UNICODE)
# Matches anything that is neither a Latin-script character nor an ASCII digit (0 included)
LATIN_CHARACTERS_RE = regex.compile(r"[^\p{Latin}0-9]")
# ENGLISH CONTRACTIONS
CONTRACTION_NT_NOT = re.compile(
r"(\b)(are|could|did|does|do|had|has|have|is|might|must|should|were|would)n't", re.IGNORECASE
)
CONTRACTION_LL_WILL = re.compile(r"(\b)(he|i|she|they|we|what|who|you)'ll", re.IGNORECASE)
CONTRACTION_RE_ARE = re.compile(r"(\b)(they|we|what|who|you)'re", re.IGNORECASE)
CONTRACTION_VE_HAVE = re.compile(r"(\b)(i|should|they|we|what|who|would|you)'ve", re.IGNORECASE)
CONTRACTION_CANT_CANNOT = re.compile(r"(\b)(ca)n't", re.IGNORECASE)
CONTRACTION_M_AM = re.compile(r"(\b)(i)'m", re.IGNORECASE)
CONTRACTION_LET_LETUS = re.compile(r"(\b)(let)'s", re.IGNORECASE)
CONTRACTION_WONT_WILLNOT = re.compile(r"(\b)(w)on't", re.IGNORECASE)
CONTRACTION_SHANT_SHALLNOT = re.compile(r"(\b)(s)han't", re.IGNORECASE)
CONTRACTION_YALL_YOUALL = re.compile(r"(\b)(y)(?:'all|a'll)", re.IGNORECASE)
# SOCIAL DATA
HASHTAG_PATTERN = re.compile(r"#\w*")
AT_PATTERN = re.compile(r"@\w*")
HTML_TAG_PATTERN = re.compile(r"<.*?>")
# TEXT LOADER
TEXT_FILE_FORMATS_PATTERN = re.compile(r"^.*\.(json|csv|txt|parquet)(\.gz|\.zip)*$")
================================================
FILE: nlpretext/_config/stopwords.py
================================================
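# Stopword lists keyed by two-letter ISO 639-1 language code
# (for example "af" = Afrikaans, "de" = German, "fr" = French).
STOPWORDS = {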
"af": [
"'n",
"aan",
"af",
"al",
"as",
"baie",
"by",
"daar",
"dag",
"dat",
"die",
"dit",
"een",
"ek",
"en",
"gaan",
"gesê",
"haar",
"het",
"hom",
"hulle",
"hy",
"in",
"is",
"jou",
"jy",
"kan",
"kom",
"ma",
"maar",
"met",
"my",
"na",
"nie",
"om",
"ons",
"op",
"saam",
"sal",
"se",
"sien",
"so",
"sy",
"te",
"toe",
"uit",
"van",
"vir",
"was",
"wat",
"ʼn",
],
"ha": [
"a",
"amma",
"ba",
"ban",
"ce",
"cikin",
"da",
"don",
"ga",
"in",
"ina",
"ita",
"ji",
"ka",
"ko",
"kuma",
"lokacin",
"ma",
"mai",
"na",
"ne",
"ni",
"sai",
"shi",
"su",
"suka",
"sun",
"ta",
"tafi",
"take",
"tana",
"wani",
"wannan",
"wata",
"ya",
"yake",
"yana",
"yi",
"za",
],
"so": [
"aad",
"albaabkii",
"atabo",
"ay",
"ayaa",
"ayee",
"ayuu",
"dhan",
"hadana",
"in",
"inuu",
"isku",
"jiray",
"jirtay",
"ka",
"kale",
"kasoo",
"ku",
"kuu",
"lakin",
"markii",
"oo",
"si",
"soo",
"uga",
"ugu",
"uu",
"waa",
"waxa",
"waxuu",
],
"st": [
"a",
"ba",
"bane",
"bona",
"e",
"ea",
"eaba",
"empa",
"ena",
"ha",
"hae",
"hape",
"ho",
"hore",
"ka",
"ke",
"la",
"le",
"li",
"me",
"mo",
"moo",
"ne",
"o",
"oa",
"re",
"sa",
"se",
"tloha",
"tsa",
"tse",
],
"sw": [
"akasema",
"alikuwa",
"alisema",
"baada",
"basi",
"bila",
"cha",
"chini",
"hadi",
"hapo",
"hata",
"hivyo",
"hiyo",
"huku",
"huo",
"ili",
"ilikuwa",
"juu",
"kama",
"karibu",
"katika",
"kila",
"kima",
"kisha",
"kubwa",
"kutoka",
"kuwa",
"kwa",
"kwamba",
"kwenda",
"kwenye",
"la",
"lakini",
"mara",
"mdogo",
"mimi",
"mkubwa",
"mmoja",
"moja",
"muda",
"mwenye",
"na",
"naye",
"ndani",
"ng",
"ni",
"nini",
"nonkungu",
"pamoja",
"pia",
"sana",
"sasa",
"sauti",
"tafadhali",
"tena",
"tu",
"vile",
"wa",
"wakati",
"wake",
"walikuwa",
"wao",
"watu",
"wengine",
"wote",
"ya",
"yake",
"yangu",
"yao",
"yeye",
"yule",
"za",
"zaidi",
"zake",
],
"yo": [
"a",
"an",
"bá",
"bí",
"bẹ̀rẹ̀",
"fún",
"fẹ́",
"gbogbo",
"inú",
"jù",
"jẹ",
"jẹ́",
"kan",
"kì",
"kí",
"kò",
"láti",
"lè",
"lọ",
"mi",
"mo",
"máa",
"mọ̀",
"ni",
"náà",
"ní",
"nígbà",
"nítorí",
"nǹkan",
"o",
"padà",
"pé",
"púpọ̀",
"pẹ̀lú",
"rẹ̀",
"sì",
"sí",
"sínú",
"ṣ",
"ti",
"tí",
"wà",
"wá",
"wọn",
"wọ́n",
"yìí",
"àti",
"àwọn",
"é",
"í",
"òun",
"ó",
"ń",
"ńlá",
"ṣe",
"ṣé",
"ṣùgbọ́n",
"ẹmọ́",
"ọjọ́",
"ọ̀pọ̀lọpọ̀",
],
"zu": [
"futhi",
"kahle",
"kakhulu",
"kanye",
"khona",
"kodwa",
"kungani",
"kusho",
"la",
"lakhe",
"lapho",
"mina",
"ngesikhathi",
"nje",
"phansi",
"phezulu",
"u",
"ukuba",
"ukuthi",
"ukuze",
"uma",
"wahamba",
"wakhe",
"wami",
"wase",
"wathi",
"yakhe",
"zakhe",
"zonke",
],
"da": [
"af",
"alle",
"andet",
"andre",
"at",
"begge",
"da",
"de",
"den",
"denne",
"der",
"deres",
"det",
"dette",
"dig",
"din",
"dog",
"du",
"ej",
"eller",
"en",
"end",
"ene",
"eneste",
"enhver",
"et",
"fem",
"fire",
"flere",
"fleste",
"for",
"fordi",
"forrige",
"fra",
"få",
"før",
"god",
"han",
"hans",
"har",
"hendes",
"her",
"hun",
"hvad",
"hvem",
"hver",
"hvilken",
"hvis",
"hvor",
"hvordan",
"hvorfor",
"hvornår",
"i",
"ikke",
"ind",
"ingen",
"intet",
"jeg",
"jeres",
"kan",
"kom",
"kommer",
"lav",
"lidt",
"lille",
"man",
"mand",
"mange",
"med",
"meget",
"men",
"mens",
"mere",
"mig",
"ned",
"ni",
"nogen",
"noget",
"ny",
"nyt",
"nær",
"næste",
"næsten",
"og",
"op",
"otte",
"over",
"på",
"se",
"seks",
"ses",
"som",
"stor",
"store",
"syv",
"ti",
"til",
"to",
"tre",
"ud",
"var",
],
"de": [
"Ernst",
"Ordnung",
"Schluss",
"a",
"ab",
"aber",
"ach",
"acht",
"achte",
"achten",
"achter",
"achtes",
"ag",
"alle",
"allein",
"allem",
"allen",
"aller",
"allerdings",
"alles",
"allgemeinen",
"als",
"also",
"am",
"an",
"andere",
"anderen",
"andern",
"anders",
"au",
"auch",
"auf",
"aus",
"ausser",
"ausserdem",
"außer",
"außerdem",
"b",
"bald",
"bei",
"beide",
"beiden",
"beim",
"beispiel",
"bekannt",
"bereits",
"besonders",
"besser",
"besten",
"bin",
"bis",
"bisher",
"bist",
"c",
"d",
"d.h",
"da",
"dabei",
"dadurch",
"dafür",
"dagegen",
"daher",
"dahin",
"dahinter",
"damals",
"damit",
"danach",
"daneben",
"dank",
"dann",
"daran",
"darauf",
"daraus",
"darf",
"darfst",
"darin",
"darum",
"darunter",
"darüber",
"das",
"dasein",
"daselbst",
"dass",
"dasselbe",
"davon",
"davor",
"dazu",
"dazwischen",
"daß",
"dein",
"deine",
"deinem",
"deiner",
"dem",
"dementsprechend",
"demgegenüber",
"demgemäss",
"demgemäß",
"demselben",
"demzufolge",
"den",
"denen",
"denn",
"denselben",
"der",
"deren",
"derjenige",
"derjenigen",
"dermassen",
"dermaßen",
"derselbe",
"derselben",
"des",
"deshalb",
"desselben",
"dessen",
"deswegen",
"dich",
"die",
"diejenige",
"diejenigen",
"dies",
"diese",
"dieselbe",
"dieselben",
"diesem",
"diesen",
"dieser",
"dieses",
"dir",
"doch",
"dort",
"drei",
"drin",
"dritte",
"dritten",
"dritter",
"drittes",
"du",
"durch",
"durchaus",
"durfte",
"durften",
"dürfen",
"dürft",
"e",
"eben",
"ebenso",
"ehrlich",
"ei",
"ei,",
"eigen",
"eigene",
"eigenen",
"eigener",
"eigenes",
"ein",
"einander",
"eine",
"einem",
"einen",
"einer",
"eines",
"einige",
"einigen",
"einiger",
"einiges",
"einmal",
"eins",
"elf",
"en",
"ende",
"endlich",
"entweder",
"er",
"erst",
"erste",
"ersten",
"erster",
"erstes",
"es",
"etwa",
"etwas",
"euch",
"euer",
"eure",
"f",
"folgende",
"früher",
"fünf",
"fünfte",
"fünften",
"fünfter",
"fünftes",
"für",
"g",
"gab",
"ganz",
"ganze",
"ganzen",
"ganzer",
"ganzes",
"gar",
"gedurft",
"gegen",
"gegenüber",
"gehabt",
"gehen",
"geht",
"gekannt",
"gekonnt",
"gemacht",
"gemocht",
"gemusst",
"genug",
"gerade",
"gern",
"gesagt",
"geschweige",
"gewesen",
"gewollt",
"geworden",
"gibt",
"ging",
"gleich",
"gott",
"gross",
"grosse",
"grossen",
"grosser",
"grosses",
"groß",
"große",
"großen",
"großer",
"großes",
"gut",
"gute",
"guter",
"gutes",
"h",
"habe",
"haben",
"habt",
"hast",
"hat",
"hatte",
"hatten",
"hattest",
"hattet",
"heisst",
"her",
"heute",
"hier",
"hin",
"hinter",
"hoch",
"hätte",
"hätten",
"i",
"ich",
"ihm",
"ihn",
"ihnen",
"ihr",
"ihre",
"ihrem",
"ihren",
"ihrer",
"ihres",
"im",
"immer",
"in",
"indem",
"infolgedessen",
"ins",
"irgend",
"ist",
"j",
"ja",
"jahr",
"jahre",
"jahren",
"je",
"jede",
"jedem",
"jeden",
"jeder",
"jedermann",
"jedermanns",
"jedes",
"jedoch",
"jemand",
"jemandem",
"jemanden",
"jene",
"jenem",
"jenen",
"jener",
"jenes",
"jetzt",
"k",
"kam",
"kann",
"kannst",
"kaum",
"kein",
"keine",
"keinem",
"keinen",
"keiner",
"kleine",
"kleinen",
"kleiner",
"kleines",
"kommen",
"kommt",
"konnte",
"konnten",
"kurz",
"können",
"könnt",
"könnte",
"l",
"lang",
"lange",
"leicht",
"leide",
"lieber",
"los",
"m",
"machen",
"macht",
"machte",
"mag",
"magst",
"mahn",
"mal",
"man",
"manche",
"manchem",
"manchen",
"mancher",
"manches",
"mann",
"mehr",
"mein",
"meine",
"meinem",
"meinen",
"meiner",
"meines",
"mensch",
"menschen",
"mich",
"mir",
"mit",
"mittel",
"mochte",
"mochten",
"morgen",
"muss",
"musst",
"musste",
"mussten",
"muß",
"mußt",
"möchte",
"mögen",
"möglich",
"mögt",
"müssen",
"müsst",
"müßt",
"n",
"na",
"nach",
"nachdem",
"nahm",
"natürlich",
"neben",
"nein",
"neue",
"neuen",
"neun",
"neunte",
"neunten",
"neunter",
"neuntes",
"nicht",
"nichts",
"nie",
"niemand",
"niemandem",
"niemanden",
"noch",
"nun",
"nur",
"o",
"ob",
"oben",
"oder",
"offen",
"oft",
"ohne",
"p",
"q",
"r",
"recht",
"rechte",
"rechten",
"rechter",
"rechtes",
"richtig",
"rund",
"s",
"sa",
"sache",
"sagt",
"sagte",
"sah",
"satt",
"schlecht",
"schon",
"sechs",
"sechste",
"sechsten",
"sechster",
"sechstes",
"sehr",
"sei",
"seid",
"seien",
"sein",
"seine",
"seinem",
"seinen",
"seiner",
"seines",
"seit",
"seitdem",
"selbst",
"sich",
"sie",
"sieben",
"siebente",
"siebenten",
"siebenter",
"siebentes",
"sind",
"so",
"solang",
"solche",
"solchem",
"solchen",
"solcher",
"solches",
"soll",
"sollen",
"sollst",
"sollt",
"sollte",
"sollten",
"sondern",
"sonst",
"soweit",
"sowie",
"später",
"startseite",
"statt",
"steht",
"suche",
"t",
"tag",
"tage",
"tagen",
"tat",
"teil",
"tel",
"tritt",
"trotzdem",
"tun",
"u",
"uhr",
"um",
"und",
"und?",
"uns",
"unser",
"unsere",
"unserer",
"unter",
"v",
"vergangenen",
"viel",
"viele",
"vielem",
"vielen",
"vielleicht",
"vier",
"vierte",
"vierten",
"vierter",
"viertes",
"vom",
"von",
"vor",
"w",
"wahr?",
"wann",
"war",
"waren",
"wart",
"warum",
"was",
"wegen",
"weil",
"weit",
"weiter",
"weitere",
"weiteren",
"weiteres",
"welche",
"welchem",
"welchen",
"welcher",
"welches",
"wem",
"wen",
"wenig",
"wenige",
"weniger",
"weniges",
"wenigstens",
"wenn",
"wer",
"werde",
"werden",
"werdet",
"weshalb",
"wessen",
"wie",
"wieder",
"wieso",
"will",
"willst",
"wir",
"wird",
"wirklich",
"wirst",
"wissen",
"wo",
"wohl",
"wollen",
"wollt",
"wollte",
"wollten",
"worden",
"wurde",
"wurden",
"während",
"währenddem",
"währenddessen",
"wäre",
"würde",
"würden",
"x",
"y",
"z",
"z.b",
"zehn",
"zehnte",
"zehnten",
"zehnter",
"zehntes",
"zeit",
"zu",
"zuerst",
"zugleich",
"zum",
"zunächst",
"zur",
"zurück",
"zusammen",
"zwanzig",
"zwar",
"zwei",
"zweite",
"zweiten",
"zweiter",
"zweites",
"zwischen",
"zwölf",
"über",
"überhaupt",
"übrigens",
],
"es": [
"a",
"actualmente",
"acuerdo",
"adelante",
"ademas",
"además",
"adrede",
"afirmó",
"agregó",
"ahi",
"ahora",
"ahí",
"al",
"algo",
"alguna",
"algunas",
"alguno",
"algunos",
"algún",
"alli",
"allí",
"alrededor",
"ambos",
"ampleamos",
"antano",
"antaño",
"ante",
"anterior",
"antes",
"apenas",
"aproximadamente",
"aquel",
"aquella",
"aquellas",
"aquello",
"aquellos",
"aqui",
"aquél",
"aquélla",
"aquéllas",
"aquéllos",
"aquí",
"arriba",
"arribaabajo",
"aseguró",
"asi",
"así",
"atras",
"aun",
"aunque",
"ayer",
"añadió",
"aún",
"b",
"bajo",
"bastante",
"bien",
"breve",
"buen",
"buena",
"buenas",
"bueno",
"buenos",
"c",
"cada",
"casi",
"cerca",
"cierta",
"ciertas",
"cierto",
"ciertos",
"cinco",
"claro",
"comentó",
"como",
"con",
"conmigo",
"conocer",
"conseguimos",
"conseguir",
"considera",
"consideró",
"consigo",
"consigue",
"consiguen",
"consigues",
"contigo",
"contra",
"cosas",
"creo",
"cual",
"cuales",
"cualquier",
"cuando",
"cuanta",
"cuantas",
"cuanto",
"cuantos",
"cuatro",
"cuenta",
"cuál",
"cuáles",
"cuándo",
"cuánta",
"cuántas",
"cuánto",
"cuántos",
"cómo",
"d",
"da",
"dado",
"dan",
"dar",
"de",
"debajo",
"debe",
"deben",
"debido",
"decir",
"dejó",
"del",
"delante",
"demasiado",
"demás",
"dentro",
"deprisa",
"desde",
"despacio",
"despues",
"después",
"detras",
"detrás",
"dia",
"dias",
"dice",
"dicen",
"dicho",
"dieron",
"diferente",
"diferentes",
"dijeron",
"dijo",
"dio",
"donde",
"dos",
"durante",
"día",
"días",
"dónde",
"e",
"ejemplo",
"el",
"ella",
"ellas",
"ello",
"ellos",
"embargo",
"empleais",
"emplean",
"emplear",
"empleas",
"empleo",
"en",
"encima",
"encuentra",
"enfrente",
"enseguida",
"entonces",
"entre",
"era",
"eramos",
"eran",
"eras",
"eres",
"es",
"esa",
"esas",
"ese",
"eso",
"esos",
"esta",
"estaba",
"estaban",
"estado",
"estados",
"estais",
"estamos",
"estan",
"estar",
"estará",
"estas",
"este",
"esto",
"estos",
"estoy",
"estuvo",
"está",
"están",
"ex",
"excepto",
"existe",
"existen",
"explicó",
"expresó",
"f",
"fin",
"final",
"fue",
"fuera",
"fueron",
"fui",
"fuimos",
"g",
"general",
"gran",
"grandes",
"gueno",
"h",
"ha",
"haber",
"habia",
"habla",
"hablan",
"habrá",
"había",
"habían",
"hace",
"haceis",
"hacemos",
"hacen",
"hacer",
"hacerlo",
"haces",
"hacia",
"haciendo",
"hago",
"han",
"hasta",
"hay",
"haya",
"he",
"hecho",
"hemos",
"hicieron",
"hizo",
"horas",
"hoy",
"hubo",
"i",
"igual",
"incluso",
"indicó",
"informo",
"informó",
"intenta",
"intentais",
"intentamos",
"intentan",
"intentar",
"intentas",
"intento",
"ir",
"j",
"junto",
"k",
"l",
"la",
"lado",
"largo",
"las",
"le",
"lejos",
"les",
"llegó",
"lleva",
"llevar",
"lo",
"los",
"luego",
"lugar",
"m",
"mal",
"manera",
"manifestó",
"mas",
"mayor",
"me",
"mediante",
"medio",
"mejor",
"mencionó",
"menos",
"menudo",
"mi",
"mia",
"mias",
"mientras",
"mio",
"mios",
"mis",
"misma",
"mismas",
"mismo",
"mismos",
"modo",
"momento",
"mucha",
"muchas",
"mucho",
"muchos",
"muy",
"más",
"mí",
"mía",
"mías",
"mío",
"míos",
"n",
"nada",
"nadie",
"ni",
"ninguna",
"ningunas",
"ninguno",
"ningunos",
"ningún",
"no",
"nos",
"nosotras",
"nosotros",
"nuestra",
"nuestras",
"nuestro",
"nuestros",
"nueva",
"nuevas",
"nuevo",
"nuevos",
"nunca",
"o",
"ocho",
"os",
"otra",
"otras",
"otro",
"otros",
"p",
"pais",
"para",
"parece",
"parte",
"partir",
"pasada",
"pasado",
"paìs",
"peor",
"pero",
"pesar",
"poca",
"pocas",
"poco",
"pocos",
"podeis",
"podemos",
"poder",
"podria",
"podriais",
"podriamos",
"podrian",
"podrias",
"podrá",
"podrán",
"podría",
"podrían",
"poner",
"por",
"porque",
"posible",
"primer",
"primera",
"primero",
"primeros",
"principalmente",
"pronto",
"propia",
"propias",
"propio",
"propios",
"proximo",
"próximo",
"próximos",
"pudo",
"pueda",
"puede",
"pueden",
"puedo",
"pues",
"q",
"qeu",
"que",
"quedó",
"queremos",
"quien",
"quienes",
"quiere",
"quiza",
"quizas",
"quizá",
"quizás",
"quién",
"quiénes",
"qué",
"r",
"raras",
"realizado",
"realizar",
"realizó",
"repente",
"respecto",
"s",
"sabe",
"sabeis",
"sabemos",
"saben",
"saber",
"sabes",
"salvo",
"se",
"sea",
"sean",
"segun",
"segunda",
"segundo",
"según",
"seis",
"ser",
"sera",
"será",
"serán",
"sería",
"señaló",
"si",
"sido",
"siempre",
"siendo",
"siete",
"sigue",
"siguiente",
"sin",
"sino",
"sobre",
"sois",
"sola",
"solamente",
"solas",
"solo",
"solos",
"somos",
"son",
"soy",
"soyos",
"su",
"supuesto",
"sus",
"suya",
"suyas",
"suyo",
"sé",
"sí",
"sólo",
"t",
"tal",
"tambien",
"también",
"tampoco",
"tan",
"tanto",
"tarde",
"te",
"temprano",
"tendrá",
"tendrán",
"teneis",
"tenemos",
"tener",
"tenga",
"tengo",
"tenido",
"tenía",
"tercera",
"ti",
"tiempo",
"tiene",
"tienen",
"toda",
"todas",
"todavia",
"todavía",
"todo",
"todos",
"total",
"trabaja",
"trabajais",
"trabajamos",
"trabajan",
"trabajar",
"trabajas",
"trabajo",
"tras",
"trata",
"través",
"tres",
"tu",
"tus",
"tuvo",
"tuya",
"tuyas",
"tuyo",
"tuyos",
"tú",
"u",
"ultimo",
"un",
"una",
"unas",
"uno",
"unos",
"usa",
"usais",
"usamos",
"usan",
"usar",
"usas",
"uso",
"usted",
"ustedes",
"v",
"va",
"vais",
"valor",
"vamos",
"van",
"varias",
"varios",
"vaya",
"veces",
"ver",
"verdad",
"verdadera",
"verdadero",
"vez",
"vosotras",
"vosotros",
"voy",
"vuestra",
"vuestras",
"vuestro",
"vuestros",
"w",
"x",
"y",
"ya",
"yo",
"z",
"él",
"ésa",
"ésas",
"ése",
"ésos",
"ésta",
"éstas",
"éste",
"éstos",
"última",
"últimas",
"último",
"últimos",
],
"et": [
"aga",
"ei",
"et",
"ja",
"jah",
"kas",
"kui",
"kõik",
"ma",
"me",
"mida",
"midagi",
"mind",
"minu",
"mis",
"mu",
"mul",
"mulle",
"nad",
"nii",
"oled",
"olen",
"oli",
"oma",
"on",
"pole",
"sa",
"seda",
"see",
"selle",
"siin",
"siis",
"ta",
"te",
"ära",
],
"fi": [
"aiemmin",
"aika",
"aikaa",
"aikaan",
"aikaisemmin",
"aikaisin",
"aikajen",
"aikana",
"aikoina",
"aikoo",
"aikovat",
"aina",
"ainakaan",
"ainakin",
"ainoa",
"ainoat",
"aiomme",
"aion",
"aiotte",
"aist",
"aivan",
"ajan",
"alas",
"alemmas",
"alkuisin",
"alkuun",
"alla",
"alle",
"aloitamme",
"aloitan",
"aloitat",
"aloitatte",
"aloitattivat",
"aloitettava",
"aloitettevaksi",
"aloitettu",
"aloitimme",
"aloitin",
"aloitit",
"aloititte",
"aloittaa",
"aloittamatta",
"aloitti",
"aloittivat",
"alta",
"aluksi",
"alussa",
"alusta",
"annettavaksi",
"annetteva",
"annettu",
"ansiosta",
"antaa",
"antamatta",
"antoi",
"aoua",
"apu",
"asia",
"asiaa",
"asian",
"asiasta",
"asiat",
"asioiden",
"asioihin",
"asioita",
"asti",
"avuksi",
"avulla",
"avun",
"avutta",
"edelle",
"edelleen",
"edellä",
"edeltä",
"edemmäs",
"edes",
"edessä",
"edestä",
"ehkä",
"ei",
"eikä",
"eilen",
"eivät",
"eli",
"ellei",
"elleivät",
"ellemme",
"ellen",
"ellet",
"ellette",
"emme",
"en",
"enemmän",
"eniten",
"ennen",
"ensi",
"ensimmäinen",
"ensimmäiseksi",
"ensimmäisen",
"ensimmäisenä",
"ensimmäiset",
"ensimmäisiksi",
"ensimmäisinä",
"ensimmäisiä",
"ensimmäistä",
"ensin",
"entinen",
"entisen",
"entisiä",
"entisten",
"entistä",
"enää",
"eri",
"erittäin",
"erityisesti",
"eräiden",
"eräs",
"eräät",
"esi",
"esiin",
"esillä",
"esimerkiksi",
"et",
"eteen",
"etenkin",
"etessa",
"ette",
"ettei",
"että",
"haikki",
"halua",
"haluaa",
"haluamatta",
"haluamme",
"haluan",
"haluat",
"haluatte",
"haluavat",
"halunnut",
"halusi",
"halusimme",
"halusin",
"halusit",
"halusitte",
"halusivat",
"halutessa",
"haluton",
"he",
"hei",
"heidän",
"heihin",
"heille",
"heiltä",
"heissä",
"heistä",
"heitä",
"helposti",
"heti",
"hetkellä",
"hieman",
"hitaasti",
"hoikein",
"huolimatta",
"huomenna",
"hyvien",
"hyviin",
"hyviksi",
"hyville",
"hyviltä",
"hyvin",
"hyvinä",
"hyvissä",
"hyvistä",
"hyviä",
"hyvä",
"hyvät",
"hyvää",
"hän",
"häneen",
"hänelle",
"hänellä",
"häneltä",
"hänen",
"hänessä",
"hänestä",
"hänet",
"ihan",
"ilman",
"ilmeisesti",
"itse",
"itsensä",
"itseään",
"ja",
"jo",
"johon",
"joiden",
"joihin",
"joiksi",
"joilla",
"joille",
"joilta",
"joissa",
"joista",
"joita",
"joka",
"jokainen",
"jokin",
"joko",
"joku",
"jolla",
"jolle",
"jolloin",
"jolta",
"jompikumpi",
"jonka",
"jonkin",
"jonne",
"joo",
"jopa",
"jos",
"joskus",
"jossa",
"josta",
"jota",
"jotain",
"joten",
"jotenkin",
"jotenkuten",
"jotka",
"jotta",
"jouduimme",
"jouduin",
"jouduit",
"jouduitte",
"joudumme",
"joudun",
"joudutte",
"joukkoon",
"joukossa",
"joukosta",
"joutua",
"joutui",
"joutuivat",
"joutumaan",
"joutuu",
"joutuvat",
"juuri",
"jälkeen",
"jälleen",
"jää",
"kahdeksan",
"kahdeksannen",
"kahdella",
"kahdelle",
"kahdelta",
"kahden",
"kahdessa",
"kahdesta",
"kahta",
"kahteen",
"kai",
"kaiken",
"kaikille",
"kaikilta",
"kaikkea",
"kaikki",
"kaikkia",
"kaikkiaan",
"kaikkialla",
"kaikkialle",
"kaikkialta",
"kaikkien",
"kaikkin",
"kaksi",
"kannalta",
"kannattaa",
"kanssa",
"kanssaan",
"kanssamme",
"kanssani",
"kanssanne",
"kanssasi",
"kauan",
"kauemmas",
"kaukana",
"kautta",
"kehen",
"keiden",
"keihin",
"keiksi",
"keille",
"keillä",
"keiltä",
"keinä",
"keissä",
"keistä",
"keitten",
"keittä",
"keitä",
"keneen",
"keneksi",
"kenelle",
"kenellä",
"keneltä",
"kenen",
"kenenä",
"kenessä",
"kenestä",
"kenet",
"kenettä",
"kennessästä",
"kenties",
"kerran",
"kerta",
"kertaa",
"keskellä",
"kesken",
"keskimäärin",
"ketkä",
"ketä",
"kiitos",
"kohti",
"koko",
"kokonaan",
"kolmas",
"kolme",
"kolmen",
"kolmesti",
"koska",
"koskaan",
"kovin",
"kuin",
"kuinka",
"kuinkan",
"kuitenkaan",
"kuitenkin",
"kuka",
"kukaan",
"kukin",
"kukka",
"kumpainen",
"kumpainenkaan",
"kumpi",
"kumpikaan",
"kumpikin",
"kun",
"kuten",
"kuuden",
"kuusi",
"kuutta",
"kylliksi",
"kyllä",
"kymmenen",
"kyse",
"liian",
"liki",
"lisäksi",
"lisää",
"lla",
"luo",
"luona",
"lähekkäin",
"lähelle",
"lähellä",
"läheltä",
"lähemmäs",
"lähes",
"lähinnä",
"lähtien",
"läpi",
"mahdollisimman",
"mahdollista",
"me",
"meidän",
"meille",
"meillä",
"melkein",
"melko",
"menee",
"meneet",
"menemme",
"menen",
"menet",
"menette",
"menevät",
"meni",
"menimme",
"menin",
"menit",
"menivät",
"mennessä",
"mennyt",
"menossa",
"mihin",
"mikin",
"miksi",
"mikä",
"mikäli",
"mikään",
"milloin",
"milloinkan",
"minne",
"minun",
"minut",
"minä",
"missä",
"mistä",
"miten",
"mitä",
"mitään",
"moi",
"molemmat",
"mones",
"monesti",
"monet",
"moni",
"moniaalla",
"moniaalle",
"moniaalta",
"monta",
"muassa",
"muiden",
"muita",
"muka",
"mukaan",
"mukaansa",
"mukana",
"mutta",
"muu",
"muualla",
"muualle",
"muualta",
"muuanne",
"muulloin",
"muun",
"muut",
"muuta",
"muutama",
"muutaman",
"muuten",
"myöhemmin",
"myös",
"myöskin",
"myöskään",
"myötä",
"ne",
"neljä",
"neljän",
"neljää",
"niiden",
"niin",
"niistä",
"niitä",
"noin",
"nopeammin",
"nopeasti",
"nopeiten",
"nro",
"nuo",
"nyt",
"näiden",
"näin",
"näissä",
"näissähin",
"näissälle",
"näissältä",
"näissästä",
"näitä",
"nämä",
"ohi",
"oikea",
"oikealla",
"oikein",
"ole",
"olemme",
"olen",
"olet",
"olette",
"oleva",
"olevan",
"olevat",
"oli",
"olimme",
"olin",
"olisi",
"olisimme",
"olisin",
"olisit",
"olisitte",
"olisivat",
"olit",
"olitte",
"olivat",
"olla",
"olleet",
"olli",
"ollut",
"oma",
"omaa",
"omaan",
"omaksi",
"omalle",
"omalta",
"oman",
"omassa",
"omat",
"omia",
"omien",
"omiin",
"omiksi",
"omille",
"omilta",
"omissa",
"omista",
"on",
"onkin",
"onko",
"ovat",
"paikoittain",
"paitsi",
"pakosti",
"paljon",
"paremmin",
"parempi",
"parhaillaan",
"parhaiten",
"perusteella",
"peräti",
"pian",
"pieneen",
"pieneksi",
"pienelle",
"pienellä",
"pieneltä",
"pienempi",
"pienestä",
"pieni",
"pienin",
"puolesta",
"puolestaan",
"päälle",
"runsaasti",
"saakka",
"sadam",
"sama",
"samaa",
"samaan",
"samalla",
"samallalta",
"samallassa",
"samallasta",
"saman",
"samat",
"samoin",
"sata",
"sataa",
"satojen",
"se",
"seitsemän",
"sekä",
"sen",
"seuraavat",
"siellä",
"sieltä",
"siihen",
"siinä",
"siis",
"siitä",
"sijaan",
"siksi",
"silloin",
"sillä",
"silti",
"sinne",
"sinua",
"sinulle",
"sinulta",
"sinun",
"sinussa",
"sinusta",
"sinut",
"sinä",
"sisäkkäin",
"sisällä",
"siten",
"sitten",
"sitä",
"ssa",
"sta",
"suoraan",
"suuntaan",
"suuren",
"suuret",
"suuri",
"suuria",
"suurin",
"suurten",
"taa",
"taas",
"taemmas",
"tahansa",
"tai",
"takaa",
"takaisin",
"takana",
"takia",
"tapauksessa",
"tarpeeksi",
"tavalla",
"tavoitteena",
"te",
"tietysti",
"todella",
"toinen",
"toisaalla",
"toisaalle",
"toisaalta",
"toiseen",
"toiseksi",
"toisella",
"toiselle",
"toiselta",
"toisemme",
"toisen",
"toisensa",
"toisessa",
"toisesta",
"toista",
"toistaiseksi",
"toki",
"tosin",
"tuhannen",
"tuhat",
"tule",
"tulee",
"tulemme",
"tulen",
"tulet",
"tulette",
"tulevat",
"tulimme",
"tulin",
"tulisi",
"tulisimme",
"tulisin",
"tulisit",
"tulisitte",
"tulisivat",
"tulit",
"tulitte",
"tulivat",
"tulla",
"tulleet",
"tullut",
"tuntuu",
"tuo",
"tuolla",
"tuolloin",
"tuolta",
"tuonne",
"tuskin",
"tykö",
"tähän",
"tällä",
"tällöin",
"tämä",
"tämän",
"tänne",
"tänä",
"tänään",
"tässä",
"tästä",
"täten",
"tätä",
"täysin",
"täytyvät",
"täytyy",
"täällä",
"täältä",
"ulkopuolella",
"usea",
"useasti",
"useimmiten",
"usein",
"useita",
"uudeksi",
"uudelleen",
"uuden",
"uudet",
"uusi",
"uusia",
"uusien",
"uusinta",
"uuteen",
"uutta",
"vaan",
"vahemmän",
"vai",
"vaiheessa",
"vaikea",
"vaikean",
"vaikeat",
"vaikeilla",
"vaikeille",
"vaikeilta",
"vaikeissa",
"vaikeista",
"vaikka",
"vain",
"varmasti",
"varsin",
"varsinkin",
"varten",
"vasen",
"vasenmalla",
"vasta",
"vastaan",
"vastakkain",
"vastan",
"verran",
"vielä",
"vierekkäin",
"vieressä",
"vieri",
"viiden",
"viime",
"viimeinen",
"viimeisen",
"viimeksi",
"viisi",
"voi",
"voidaan",
"voimme",
"voin",
"voisi",
"voit",
"voitte",
"voivat",
"vuoden",
"vuoksi",
"vuosi",
"vuosien",
"vuosina",
"vuotta",
"vähemmän",
"vähintään",
"vähiten",
"vähän",
"välillä",
"yhdeksän",
"yhden",
"yhdessä",
"yhteen",
"yhteensä",
"yhteydessä",
"yhteyteen",
"yhtä",
"yhtäälle",
"yhtäällä",
"yhtäältä",
"yhtään",
"yhä",
"yksi",
"yksin",
"yksittäin",
"yleensä",
"ylemmäs",
"yli",
"ylös",
"ympäri",
"älköön",
"älä",
],
"fr": [
"a",
"abord",
"absolument",
"afin",
"ah",
"ai",
"aie",
"ailleurs",
"ainsi",
"ait",
"allaient",
"allo",
"allons",
"allô",
"alors",
"anterieur",
"anterieure",
"anterieures",
"apres",
"après",
"as",
"assez",
"attendu",
"au",
"aucun",
"aucune",
"aujourd",
"aujourd'hui",
"aupres",
"auquel",
"aura",
"auraient",
"aurait",
"auront",
"aussi",
"autre",
"autrefois",
"autrement",
"autres",
"autrui",
"aux",
"auxquelles",
"auxquels",
"avaient",
"avais",
"avait",
"avant",
"avec",
"avoir",
"avons",
"ayant",
"b",
"bah",
"bas",
"basee",
"bat",
"beau",
"beaucoup",
"bien",
"bigre",
"boum",
"bravo",
"brrr",
"c",
"car",
"ce",
"ceci",
"cela",
"celle",
"celle-ci",
"celle-là",
"celles",
"celles-ci",
"celles-là",
"celui",
"celui-ci",
"celui-là",
"cent",
"cependant",
"certain",
"certaine",
"certaines",
"certains",
"certes",
"ces",
"cet",
"cette",
"ceux",
"ceux-ci",
"ceux-là",
"chacun",
"chacune",
"chaque",
"cher",
"chers",
"chez",
"chiche",
"chut",
"chère",
"chères",
"ci",
"cinq",
"cinquantaine",
"cinquante",
"cinquantième",
"cinquième",
"clac",
"clic",
"combien",
"comme",
"comment",
"comparable",
"comparables",
"compris",
"concernant",
"contre",
"couic",
"crac",
"d",
"da",
"dans",
"de",
"debout",
"dedans",
"dehors",
"deja",
"delà",
"depuis",
"dernier",
"derniere",
"derriere",
"derrière",
"des",
"desormais",
"desquelles",
"desquels",
"dessous",
"dessus",
"deux",
"deuxième",
"deuxièmement",
"devant",
"devers",
"devra",
"different",
"differentes",
"differents",
"différent",
"différente",
"différentes",
"différents",
"dire",
"directe",
"directement",
"dit",
"dite",
"dits",
"divers",
"diverse",
"diverses",
"dix",
"dix-huit",
"dix-neuf",
"dix-sept",
"dixième",
"doit",
"doivent",
"donc",
"dont",
"douze",
"douzième",
"dring",
"du",
"duquel",
"durant",
"dès",
"désormais",
"e",
"effet",
"egale",
"egalement",
"egales",
"eh",
"elle",
"elle-même",
"elles",
"elles-mêmes",
"en",
"encore",
"enfin",
"entre",
"envers",
"environ",
"es",
"est",
"et",
"etant",
"etc",
"etre",
"eu",
"euh",
"eux",
"eux-mêmes",
"exactement",
"excepté",
"extenso",
"exterieur",
"f",
"fais",
"faisaient",
"faisant",
"fait",
"façon",
"feront",
"fi",
"flac",
"floc",
"font",
"g",
"gens",
"h",
"ha",
"hein",
"hem",
"hep",
"hi",
"ho",
"holà",
"hop",
"hormis",
"hors",
"hou",
"houp",
"hue",
"hui",
"huit",
"huitième",
"hum",
"hurrah",
"hé",
"hélas",
"i",
"il",
"ils",
"importe",
"j",
"je",
"jusqu",
"jusque",
"juste",
"k",
"l",
"la",
"laisser",
"laquelle",
"las",
"le",
"lequel",
"les",
"lesquelles",
"lesquels",
"leur",
"leurs",
"longtemps",
"lors",
"lorsque",
"lui",
"lui-meme",
"lui-même",
"là",
"lès",
"m",
"ma",
"maint",
"maintenant",
"mais",
"malgre",
"malgré",
"maximale",
"me",
"meme",
"memes",
"merci",
"mes",
"mien",
"mienne",
"miennes",
"miens",
"mille",
"mince",
"minimale",
"moi",
"moi-meme",
"moi-même",
"moindres",
"moins",
"mon",
"moyennant",
"multiple",
"multiples",
"même",
"mêmes",
"n",
"na",
"naturel",
"naturelle",
"naturelles",
"ne",
"neanmoins",
"necessaire",
"necessairement",
"neuf",
"neuvième",
"ni",
"nombreuses",
"nombreux",
"non",
"nos",
"notamment",
"notre",
"nous",
"nous-mêmes",
"nouveau",
"nul",
"néanmoins",
"nôtre",
"nôtres",
"o",
"oh",
"ohé",
"ollé",
"olé",
"on",
"ont",
"onze",
"onzième",
"ore",
"ou",
"ouf",
"ouias",
"oust",
"ouste",
"outre",
"ouvert",
"ouverte",
"ouverts",
"o|",
"où",
"p",
"paf",
"pan",
"par",
"parce",
"parfois",
"parle",
"parlent",
"parler",
"parmi",
"parseme",
"partant",
"particulier",
"particulière",
"particulièrement",
"pas",
"passé",
"pendant",
"pense",
"permet",
"personne",
"peu",
"peut",
"peuvent",
"peux",
"pff",
"pfft",
"pfut",
"pif",
"pire",
"plein",
"plouf",
"plus",
"plusieurs",
"plutôt",
"possessif",
"possessifs",
"possible",
"possibles",
"pouah",
"pour",
"pourquoi",
"pourrais",
"pourrait",
"pouvait",
"prealable",
"precisement",
"premier",
"première",
"premièrement",
"pres",
"probable",
"probante",
"procedant",
"proche",
"près",
"psitt",
"pu",
"puis",
"puisque",
"pur",
"pure",
"q",
"qu",
"quand",
"quant",
"quant-à-soi",
"quanta",
"quarante",
"quatorze",
"quatre",
"quatre-vingt",
"quatrième",
"quatrièmement",
"que",
"quel",
"quelconque",
"quelle",
"quelles",
"quelqu'un",
"quelque",
"quelques",
"quels",
"qui",
"quiconque",
"quinze",
"quoi",
"quoique",
"r",
"rare",
"rarement",
"rares",
"relative",
"relativement",
"remarquable",
"rend",
"rendre",
"restant",
"reste",
"restent",
"restrictif",
"retour",
"revoici",
"revoilà",
"rien",
"s",
"sa",
"sacrebleu",
"sait",
"sans",
"sapristi",
"sauf",
"se",
"sein",
"seize",
"selon",
"semblable",
"semblaient",
"semble",
"semblent",
"sent",
"sept",
"septième",
"sera",
"seraient",
"serait",
"seront",
"ses",
"seul",
"seule",
"seulement",
"si",
"sien",
"sienne",
"siennes",
"siens",
"sinon",
"six",
"sixième",
"soi",
"soi-même",
"soit",
"soixante",
"son",
"sont",
"sous",
"souvent",
"specifique",
"specifiques",
"speculatif",
"stop",
"strictement",
"subtiles",
"suffisant",
"suffisante",
"suffit",
"suis",
"suit",
"suivant",
"suivante",
"suivantes",
"suivants",
"suivre",
"superpose",
"sur",
"surtout",
"t",
"ta",
"tac",
"tant",
"tardive",
"te",
"tel",
"telle",
"tellement",
"telles",
"tels",
"tenant",
"tend",
"tenir",
"tente",
"tes",
"tic",
"tien",
"tienne",
"tiennes",
"tiens",
"toc",
"toi",
"toi-même",
"ton",
"touchant",
"toujours",
"tous",
"tout",
"toute",
"toutefois",
"toutes",
"treize",
"trente",
"tres",
"trois",
"troisième",
"troisièmement",
"trop",
"très",
"tsoin",
"tsouin",
"tu",
"té",
"u",
"un",
"une",
"unes",
"uniformement",
"unique",
"uniques",
"uns",
"v",
"va",
"vais",
"vas",
"vers",
"via",
"vif",
"vifs",
"vingt",
"vivat",
"vive",
"vives",
"vlan",
"voici",
"voilà",
"vont",
"vos",
"votre",
"vous",
"vous-mêmes",
"vu",
"vé",
"vôtre",
"vôtres",
"w",
"x",
"y",
"z",
"zut",
"à",
"â",
"ça",
"ès",
"étaient",
"étais",
"était",
"étant",
"été",
"être",
"ô",
],
"hr": [
"a",
"ako",
"ali",
"bi",
"bih",
"bila",
"bili",
"bilo",
"bio",
"bismo",
"biste",
"biti",
"bumo",
"da",
"do",
"duž",
"ga",
"hoće",
"hoćemo",
"hoćete",
"hoćeš",
"hoću",
"i",
"iako",
"ih",
"ili",
"iz",
"ja",
"je",
"jedna",
"jedne",
"jedno",
"jer",
"jesam",
"jesi",
"jesmo",
"jest",
"jeste",
"jesu",
"jim",
"joj",
"još",
"ju",
"kada",
"kako",
"kao",
"koja",
"koje",
"koji",
"kojima",
"koju",
"kroz",
"li",
"me",
"mene",
"meni",
"mi",
"mimo",
"moj",
"moja",
"moje",
"mu",
"na",
"nad",
"nakon",
"nam",
"nama",
"nas",
"naš",
"naša",
"naše",
"našeg",
"ne",
"nego",
"neka",
"neki",
"nekog",
"neku",
"nema",
"netko",
"neće",
"nećemo",
"nećete",
"nećeš",
"neću",
"nešto",
"ni",
"nije",
"nikoga",
"nikoje",
"nikoju",
"nisam",
"nisi",
"nismo",
"niste",
"nisu",
"njega",
"njegov",
"njegova",
"njegovo",
"njemu",
"njezin",
"njezina",
"njezino",
"njih",
"njihov",
"njihova",
"njihovo",
"njim",
"njima",
"njoj",
"nju",
"no",
"o",
"od",
"odmah",
"on",
"ona",
"oni",
"ono",
"ova",
"pa",
"pak",
"po",
"pod",
"pored",
"prije",
"s",
"sa",
"sam",
"samo",
"se",
"sebe",
"sebi",
"si",
"smo",
"ste",
"su",
"sve",
"svi",
"svog",
"svoj",
"svoja",
"svoje",
"svom",
"ta",
"tada",
"taj",
"tako",
"te",
"tebe",
"tebi",
"ti",
"to",
"toj",
"tome",
"tu",
"tvoj",
"tvoja",
"tvoje",
"u",
"uz",
"vam",
"vama",
"vas",
"vaš",
"vaša",
"vaše",
"već",
"vi",
"vrlo",
"za",
"zar",
"će",
"ćemo",
"ćete",
"ćeš",
"ću",
"što",
],
"hu": [
"a",
"abba",
"abban",
"abból",
"addig",
"ahhoz",
"ahogy",
"ahol",
"aki",
"akik",
"akkor",
"akár",
"alapján",
"alatt",
"alatta",
"alattad",
"alattam",
"alattatok",
"alattuk",
"alattunk",
"alá",
"alád",
"alájuk",
"alám",
"alánk",
"alátok",
"alól",
"alóla",
"alólad",
"alólam",
"alólatok",
"alóluk",
"alólunk",
"amely",
"amelybol",
"amelyek",
"amelyekben",
"amelyeket",
"amelyet",
"amelyik",
"amelynek",
"ami",
"amikor",
"amit",
"amolyan",
"amott",
"amíg",
"annak",
"annál",
"arra",
"arról",
"attól",
"az",
"aznap",
"azok",
"azokat",
"azokba",
"azokban",
"azokból",
"azokhoz",
"azokig",
"azokkal",
"azokká",
"azoknak",
"azoknál",
"azokon",
"azokra",
"azokról",
"azoktól",
"azokért",
"azon",
"azonban",
"azonnal",
"azt",
"aztán",
"azután",
"azzal",
"azzá",
"azért",
"bal",
"balra",
"ban",
"be",
"belé",
"beléd",
"beléjük",
"belém",
"belénk",
"belétek",
"belül",
"belőle",
"belőled",
"belőlem",
"belőletek",
"belőlük",
"belőlünk",
"ben",
"benne",
"benned",
"bennem",
"bennetek",
"bennük",
"bennünk",
"bár",
"bárcsak",
"bármilyen",
"búcsú",
"cikk",
"cikkek",
"cikkeket",
"csak",
"csakhogy",
"csupán",
"de",
"dehogy",
"e",
"ebbe",
"ebben",
"ebből",
"eddig",
"egy",
"egyebek",
"egyebet",
"egyedül",
"egyelőre",
"egyes",
"egyet",
"egyetlen",
"egyik",
"egymás",
"egyre",
"egyszerre",
"egyéb",
"együtt",
"egész",
"egészen",
"ehhez",
"ekkor",
"el",
"eleinte",
"ellen",
"ellenes",
"elleni",
"ellenére",
"elmondta",
"első",
"elsők",
"elsősorban",
"elsőt",
"elé",
"eléd",
"elég",
"eléjük",
"elém",
"elénk",
"elétek",
"elő",
"előbb",
"elől",
"előle",
"előled",
"előlem",
"előletek",
"előlük",
"előlünk",
"először",
"előtt",
"előtte",
"előtted",
"előttem",
"előttetek",
"előttük",
"előttünk",
"előző",
"emilyen",
"engem",
"ennek",
"ennyi",
"ennél",
"enyém",
"erre",
"erről",
"esetben",
"ettől",
"ez",
"ezek",
"ezekbe",
"ezekben",
"ezekből",
"ezeken",
"ezeket",
"ezekhez",
"ezekig",
"ezekkel",
"ezekké",
"ezeknek",
"ezeknél",
"ezekre",
"ezekről",
"ezektől",
"ezekért",
"ezen",
"ezentúl",
"ezer",
"ezret",
"ezt",
"ezután",
"ezzel",
"ezzé",
"ezért",
"fel",
"fele",
"felek",
"felet",
"felett",
"felé",
"fent",
"fenti",
"fél",
"fölé",
"gyakran",
"ha",
"halló",
"hamar",
"hanem",
"harmadik",
"harmadikat",
"harminc",
"hat",
"hatodik",
"hatodikat",
"hatot",
"hatvan",
"helyett",
"hetedik",
"hetediket",
"hetet",
"hetven",
"hirtelen",
"hiszen",
"hiába",
"hogy",
"hogyan",
"hol",
"holnap",
"holnapot",
"honnan",
"hova",
"hozzá",
"hozzád",
"hozzájuk",
"hozzám",
"hozzánk",
"hozzátok",
"hurrá",
"huszadik",
"hány",
"hányszor",
"hármat",
"három",
"hát",
"hátha",
"hátulsó",
"hét",
"húsz",
"ide",
"ide-оda",
"idén",
"igazán",
"igen",
"ill",
"illetve",
"ilyen",
"ilyenkor",
"immár",
"inkább",
"is",
"ismét",
"ison",
"itt",
"jelenleg",
"jobban",
"jobbra",
"jó",
"jól",
"jólesik",
"jóval",
"jövőre",
"kell",
"kellene",
"kellett",
"kelljen",
"keressünk",
"keresztül",
"ketten",
"kettő",
"kettőt",
"kevés",
"ki",
"kiben",
"kiből",
"kicsit",
"kicsoda",
"kihez",
"kik",
"kikbe",
"kikben",
"kikből",
"kiken",
"kiket",
"kikhez",
"kikkel",
"kikké",
"kiknek",
"kiknél",
"kikre",
"kikről",
"kiktől",
"kikért",
"kilenc",
"kilencedik",
"kilencediket",
"kilencet",
"kilencven",
"kin",
"kinek",
"kinél",
"kire",
"kiről",
"kit",
"kitől",
"kivel",
"kivé",
"kié",
"kiért",
"korábban",
"képest",
"kérem",
"kérlek",
"kész",
"késő",
"később",
"későn",
"két",
"kétszer",
"kívül",
"körül",
"köszönhetően",
"köszönöm",
"közben",
"közel",
"közepesen",
"közepén",
"közé",
"között",
"közül",
"külön",
"különben",
"különböző",
"különbözőbb",
"különbözőek",
"lassan",
"le",
"legalább",
"legyen",
"lehet",
"lehetetlen",
"lehetett",
"lehetőleg",
"lehetőség",
"lenne",
"lenni",
"lennék",
"lennének",
"lesz",
"leszek",
"lesznek",
"leszünk",
"lett",
"lettek",
"lettem",
"lettünk",
"lévő",
"ma",
"maga",
"magad",
"magam",
"magatokat",
"magukat",
"magunkat",
"magát",
"mai",
"majd",
"majdnem",
"manapság",
"meg",
"megcsinál",
"megcsinálnak",
"megint",
"megvan",
"mellett",
"mellette",
"melletted",
"mellettem",
"mellettetek",
"mellettük",
"mellettünk",
"mellé",
"melléd",
"melléjük",
"mellém",
"mellénk",
"mellétek",
"mellől",
"mellőle",
"mellőled",
"mellőlem",
"mellőletek",
"mellőlük",
"mellőlünk",
"mely",
"melyek",
"melyik",
"mennyi",
"mert",
"mi",
"miatt",
"miatta",
"miattad",
"miattam",
"miattatok",
"miattuk",
"miattunk",
"mibe",
"miben",
"miből",
"mihez",
"mik",
"mikbe",
"mikben",
"mikből",
"miken",
"miket",
"mikhez",
"mikkel",
"mikké",
"miknek",
"miknél",
"mikor",
"mikre",
"mikről",
"miktől",
"mikért",
"milyen",
"min",
"mind",
"mindegyik",
"mindegyiket",
"minden",
"mindenesetre",
"mindenki",
"mindent",
"mindenütt",
"mindig",
"mindketten",
"minek",
"minket",
"mint",
"mintha",
"minél",
"mire",
"miről",
"mit",
"mitől",
"mivel",
"mivé",
"miért",
"mondta",
"most",
"mostanáig",
"már",
"más",
"másik",
"másikat",
"másnap",
"második",
"másodszor",
"mások",
"másokat",
"mást",
"még",
"mégis",
"míg",
"mögé",
"mögéd",
"mögéjük",
"mögém",
"mögénk",
"mögétek",
"mögött",
"mögötte",
"mögötted",
"mögöttem",
"mögöttetek",
"mögöttük",
"mögöttünk",
"mögül",
"mögüle",
"mögüled",
"mögülem",
"mögületek",
"mögülük",
"mögülünk",
"múltkor",
"múlva",
"na",
"nagy",
"nagyobb",
"nagyon",
"naponta",
"napot",
"ne",
"negyedik",
"negyediket",
"negyven",
"neked",
"nekem",
"neki",
"nekik",
"nektek",
"nekünk",
"nem",
"nemcsak",
"nemrég",
"nincs",
"nyolc",
"nyolcadik",
"nyolcadikat",
"nyolcat",
"nyolcvan",
"nála",
"nálad",
"nálam",
"nálatok",
"náluk",
"nálunk",
"négy",
"négyet",
"néha",
"néhány",
"nélkül",
"o",
"oda",
"ok",
"olyan",
"onnan",
"ott",
"pedig",
"persze",
"pár",
"például",
"rajta",
"rajtad",
"rajtam",
"rajtatok",
"rajtuk",
"rajtunk",
"rendben",
"rosszul",
"rá",
"rád",
"rájuk",
"rám",
"ránk",
"rátok",
"régen",
"régóta",
"részére",
"róla",
"rólad",
"rólam",
"rólatok",
"róluk",
"rólunk",
"rögtön",
"s",
"saját",
"se",
"sem",
"semmi",
"semmilyen",
"semmiség",
"senki",
"soha",
"sok",
"sokan",
"sokat",
"sokkal",
"sokszor",
"sokáig",
"során",
"stb.",
"szemben",
"szerbusz",
"szerint",
"szerinte",
"szerinted",
"szerintem",
"szerintetek",
"szerintük",
"szerintünk",
"szervusz",
"szinte",
"számára",
"száz",
"századik",
"százat",
"szépen",
"szét",
"szíves",
"szívesen",
"szíveskedjék",
"sőt",
"talán",
"tavaly",
"te",
"tegnap",
"tegnapelőtt",
"tehát",
"tele",
"teljes",
"tessék",
"ti",
"tied",
"titeket",
"tizedik",
"tizediket",
"tizenegy",
"tizenegyedik",
"tizenhat",
"tizenhárom",
"tizenhét",
"tizenkettedik",
"tizenkettő",
"tizenkilenc",
"tizenkét",
"tizennyolc",
"tizennégy",
"tizenöt",
"tizet",
"tovább",
"további",
"továbbá",
"távol",
"téged",
"tényleg",
"tíz",
"több",
"többi",
"többször",
"túl",
"tőle",
"tőled",
"tőlem",
"tőletek",
"tőlük",
"tőlünk",
"ugyanakkor",
"ugyanez",
"ugyanis",
"ugye",
"urak",
"uram",
"urat",
"utoljára",
"utolsó",
"után",
"utána",
"vagy",
"vagyis",
"vagyok",
"vagytok",
"vagyunk",
"vajon",
"valahol",
"valaki",
"valakit",
"valamelyik",
"valami",
"valamint",
"való",
"van",
"vannak",
"vele",
"veled",
"velem",
"veletek",
"velük",
"velünk",
"vissza",
"viszlát",
"viszont",
"viszontlátásra",
"volna",
"volnának",
"volnék",
"volt",
"voltak",
"voltam",
"voltunk",
"végre",
"végén",
"végül",
"által",
"általában",
"ám",
"át",
"éljen",
"én",
"éppen",
"érte",
"érted",
"értem",
"értetek",
"értük",
"értünk",
"és",
"év",
"évben",
"éve",
"évek",
"éves",
"évi",
"évvel",
"így",
"óta",
"ön",
"önbe",
"önben",
"önből",
"önhöz",
"önnek",
"önnel",
"önnél",
"önre",
"önről",
"önt",
"öntől",
"önért",
"önök",
"önökbe",
"önökben",
"önökből",
"önöket",
"önökhöz",
"önökkel",
"önöknek",
"önöknél",
"önökre",
"önökről",
"önöktől",
"önökért",
"önökön",
"önön",
"össze",
"öt",
"ötven",
"ötödik",
"ötödiket",
"ötöt",
"úgy",
"úgyis",
"úgynevezett",
"új",
"újabb",
"újra",
"úr",
"ő",
"ők",
"őket",
"őt",
],
"it": [
"IE",
"a",
"abbastanza",
"abbia",
"abbiamo",
"abbiano",
"abbiate",
"accidenti",
"ad",
"adesso",
"affinche",
"agl",
"agli",
"ahime",
"ahimè",
"ai",
"al",
"alcuna",
"alcuni",
"alcuno",
"all",
"alla",
"alle",
"allo",
"allora",
"altri",
"altrimenti",
"altro",
"altrove",
"altrui",
"anche",
"ancora",
"anni",
"anno",
"ansa",
"anticipo",
"assai",
"attesa",
"attraverso",
"avanti",
"avemmo",
"avendo",
"avente",
"aver",
"avere",
"averlo",
"avesse",
"avessero",
"avessi",
"avessimo",
"aveste",
"avesti",
"avete",
"aveva",
"avevamo",
"avevano",
"avevate",
"avevi",
"avevo",
"avrai",
"avranno",
"avrebbe",
"avrebbero",
"avrei",
"avremmo",
"avremo",
"avreste",
"avresti",
"avrete",
"avrà",
"avrò",
"avuta",
"avute",
"avuti",
"avuto",
"basta",
"bene",
"benissimo",
"berlusconi",
"brava",
"bravo",
"c",
"casa",
"caso",
"cento",
"certa",
"certe",
"certi",
"certo",
"che",
"chi",
"chicchessia",
"chiunque",
"ci",
"ciascuna",
"ciascuno",
"cima",
"cio",
"cioe",
"cioè",
"circa",
"citta",
"città",
"ciò",
"co",
"codesta",
"codesti",
"codesto",
"cogli",
"coi",
"col",
"colei",
"coll",
"coloro",
"colui",
"come",
"cominci",
"comunque",
"con",
"concernente",
"conciliarsi",
"conclusione",
"consiglio",
"contro",
"cortesia",
"cos",
"cosa",
"cosi",
"così",
"cui",
"d",
"da",
"dagl",
"dagli",
"dai",
"dal",
"dall",
"dalla",
"dalle",
"dallo",
"dappertutto",
"davanti",
"degl",
"degli",
"dei",
"del",
"dell",
"della",
"delle",
"dello",
"dentro",
"detto",
"deve",
"di",
"dice",
"dietro",
"dire",
"dirimpetto",
"diventa",
"diventare",
"diventato",
"dopo",
"dov",
"dove",
"dovra",
"dovrà",
"dovunque",
"due",
"dunque",
"durante",
"e",
"ebbe",
"ebbero",
"ebbi",
"ecc",
"ecco",
"ed",
"effettivamente",
"egli",
"ella",
"entrambi",
"eppure",
"era",
"erano",
"eravamo",
"eravate",
"eri",
"ero",
"esempio",
"esse",
"essendo",
"esser",
"essere",
"essi",
"ex",
"fa",
"faccia",
"facciamo",
"facciano",
"facciate",
"faccio",
"facemmo",
"facendo",
"facesse",
"facessero",
"facessi",
"facessimo",
"faceste",
"facesti",
"faceva",
"facevamo",
"facevano",
"facevate",
"facevi",
"facevo",
"fai",
"fanno",
"farai",
"faranno",
"fare",
"farebbe",
"farebbero",
"farei",
"faremmo",
"faremo",
"fareste",
"faresti",
"farete",
"farà",
"farò",
"fatto",
"favore",
"fece",
"fecero",
"feci",
"fin",
"finalmente",
"finche",
"fine",
"fino",
"forse",
"forza",
"fosse",
"fossero",
"fossi",
"fossimo",
"foste",
"fosti",
"fra",
"frattempo",
"fu",
"fui",
"fummo",
"fuori",
"furono",
"futuro",
"generale",
"gia",
"giacche",
"giorni",
"giorno",
"già",
"gli",
"gliela",
"gliele",
"glieli",
"glielo",
"gliene",
"governo",
"grande",
"grazie",
"gruppo",
"ha",
"haha",
"hai",
"hanno",
"ho",
"i",
"ieri",
"il",
"improvviso",
"in",
"inc",
"infatti",
"inoltre",
"insieme",
"intanto",
"intorno",
"invece",
"io",
"l",
"la",
"lasciato",
"lato",
"lavoro",
"le",
"lei",
"li",
"lo",
"lontano",
"loro",
"lui",
"lungo",
"luogo",
"là",
"ma",
"macche",
"magari",
"maggior",
"mai",
"male",
"malgrado",
"malissimo",
"mancanza",
"marche",
"me",
"medesimo",
"mediante",
"meglio",
"meno",
"mentre",
"mesi",
"mezzo",
"mi",
"mia",
"mie",
"miei",
"mila",
"miliardi",
"milioni",
"minimi",
"ministro",
"mio",
"modo",
"molti",
"moltissimo",
"molto",
"momento",
"mondo",
"mosto",
"nazionale",
"ne",
"negl",
"negli",
"nei",
"nel",
"nell",
"nella",
"nelle",
"nello",
"nemmeno",
"neppure",
"nessun",
"nessuna",
"nessuno",
"niente",
"no",
"noi",
"non",
"nondimeno",
"nonostante",
"nonsia",
"nostra",
"nostre",
"nostri",
"nostro",
"novanta",
"nove",
"nulla",
"nuovo",
"o",
"od",
"oggi",
"ogni",
"ognuna",
"ognuno",
"oltre",
"oppure",
"ora",
"ore",
"osi",
"ossia",
"ottanta",
"otto",
"paese",
"parecchi",
"parecchie",
"parecchio",
"parte",
"partendo",
"peccato",
"peggio",
"per",
"perche",
"perchè",
"perché",
"percio",
"perciò",
"perfino",
"pero",
"persino",
"persone",
"però",
"piedi",
"pieno",
"piglia",
"piu",
"piuttosto",
"più",
"po",
"pochissimo",
"poco",
"poi",
"poiche",
"possa",
"possedere",
"posteriore",
"posto",
"potrebbe",
"preferibilmente",
"presa",
"press",
"prima",
"primo",
"principalmente",
"probabilmente",
"proprio",
"puo",
"pure",
"purtroppo",
"può",
"qualche",
"qualcosa",
"qualcuna",
"qualcuno",
"quale",
"quali",
"qualunque",
"quando",
"quanta",
"quante",
"quanti",
"quanto",
"quantunque",
"quasi",
"quattro",
"quel",
"quella",
"quelle",
"quelli",
"quello",
"quest",
"questa",
"queste",
"questi",
"questo",
"qui",
"quindi",
"realmente",
"recente",
"recentemente",
"registrazione",
"relativo",
"riecco",
"salvo",
"sara",
"sarai",
"saranno",
"sarebbe",
"sarebbero",
"sarei",
"saremmo",
"saremo",
"sareste",
"saresti",
"sarete",
"sarà",
"sarò",
"scola",
"scopo",
"scorso",
"se",
"secondo",
"seguente",
"seguito",
"sei",
"sembra",
"sembrare",
"sembrato",
"sembri",
"sempre",
"senza",
"sette",
"si",
"sia",
"siamo",
"siano",
"siate",
"siete",
"sig",
"solito",
"solo",
"soltanto",
"sono",
"sopra",
"sotto",
"spesso",
"srl",
"sta",
"stai",
"stando",
"stanno",
"starai",
"staranno",
"starebbe",
"starebbero",
"starei",
"staremmo",
"staremo",
"stareste",
"staresti",
"starete",
"starà",
"starò",
"stata",
"state",
"stati",
"stato",
"stava",
"stavamo",
"stavano",
"stavate",
"stavi",
"stavo",
"stemmo",
"stessa",
"stesse",
"stessero",
"stessi",
"stessimo",
"stesso",
"steste",
"stesti",
"stette",
"stettero",
"stetti",
"stia",
"stiamo",
"stiano",
"stiate",
"sto",
"su",
"sua",
"subito",
"successivamente",
"successivo",
"sue",
"sugl",
"sugli",
"sui",
"sul",
"sull",
"sulla",
"sulle",
"sullo",
"suo",
"suoi",
"tale",
"tali",
"talvolta",
"tanto",
"te",
"tempo",
"ti",
"titolo",
"torino",
"tra",
"tranne",
"tre",
"trenta",
"troppo",
"trovato",
"tu",
"tua",
"tue",
"tuo",
"tuoi",
"tutta",
"tuttavia",
"tutte",
"tutti",
"tutto",
"uguali",
"ulteriore",
"ultimo",
"un",
"una",
"uno",
"uomo",
"va",
"vale",
"vari",
"varia",
"varie",
"vario",
"verso",
"vi",
"via",
"vicino",
"visto",
"vita",
"voi",
"volta",
"volte",
"vostra",
"vostre",
"vostri",
"vostro",
"è",
],
"ko": [
"!",
'"',
"$",
"%",
"&",
"'",
"(",
")",
"*",
"+",
",",
"-",
".",
"...",
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
";",
"<",
"=",
">",
"?",
"@",
"\\",
"^",
"_",
"`",
"|",
"~",
"·",
"—",
"——",
"‘",
"’",
"“",
"”",
"…",
"、",
"。",
"〈",
"〉",
"《",
"》",
"가",
"가까스로",
"가령",
"각",
"각각",
"각자",
"각종",
"갖고말하자면",
"같다",
"같이",
"개의치않고",
"거니와",
"거바",
"거의",
"것",
"것과 같이",
"것들",
"게다가",
"게우다",
"겨우",
"견지에서",
"결과에 이르다",
"결국",
"결론을 낼 수 있다",
"겸사겸사",
"고려하면",
"고로",
"곧",
"공동으로",
"과",
"과연",
"관계가 있다",
"관계없이",
"관련이 있다",
"관하여",
"관한",
"관해서는",
"구",
"구체적으로",
"구토하다",
"그",
"그들",
"그때",
"그래",
"그래도",
"그래서",
"그러나",
"그러니",
"그러니까",
"그러면",
"그러므로",
"그러한즉",
"그런 까닭에",
"그런데",
"그런즉",
"그럼",
"그럼에도 불구하고",
"그렇게 함으로써",
"그렇지",
"그렇지 않다면",
"그렇지 않으면",
"그렇지만",
"그렇지않으면",
"그리고",
"그리하여",
"그만이다",
"그에 따르는",
"그위에",
"그저",
"그중에서",
"그치지 않다",
"근거로",
"근거하여",
"기대여",
"기점으로",
"기준으로",
"기타",
"까닭으로",
"까악",
"까지",
"까지 미치다",
"까지도",
"꽈당",
"끙끙",
"끼익",
"나",
"나머지는",
"남들",
"남짓",
"너",
"너희",
"너희들",
"네",
"넷",
"년",
"논하지 않다",
"놀라다",
"누가 알겠는가",
"누구",
"다른",
"다른 방면으로",
"다만",
"다섯",
"다소",
"다수",
"다시 말하자면",
"다시말하면",
"다음",
"다음에",
"다음으로",
"단지",
"답다",
"당신",
"당장",
"대로 하다",
"대하면",
"대하여",
"대해 말하자면",
"대해서",
"댕그",
"더구나",
"더군다나",
"더라도",
"더불어",
"더욱더",
"더욱이는",
"도달하다",
"도착하다",
"동시에",
"동안",
"된바에야",
"된이상",
"두번째로",
"둘",
"둥둥",
"뒤따라",
"뒤이어",
"든간에",
"들",
"등",
"등등",
"딩동",
"따라",
"따라서",
"따위",
"따지지 않다",
"딱",
"때",
"때가 되어",
"때문에",
"또",
"또한",
"뚝뚝",
"라 해도",
"령",
"로",
"로 인하여",
"로부터",
"로써",
"륙",
"를",
"마음대로",
"마저",
"마저도",
"마치",
"막론하고",
"만 못하다",
"만약",
"만약에",
"만은 아니다",
"만이 아니다",
"만일",
"만큼",
"말하자면",
"말할것도 없고",
"매",
"매번",
"메쓰겁다",
"몇",
"모",
"모두",
"무렵",
"무릎쓰고",
"무슨",
"무엇",
"무엇때문에",
"물론",
"및",
"바꾸어말하면",
"바꾸어말하자면",
"바꾸어서 말하면",
"바꾸어서 한다면",
"바꿔 말하면",
"바로",
"바와같이",
"밖에 안된다",
"반대로",
"반대로 말하자면",
"반드시",
"버금",
"보는데서",
"보다더",
"보드득",
"본대로",
"봐",
"봐라",
"부류의 사람들",
"부터",
"불구하고",
"불문하고",
"붕붕",
"비걱거리다",
"비교적",
"비길수 없다",
"비로소",
"비록",
"비슷하다",
"비추어 보아",
"비하면",
"뿐만 아니라",
"뿐만아니라",
"뿐이다",
"삐걱",
"삐걱거리다",
"사",
"삼",
"상대적으로 말하자면",
"생각한대로",
"설령",
"설마",
"설사",
"셋",
"소생",
"소인",
"솨",
"쉿",
"습니까",
"습니다",
"시각",
"시간",
"시작하여",
"시초에",
"시키다",
"실로",
"심지어",
"아",
"아니",
"아니나다를가",
"아니라면",
"아니면",
"아니었다면",
"아래윗",
"아무거나",
"아무도",
"아야",
"아울러",
"아이",
"아이고",
"아이구",
"아이야",
"아이쿠",
"아하",
"아홉",
"안 그러면",
"않기 위하여",
"않기 위해서",
"알 수 있다",
"알았어",
"앗",
"앞에서",
"앞의것",
"야",
"약간",
"양자",
"어",
"어기여차",
"어느",
"어느 년도",
"어느것",
"어느곳",
"어느때",
"어느쪽",
"어느해",
"어디",
"어때",
"어떠한",
"어떤",
"어떤것",
"어떤것들",
"어떻게",
"어떻해",
"어이",
"어째서",
"어쨋든",
"어쩔수 없다",
"어찌",
"어찌됏든",
"어찌됏어",
"어찌하든지",
"어찌하여",
"언제",
"언젠가",
"얼마",
"얼마 안 되는 것",
"얼마간",
"얼마나",
"얼마든지",
"얼마만큼",
"얼마큼",
"엉엉",
"에",
"에 가서",
"에 달려 있다",
"에 대해",
"에 있다",
"에 한하다",
"에게",
"에서",
"여",
"여기",
"여덟",
"여러분",
"여보시오",
"여부",
"여섯",
"여전히",
"여차",
"연관되다",
"연이서",
"영",
"영차",
"옆사람",
"예",
"예를 들면",
"예를 들자면",
"예컨대",
"예하면",
"오",
"오로지",
"오르다",
"오자마자",
"오직",
"오호",
"오히려",
"와",
"와 같은 사람들",
"와르르",
"와아",
"왜",
"왜냐하면",
"외에도",
"요만큼",
"요만한 것",
"요만한걸",
"요컨대",
"우르르",
"우리",
"우리들",
"우선",
"우에 종합한것과같이",
"운운",
"월",
"위에서 서술한바와같이",
"위하여",
"위해서",
"윙윙",
"육",
"으로",
"으로 인하여",
"으로서",
"으로써",
"을",
"응",
"응당",
"의",
"의거하여",
"의지하여",
"의해",
"의해되다",
"의해서",
"이",
"이 되다",
"이 때문에",
"이 밖에",
"이 외에",
"이 정도의",
"이것",
"이곳",
"이때",
"이라면",
"이래",
"이러이러하다",
"이러한",
"이런",
"이럴정도로",
"이렇게 많은 것",
"이렇게되면",
"이렇게말하자면",
"이렇구나",
"이로 인하여",
"이르기까지",
"이리하여",
"이만큼",
"이번",
"이봐",
"이상",
"이어서",
"이었다",
"이와 같다",
"이와 같은",
"이와 반대로",
"이와같다면",
"이외에도",
"이용하여",
"이유만으로",
"이젠",
"이지만",
"이쪽",
"이천구",
"이천육",
"이천칠",
"이천팔",
"인 듯하다",
"인젠",
"일",
"일것이다",
"일곱",
"일단",
"일때",
"일반적으로",
"일지라도",
"임에 틀림없다",
"입각하여",
"입장에서",
"잇따라",
"있다",
"자",
"자기",
"자기집",
"자마자",
"자신",
"잠깐",
"잠시",
"저",
"저것",
"저것만큼",
"저기",
"저쪽",
"저희",
"전부",
"전자",
"전후",
"점에서 보아",
"정도에 이르다",
"제",
"제각기",
"제외하고",
"조금",
"조차",
"조차도",
"졸졸",
"좀",
"좋아",
"좍좍",
"주룩주룩",
"주저하지 않고",
"줄은 몰랏다",
"줄은모른다",
"중에서",
"중의하나",
"즈음하여",
"즉",
"즉시",
"지든지",
"지만",
"지말고",
"진짜로",
"쪽으로",
"차라리",
"참",
"참나",
"첫번째로",
"쳇",
"총적으로",
"총적으로 말하면",
"총적으로 보면",
"칠",
"콸콸",
"쾅쾅",
"쿵",
"타다",
"타인",
"탕탕",
"토하다",
"통하여",
"툭",
"퉤",
"틈타",
"팍",
"팔",
"퍽",
"펄렁",
"하",
"하게될것이다",
"하게하다",
"하겠는가",
"하고 있다",
"하고있었다",
"하곤하였다",
"하구나",
"하기 때문에",
"하기 위하여",
"하기는한데",
"하기만 하면",
"하기보다는",
"하기에",
"하나",
"하느니",
"하는 김에",
"하는 편이 낫다",
"하는것도",
"하는것만 못하다",
"하는것이 낫다",
"하는바",
"하더라도",
"하도다",
"하도록시키다",
"하도록하다",
"하든지",
"하려고하다",
"하마터면",
"하면 할수록",
"하면된다",
"하면서",
"하물며",
"하여금",
"하여야",
"하자마자",
"하지 않는다면",
"하지 않도록",
"하지마",
"하지마라",
"하지만",
"하하",
"한 까닭에",
"한 이유는",
"한 후",
"한다면",
"한다면 몰라도",
"한데",
"한마디",
"한적이있다",
"한켠으로는",
"한항목",
"할 따름이다",
"할 생각이다",
"할 줄 안다",
"할 지경이다",
"할 힘이 있다",
"할때",
"할만하다",
"할망정",
"할뿐",
"할수있다",
"할수있어",
"할줄알다",
"할지라도",
"할지언정",
"함께",
"해도된다",
"해도좋다",
"해봐요",
"해서는 안된다",
"해야한다",
"해요",
"했어요",
"향하다",
"향하여",
"향해서",
"허",
"허걱",
"허허",
"헉",
"헉헉",
"헐떡헐떡",
"형식으로 쓰여",
"혹시",
"혹은",
"혼자",
"훨씬",
"휘익",
"휴",
"흐흐",
"흥",
"힘입어",
"︿",
"!",
"#",
"$",
"%",
"&",
"(",
")",
"*",
"+",
",",
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
":",
";",
"<",
">",
"?",
"@",
"[",
"]",
"{",
"|",
"}",
"~",
"¥",
],
"nl": [
"aan",
"achte",
"achter",
"af",
"al",
"alle",
"alleen",
"alles",
"als",
"ander",
"anders",
"beetje",
"behalve",
"beide",
"beiden",
"ben",
"beneden",
"bent",
"bij",
"bijna",
"bijv",
"blijkbaar",
"blijken",
"boven",
"bv",
"daar",
"daardoor",
"daarin",
"daarna",
"daarom",
"daaruit",
"dan",
"dat",
"de",
"deden",
"deed",
"derde",
"derhalve",
"dertig",
"deze",
"dhr",
"die",
"dit",
"doe",
"doen",
"doet",
"door",
"drie",
"duizend",
"echter",
"een",
"eens",
"eerst",
"eerste",
"eigen",
"eigenlijk",
"elk",
"elke",
"en",
"enige",
"er",
"erg",
"ergens",
"etc",
"etcetera",
"even",
"geen",
"genoeg",
"geweest",
"haar",
"haarzelf",
"had",
"hadden",
"heb",
"hebben",
"hebt",
"hedden",
"heeft",
"heel",
"hem",
"hemzelf",
"hen",
"het",
"hetzelfde",
"hier",
"hierin",
"hierna",
"hierom",
"hij",
"hijzelf",
"hoe",
"honderd",
"hun",
"ieder",
"iedere",
"iedereen",
"iemand",
"iets",
"ik",
"in",
"inderdaad",
"intussen",
"is",
"ja",
"je",
"jij",
"jijzelf",
"jou",
"jouw",
"jullie",
"kan",
"kon",
"konden",
"kun",
"kunnen",
"kunt",
"laatst",
"later",
"lijken",
"lijkt",
"maak",
"maakt",
"maakte",
"maakten",
"maar",
"mag",
"maken",
"me",
"meer",
"meest",
"meestal",
"men",
"met",
"mevr",
"mij",
"mijn",
"minder",
"miss",
"misschien",
"missen",
"mits",
"mocht",
"mochten",
"moest",
"moesten",
"moet",
"moeten",
"mogen",
"mr",
"mrs",
"mw",
"na",
"naar",
"nam",
"namelijk",
"nee",
"neem",
"negen",
"nemen",
"nergens",
"niemand",
"niet",
"niets",
"niks",
"noch",
"nochtans",
"nog",
"nooit",
"nu",
"nv",
"of",
"om",
"omdat",
"ondanks",
"onder",
"ondertussen",
"ons",
"onze",
"onzeker",
"ooit",
"ook",
"op",
"over",
"overal",
"overige",
"paar",
"per",
"recent",
"redelijk",
"samen",
"sinds",
"steeds",
"te",
"tegen",
"tegenover",
"thans",
"tien",
"tiende",
"tijdens",
"tja",
"toch",
"toe",
"tot",
"totdat",
"tussen",
"twee",
"tweede",
"u",
"uit",
"uw",
"vaak",
"van",
"vanaf",
"veel",
"veertig",
"verder",
"verscheidene",
"verschillende",
"via",
"vier",
"vierde",
"vijf",
"vijfde",
"vijftig",
"volgend",
"volgens",
"voor",
"voordat",
"voorts",
"waar",
"waarom",
"waarschijnlijk",
"wanneer",
"waren",
"was",
"wat",
"we",
"wederom",
"weer",
"weinig",
"wel",
"welk",
"welke",
"werd",
"werden",
"werder",
"whatever",
"wie",
"wij",
"wijzelf",
"wil",
"wilden",
"willen",
"word",
"worden",
"wordt",
"zal",
"ze",
"zei",
"zeker",
"zelf",
"zelfde",
"zes",
"zeven",
"zich",
"zij",
"zijn",
"zijzelf",
"zo",
"zoals",
"zodat",
"zou",
"zouden",
"zulk",
"zullen",
],
"no": [
"alle",
"at",
"av",
"bare",
"begge",
"ble",
"blei",
"bli",
"blir",
"blitt",
"både",
"båe",
"da",
"de",
"deg",
"dei",
"deim",
"deira",
"deires",
"dem",
"den",
"denne",
"der",
"dere",
"deres",
"det",
"dette",
"di",
"din",
"disse",
"ditt",
"du",
"dykk",
"dykkar",
"då",
"eg",
"ein",
"eit",
"eitt",
"eller",
"elles",
"en",
"enn",
"er",
"et",
"ett",
"etter",
"for",
"fordi",
"fra",
"før",
"ha",
"hadde",
"han",
"hans",
"har",
"hennar",
"henne",
"hennes",
"her",
"hjå",
"ho",
"hoe",
"honom",
"hoss",
"hossen",
"hun",
"hva",
"hvem",
"hver",
"hvilke",
"hvilken",
"hvis",
"hvor",
"hvordan",
"hvorfor",
"i",
"ikke",
"ikkje",
"ingen",
"ingi",
"inkje",
"inn",
"inni",
"ja",
"jeg",
"kan",
"kom",
"korleis",
"korso",
"kun",
"kunne",
"kva",
"kvar",
"kvarhelst",
"kven",
"kvi",
"kvifor",
"man",
"mange",
"me",
"med",
"medan",
"meg",
"meget",
"mellom",
"men",
"mi",
"min",
"mine",
"mitt",
"mot",
"mykje",
"ned",
"no",
"noe",
"noen",
"noka",
"noko",
"nokon",
"nokor",
"nokre",
"nå",
"når",
"og",
"også",
"om",
"opp",
"oss",
"over",
"på",
"samme",
"seg",
"selv",
"si",
"sia",
"sidan",
"siden",
"sin",
"sine",
"sitt",
"sjøl",
"skal",
"skulle",
"slik",
"so",
"som",
"somme",
"somt",
"så",
"sånn",
"til",
"um",
"upp",
"ut",
"uten",
"var",
"vart",
"varte",
"ved",
"vere",
"verte",
"vi",
"vil",
"ville",
"vore",
"vors",
SYMBOL INDEX (140 symbols across 20 files)
FILE: nlpretext/_utils/daskloader.py
function read_text (line 8) | def read_text(files_path: Union[str, List[str]], encoding: str): # type...
function read_json (line 12) | def read_json(files_path: Union[str, List[str]], encoding: str): # type...
function read_csv (line 16) | def read_csv(files_path: Union[str, List[str]], encoding: str): # type:...
function read_parquet (line 20) | def read_parquet(files_path: Union[str, List[str]], encoding: str): # t...
FILE: nlpretext/_utils/file_loader.py
function detect_encoding (line 26) | def detect_encoding(file_path_or_string: Union[str, bytes], n_lines: int...
function check_text_file_format (line 51) | def check_text_file_format(filepath: Union[str, List[str]]) -> str:
FILE: nlpretext/_utils/pandasloader.py
function _list_handler (line 7) | def _list_handler(func):
function read_text (line 18) | def read_text(file_path: str, encoding: str) -> pd.DataFrame:
function read_json (line 24) | def read_json(file_path: str, encoding: str) -> pd.DataFrame:
function read_csv (line 30) | def read_csv(file_path: str, encoding: str) -> pd.DataFrame:
function read_parquet (line 36) | def read_parquet(file_path: str, encoding: str) -> pd.DataFrame:
FILE: nlpretext/_utils/phone_number.py
function find_phone_numbers (line 24) | def find_phone_numbers(string: str, region_code: Optional[str] = None) -...
function extract_phone_numbers (line 55) | def extract_phone_numbers(text: str, countrylist: List[Optional[str]]) -...
class PhoneParser (line 78) | class PhoneParser:
method __init__ (line 84) | def __init__(self):
method parsed_num (line 90) | def parsed_num(self) -> Optional[_phonenumbers.PhoneNumber]:
method parsed_num (line 94) | def parsed_num(self, value: Optional[_phonenumbers.PhoneNumber]) -> None:
method parse_number (line 97) | def parse_number(
method format_number (line 131) | def format_number(self, num_format: str) -> str:
FILE: nlpretext/_utils/stopwords.py
function get_stopwords (line 24) | def get_stopwords(lang: str = "en") -> List[str]:
FILE: nlpretext/augmentation/text_augmentation.py
class CouldNotAugment (line 10) | class CouldNotAugment(ValueError): # noqa: D101
class UnavailableAugmenter (line 14) | class UnavailableAugmenter(ValueError): # noqa: D101
function augment_text (line 18) | def augment_text(
function process_entities_and_text (line 63) | def process_entities_and_text(
function are_entities_in_augmented_text (line 108) | def are_entities_in_augmented_text(entities: List[Dict[str, Any]], augme...
function get_augmenter (line 142) | def get_augmenter(method: str, stopwords: Optional[List[str]] = None) ->...
function get_augmented_entities (line 168) | def get_augmented_entities(
function clean_sentence_entities (line 212) | def clean_sentence_entities(text: str, entities: List[Dict[str, Any]]) -...
function check_interval_included (line 254) | def check_interval_included(
FILE: nlpretext/basic/preprocess.py
function normalize_whitespace (line 30) | def normalize_whitespace(text: str) -> str:
function remove_whitespace (line 56) | def remove_whitespace(text: str) -> str:
function lower_text (line 75) | def lower_text(text: str) -> str:
function filter_groups (line 90) | def filter_groups(token: str, ignored_stopwords: Optional[List[str]] = N...
function ungroup_ignored_stopwords (line 112) | def ungroup_ignored_stopwords(
function remove_stopwords (line 132) | def remove_stopwords(
function remove_eol_characters (line 189) | def remove_eol_characters(text: str) -> str:
function fix_bad_unicode (line 205) | def fix_bad_unicode(text: str, normalization: str = "NFC") -> str:
function unpack_english_contractions (line 238) | def unpack_english_contractions(text: str) -> str:
function replace_urls (line 282) | def replace_urls(text: str, replace_with: str = "*URL*") -> str:
function replace_emails (line 305) | def replace_emails(text: str, replace_with: str = "*EMAIL*") -> str:
function replace_phone_numbers (line 328) | def replace_phone_numbers(
function replace_numbers (line 376) | def replace_numbers(text: str, replace_with: str = "*NUMBER*") -> str:
function replace_currency_symbols (line 399) | def replace_currency_symbols(text: str, replace_with: Optional[str] = No...
function remove_punct (line 431) | def remove_punct(text: str, marks: Optional[str] = None) -> str:
function remove_accents (line 463) | def remove_accents(text: str, method: str = "unicode") -> str:
function remove_multiple_spaces_and_strip_text (line 502) | def remove_multiple_spaces_and_strip_text(text: str) -> str:
function filter_non_latin_characters (line 523) | def filter_non_latin_characters(text: str) -> str:
FILE: nlpretext/cli/__main__.py
function version_callback (line 17) | def version_callback(value: bool) -> None:
FILE: nlpretext/cli/preprocess.py
function run (line 13) | def run(
FILE: nlpretext/preprocessor.py
class Preprocessor (line 14) | class Preprocessor:
method __init__ (line 15) | def __init__(self):
method pipe (line 20) | def pipe(self, operation: Callable[[Any], Any], args: Optional[Dict[st...
method build_pipeline (line 33) | def build_pipeline(operation_list: List[Dict[Any, Any]]) -> Pipeline:
method run (line 56) | def run(self, text: str) -> str:
FILE: nlpretext/social/preprocess.py
function remove_mentions (line 24) | def remove_mentions(text: str) -> str:
function extract_mentions (line 40) | def extract_mentions(text: str) -> List[str]:
function remove_html_tags (line 56) | def remove_html_tags(text: str) -> str:
function remove_emoji (line 72) | def remove_emoji(text: str) -> str:
function convert_emoji_to_text (line 92) | def convert_emoji_to_text(text: str, code_delimiters: Tuple[str, str] = ...
function extract_emojis (line 112) | def extract_emojis(text: str) -> List[str]:
function extract_hashtags (line 133) | def extract_hashtags(text: str) -> List[str]:
function remove_hashtag (line 150) | def remove_hashtag(text: str) -> str:
FILE: nlpretext/textloader.py
class TextLoader (line 36) | class TextLoader:
method __init__ (line 37) | def __init__(self, text_column="text", encoding="utf-8", file_format=N...
method __repr__ (line 72) | def __repr__(self):
method _read_text_txt (line 82) | def _read_text_txt(self, files_path):
method _read_text_json (line 99) | def _read_text_json(self, files_path):
method _read_text_csv (line 118) | def _read_text_csv(self, files_path):
method _read_text_parquet (line 137) | def _read_text_parquet(self, files_path):
method read_text (line 156) | def read_text(
FILE: nlpretext/token/preprocess.py
function remove_stopwords (line 24) | def remove_stopwords(
function remove_tokens_with_nonletters (line 57) | def remove_tokens_with_nonletters(tokens: List[str]) -> List[str]:
function remove_special_caracters_from_tokenslist (line 77) | def remove_special_caracters_from_tokenslist(tokens: List[str]) -> List[...
function remove_smallwords (line 97) | def remove_smallwords(tokens: List[str], smallwords_threshold: int) -> L...
FILE: nlpretext/token/tokenizer.py
class LanguageNotHandled (line 33) | class LanguageNotHandled(Exception):
class LanguageNotInstalledError (line 37) | class LanguageNotInstalledError(Exception):
class SpacyModel (line 41) | class SpacyModel:
class SingletonSpacyModel (line 42) | class SingletonSpacyModel:
method __init__ (line 43) | def __init__(self, lang: str) -> None:
method __init__ (line 58) | def __init__(self, lang):
method get_lang_model (line 62) | def get_lang_model(self) -> Optional[str]: # noqa: D102
function _load_spacy_model (line 69) | def _load_spacy_model(model: str) -> Any:
function _get_spacy_tokenizer (line 83) | def _get_spacy_tokenizer(lang: str) -> Optional[spacy.tokenizer.Tokenizer]:
function tokenize (line 103) | def tokenize(text: str, lang_module: str = "en_spacy") -> List[str]:
function untokenize (line 145) | def untokenize(tokens: List[str], lang: str = "fr") -> str:
function convert_tokens_to_string (line 165) | def convert_tokens_to_string(tokens_or_str: Optional[Union[str, List[str...
function convert_string_to_tokens (line 175) | def convert_string_to_tokens( # noqa: D103
FILE: tests/test_data_augmentation.py
function test_process_entities_and_text_not_altered (line 35) | def test_process_entities_and_text_not_altered(text, text_augmented, ent...
function test_process_entities_and_text_altered (line 54) | def test_process_entities_and_text_altered(text, text_augmented, entities):
function test_get_augmenter (line 62) | def test_get_augmenter():
FILE: tests/test_file_loader.py
function create_files (line 27) | def create_files():
function test_detect_encoding (line 38) | def test_detect_encoding():
function remove_files (line 46) | def remove_files():
function test_check_text_file_format (line 105) | def test_check_text_file_format(input_filepath, raising, expected_str):
FILE: tests/test_phone_number.py
function test_extract_phone_number (line 22) | def test_extract_phone_number():
function test_extract_phone_number_us (line 29) | def test_extract_phone_number_us():
function test_extract_phone_number_fr (line 36) | def test_extract_phone_number_fr():
function test_extract_phone_number_international (line 43) | def test_extract_phone_number_international():
function test_phone_parser_us (line 50) | def test_phone_parser_us():
function test_phone_parser_fr (line 59) | def test_phone_parser_fr():
FILE: tests/test_preprocessor.py
function test_extract_emojis (line 65) | def test_extract_emojis(text, expected_result):
function test_remove_mentions (line 77) | def test_remove_mentions(text, expected_result):
function test_extract_mentions (line 89) | def test_extract_mentions(text, expected_result):
function test_remove_html_tags (line 104) | def test_remove_html_tags(text, expected_result):
function test_remove_smallwords (line 120) | def test_remove_smallwords(tokens_list, smallwords_threshold, expected_r...
function test_extract_hashtags (line 135) | def test_extract_hashtags(text, expected_result):
function test_remove_hashtag (line 153) | def test_remove_hashtag(text, expected_result):
function test_filter_non_latin_characters (line 167) | def test_filter_non_latin_characters(text, expected_filtered_text):
function test_remove_multiple_spaces_and_strip_text (line 182) | def test_remove_multiple_spaces_and_strip_text(input_str, expected_str):
function test_remove_eol_characters (line 195) | def test_remove_eol_characters(input_str, expected_str):
function test_remove_tokens_with_nonletters (line 200) | def test_remove_tokens_with_nonletters():
function test_remove_special_caracters_from_tokenslist (line 207) | def test_remove_special_caracters_from_tokenslist():
function test_get_stopwords (line 214) | def test_get_stopwords():
function test_remove_stopwords_tokens (line 225) | def test_remove_stopwords_tokens(input_tokens, lang, expected_output):
function test_remove_stopwords_text (line 250) | def test_remove_stopwords_text(
function test_remove_custom_stopwords_text (line 269) | def test_remove_custom_stopwords_text(input_text, lang, custom_stopwords...
function test_remove_accents (line 274) | def test_remove_accents():
function test_fix_bad_unicode (line 306) | def test_fix_bad_unicode(input_str, expected_str):
function test_normalize_whitespace (line 315) | def test_normalize_whitespace(input_str, expected_str):
function test_unpack_english_contractions (line 331) | def test_unpack_english_contractions(input_str, expected_str):
function test_replace_urls (line 352) | def test_replace_urls(input_str, expected_str):
function test_replace_emails (line 365) | def test_replace_emails(input_str, expected_str):
function test_replace_phone_numbers (line 388) | def test_replace_phone_numbers(input_str, expected_str):
function test_replace_numbers (line 406) | def test_replace_numbers(input_str, expected_str):
function test_replace_currency_symbols (line 425) | def test_replace_currency_symbols(input_str, param, expected_str):
function test_remove_punct (line 451) | def test_remove_punct(input_str, param, expected_str):
function test_remove_emoji (line 466) | def test_remove_emoji(input_str, expected_str):
function test_convert_emoji_to_text (line 481) | def test_convert_emoji_to_text(input_str, expected_str):
function test_custom_preprocess (line 486) | def test_custom_preprocess():
function test_apply_preprocessor (line 514) | def test_apply_preprocessor(input_str, expected_str):
FILE: tests/test_textloader.py
function test__read_text_txt_dask (line 43) | def test__read_text_txt_dask(mock_read_text):
function test__read_text_txt_pandas (line 66) | def test__read_text_txt_pandas(mock_read_text):
function test__read_text_json_dask (line 95) | def test__read_text_json_dask(mock_read):
function test__read_text_json_pandas (line 120) | def test__read_text_json_pandas(mock_read):
function test__read_text_csv_dask (line 140) | def test__read_text_csv_dask(mock_read_csv):
function test__read_text_csv_pandas (line 165) | def test__read_text_csv_pandas(mock_read):
function test__read_text_parquet_dask (line 185) | def test__read_text_parquet_dask(mock_read_parquet):
function test__read_text_parquet_pandas (line 210) | def test__read_text_parquet_pandas(mock_read):
function test_read_text (line 257) | def test_read_text(
FILE: tests/test_tokenizer.py
function test_load_spacy_model_validation (line 16) | def test_load_spacy_model_validation(bad_model_name):
Condensed preview: 72 files, each showing path, character count, and a content snippet (full structured content: 496K chars).
[
{
"path": ".dockerignore",
"chars": 340,
"preview": "# Git\n.git\n.gitignore\n.github\n\n# Docker\n.dockerignore\ndocker/\n\n# IDE\n.idea\n.vscode\n\n# Byte-compiled / optimized / DLL fi"
},
{
"path": ".editorconfig",
"chars": 424,
"preview": "# Check http://editorconfig.org for more information\n# This is the main config file for this project:\nroot = true\n\n[*]\nc"
},
{
"path": ".github/.stale.yml",
"chars": 684,
"preview": "# Number of days of inactivity before an issue becomes stale\ndaysUntilStale: 60\n# Number of days of inactivity before a "
},
{
"path": ".github/CODEOWNERS",
"chars": 119,
"preview": "# https://help.github.com/en/articles/about-code-owners\n\n* @julesbertrand @amaleelhamri @hugovasselin @Guillaume6606\n"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.md",
"chars": 726,
"preview": "---\nname: 🐛 Bug report\nabout: If something isn't working 🔧\ntitle: ''\nlabels: bug\nassignees:\n---\n\n## 🐛 Bug Report\n\n<!-- A"
},
{
"path": ".github/ISSUE_TEMPLATE/config.yml",
"chars": 156,
"preview": "# Configuration: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repo"
},
{
"path": ".github/ISSUE_TEMPLATE/feature_request.md",
"chars": 506,
"preview": "---\nname: 🚀 Feature request\nabout: Suggest an idea for this project 🏖\ntitle: ''\nlabels: enhancement\nassignees:\n---\n\n## 🚀"
},
{
"path": ".github/ISSUE_TEMPLATE/question.md",
"chars": 495,
"preview": "---\nname: ❓ Question\nabout: Ask a question about this project 🎓\ntitle: ''\nlabels: question\nassignees:\n---\n\n## Checklist\n"
},
{
"path": ".github/PULL_REQUEST_TEMPLATE.md",
"chars": 1180,
"preview": "## Description\n\n<!-- Add a more detailed description of the changes if needed. -->\n\n## Related Issue\n\n<!-- If your PR re"
},
{
"path": ".github/dependabot.yml",
"chars": 1014,
"preview": "# Configuration: https://dependabot.com/docs/config-file/\n# Docs: https://docs.github.com/en/github/administering-a-repo"
},
{
"path": ".github/release-drafter.yml",
"chars": 787,
"preview": "# Release drafter configuration https://github.com/release-drafter/release-drafter#configuration\n# Emojis were chosen to"
},
{
"path": ".github/workflows/cd.yml",
"chars": 2715,
"preview": "name: Continuous Deployment\non:\n release:\n types: [published]\n\njobs:\n\n docker:\n\n runs-on: ubuntu-latest\n\n ste"
},
{
"path": ".github/workflows/ci.yml",
"chars": 2252,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": ".github/workflows/greetings.yml",
"chars": 744,
"preview": "name: Greetings\n\non:\n pull_request:\n types:\n - opened\n - reopened\n - edited\n - labeled\n - u"
},
{
"path": ".github/workflows/release-drafter.yml",
"chars": 396,
"preview": "name: Release Drafter\n\non:\n push:\n # branches to consider in the event; optional, defaults to all\n branches:\n "
},
{
"path": ".gitignore",
"chars": 10722,
"preview": "# Created by https://www.gitignore.io/api/osx,python,pycharm,windows,visualstudio,visualstudiocode\n# Edit at https://www"
},
{
"path": ".pre-commit-config.yaml",
"chars": 1398,
"preview": "default_language_version:\n python: python3.10\n\n\nrepos:\n - repo: https://github.com/pre-commit/pre-commit-hooks\n rev"
},
{
"path": "CODE_OF_CONDUCT.md",
"chars": 3362,
"preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, w"
},
{
"path": "CONTRIBUTING.md",
"chars": 2117,
"preview": "NLPretext\n==============================\n\n# How to contribute\n\n## Dependencies\n\nWe use `poetry` to manage the [dependenc"
},
{
"path": "LICENSE",
"chars": 11461,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "Makefile",
"chars": 3402,
"preview": "SHELL := /usr/bin/env bash\n\nIMAGE := nlpretext\nVERSION := latest\n\nNO_CHECK_FLAG = || true\n\nifeq ($(STRICT), 1)\n\tPOETRY_"
},
{
"path": "README.md",
"chars": 13646,
"preview": "# NLPretext\n\n<p align=\"center\">\n <img src=\"/references/logo_nlpretext.png\" />\n</p>\n\n<div align=\"center\">\n\n[![CI statu"
},
{
"path": "SECURITY.md",
"chars": 1186,
"preview": "# Security\n\n## 🔐 Reporting Security Issues\n\n> Do not open issues that might have security implications!\n> It is critical"
},
{
"path": "datasets/external/get_language_dataset.sh",
"chars": 1005,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "datasets/external/get_stanfordtweets.sh",
"chars": 1100,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "docker/Dockerfile",
"chars": 568,
"preview": "FROM python:3.10-slim-buster\n\nENV LANG=C.UTF-8 \\\n LC_ALL=C.UTF-8\n\nRUN apt-get update && \\\n apt-get install -y --no-ins"
},
{
"path": "docker/README.md",
"chars": 764,
"preview": "# Docker for nlpretext\n\n## Installation\n\nTo create Docker you need to run:\n\n```bash\nmake docker\n```\n\nwhich is equivalent"
},
{
"path": "docs/Makefile",
"chars": 959,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
},
{
"path": "docs/make.bat",
"chars": 4105,
"preview": "@ECHO OFF\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=sphinx-build\n)\nset BUI"
},
{
"path": "docs/scripts/buildsite.sh",
"chars": 1831,
"preview": "#!/bin/bash\n\nexport SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)\n\n##############\n# BUILD DOCS #\n##############\n\n# Python"
},
{
"path": "docs/source/_templates/module.rst_t",
"chars": 186,
"preview": "\n{%- if show_headings %}\n{{- [basename] | join(' ') | e | heading }}\n\n{% endif -%}\n.. automodule:: {{ qualname }}\n{%- fo"
},
{
"path": "docs/source/_templates/package.rst_t",
"chars": 1113,
"preview": "\n{%- macro automodule(modname, options) -%}\n.. automodule:: {{ modname }}\n{%- for option in options %}\n :{{ option }}:"
},
{
"path": "docs/source/_templates/versions.html",
"chars": 807,
"preview": "\n{%- if current_version %}\n<div class=\"rst-versions\" data-toggle=\"rst-versions\" role=\"note\" aria-label=\"versions\">\n <sp"
},
{
"path": "docs/source/conf.py",
"chars": 2723,
"preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
},
{
"path": "docs/source/index.rst",
"chars": 776,
"preview": "=========\nNLPretext\n=========\n\n\nWelcome to NLPretext's documentation!\n========================================\n\nThe NLPr"
},
{
"path": "docs/source/tutorials/basic_notebook.ipynb",
"chars": 2439,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# How to use the package in a noteb"
},
{
"path": "docs/source/tutorials/index.rst",
"chars": 83,
"preview": "Tutorials\n=========\n\n\n.. toctree::\n :maxdepth: 4\n :glob:\n\n basic_notebook\n"
},
{
"path": "nlpretext/__init__.py",
"chars": 1271,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/_config/__init__.py",
"chars": 840,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/_config/config.py",
"chars": 10165,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/_config/constants.py",
"chars": 6585,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "nlpretext/_config/stopwords.py",
"chars": 217637,
"preview": "STOPWORDS = {\n \"af\": [\n \"'n\",\n \"aan\",\n \"af\",\n \"al\",\n \"as\",\n \"baie\",\n "
},
{
"path": "nlpretext/_utils/__init__.py",
"chars": 840,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/_utils/daskloader.py",
"chars": 707,
"preview": "# mypy: disable-error-code=\"attr-defined\"\nfrom typing import List, Union\n\nimport dask.bag as db\nimport dask.dataframe as"
},
{
"path": "nlpretext/_utils/file_loader.py",
"chars": 2754,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/_utils/pandasloader.py",
"chars": 1032,
"preview": "from typing import List, Union\n\nimport pandas as pd\nfrom fsspec import open_files\n\n\ndef _list_handler(func):\n def wra"
},
{
"path": "nlpretext/_utils/phone_number.py",
"chars": 5294,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/_utils/stopwords.py",
"chars": 2604,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "nlpretext/augmentation/__init__.py",
"chars": 840,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/augmentation/text_augmentation.py",
"chars": 9156,
"preview": "from typing import Any, Dict, List, Optional, Tuple\n\nimport logging\nimport re\nfrom itertools import combinations\n\nimport"
},
{
"path": "nlpretext/basic/__init__.py",
"chars": 840,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/basic/preprocess.py",
"chars": 14699,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "nlpretext/cli/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "nlpretext/cli/__main__.py",
"chars": 606,
"preview": "# mypy: disable-error-code=\"attr-defined\"\n\nimport typer\nfrom nlpretext import __version__\nfrom nlpretext.cli import prep"
},
{
"path": "nlpretext/cli/preprocess.py",
"chars": 1179,
"preview": "from typing import List\n\nimport typer\nfrom nlpretext.preprocessor import Preprocessor\nfrom nlpretext.textloader import T"
},
{
"path": "nlpretext/preprocessor.py",
"chars": 2471,
"preview": "from typing import Any, Callable, Dict, List, Optional\n\nfrom nlpretext.basic.preprocess import fix_bad_unicode, normaliz"
},
{
"path": "nlpretext/py.typed",
"chars": 0,
"preview": ""
},
{
"path": "nlpretext/social/__init__.py",
"chars": 840,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/social/preprocess.py",
"chars": 4009,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "nlpretext/textloader.py",
"chars": 7016,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "nlpretext/token/__init__.py",
"chars": 840,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "nlpretext/token/preprocess.py",
"chars": 3024,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "nlpretext/token/tokenizer.py",
"chars": 5654,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "pyproject.toml",
"chars": 3502,
"preview": "# Poetry pyproject.toml: https://python-poetry.org/docs/pyproject/\n\n[build-system]\nrequires = [\"poetry_core>=1.0.0\"]\nbui"
},
{
"path": "references/.gitkeep",
"chars": 0,
"preview": ""
},
{
"path": "tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/test_data_augmentation.py",
"chars": 2755,
"preview": "import pytest\nfrom nlpretext.augmentation.text_augmentation import (\n CouldNotAugment,\n UnavailableAugmenter,\n "
},
{
"path": "tests/test_file_loader.py",
"chars": 3990,
"preview": "# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# Licensed under the Apache License, Version 2.0 (the"
},
{
"path": "tests/test_phone_number.py",
"chars": 2351,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "tests/test_preprocessor.py",
"chars": 17815,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "tests/test_textloader.py",
"chars": 10516,
"preview": "# GNU Lesser General Public License v3.0 only\n# Copyright (C) 2020 Artefact\n# licence-information@artefact.com\n#\n# This "
},
{
"path": "tests/test_tokenizer.py",
"chars": 630,
"preview": "import pytest\nfrom nlpretext.token.tokenizer import LanguageNotInstalledError, _load_spacy_model\n\n\n@pytest.mark.parametr"
}
]
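Each entry in the preview above is an object with `path`, `chars`, and `preview` keys, so the list can be loaded with a standard JSON parser. The following is a minimal sketch, assuming the array has been saved as text; the three sample entries are copied from the preview with their `preview` strings shortened, and the helper names `largest_files` and `empty_files` are hypothetical, not part of NLPretext or the extraction tool.

```python
import json

# Three entries copied from the preview above; "preview" values are shortened.
SAMPLE = """
[
  {"path": ".dockerignore", "chars": 340, "preview": "# Git"},
  {"path": "nlpretext/_config/stopwords.py", "chars": 217637, "preview": "STOPWORDS = {"},
  {"path": "nlpretext/py.typed", "chars": 0, "preview": ""}
]
"""


def largest_files(entries, n=1):
    """Return the n entries with the highest character count (hypothetical helper)."""
    return sorted(entries, key=lambda e: e["chars"], reverse=True)[:n]


def empty_files(entries):
    """Return the paths of entries whose character count is zero (hypothetical helper)."""
    return [e["path"] for e in entries if e["chars"] == 0]


entries = json.loads(SAMPLE)
print(largest_files(entries)[0]["path"])
print(empty_files(entries))
```

Sorting by `chars` is a quick way to spot files that dominate the dump, such as the stopword table, before deciding which files to read in full.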
About this extraction
This page contains the full source code of the artefactory/NLPretext GitHub repository, extracted and formatted as plain text. The extraction includes 72 files (410.9 KB), approximately 118.8k tokens, and a symbol index with 140 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.