Repository: aalto-speech/morfessor
Branch: master
Commit: 5314d3ebc1be
Files: 32
Total size: 182.3 KB
Directory structure:
gitextract_ss7q8p0b/
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README
├── docs/
│ ├── Makefile
│ ├── README
│ ├── build_requirements.txt
│ ├── make.bat
│ └── source/
│ ├── cmdtools.rst
│ ├── conf.py
│ ├── filetypes.rst
│ ├── general.rst
│ ├── index.rst
│ ├── installation.rst
│ ├── libinterface.rst
│ └── license.rst
├── ez_setup.py
├── morfessor/
│ ├── __init__.py
│ ├── baseline.py
│ ├── cmd.py
│ ├── evaluation.py
│ ├── exception.py
│ ├── io.py
│ ├── test/
│ │ ├── __init__.py
│ │ └── evaluation.py
│ └── utils.py
├── scripts/
│ ├── morfessor
│ ├── morfessor-evaluate
│ ├── morfessor-segment
│ ├── morfessor-train
│ └── tools/
│ └── morphlength_from_annotations.py
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.py[cod]
# C extensions
*.so
# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
MANIFEST
env*
# Installer logs
pip-log.txt
# Unit test / coverage reports
.coverage
.tox
nosetests.xml
# Translations
*.mo
#Idea IDE
.idea
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
================================================
FILE: LICENSE
================================================
Copyright (c) 2012-2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: MANIFEST.in
================================================
include LICENSE
include ez_setup.py
include docs/build_requirements.txt
================================================
FILE: README
================================================
Morfessor 2.0 - Quick start
===========================
Installation
------------
Morfessor 2.0 is installed using the setuptools library for Python. To
build and install the module and scripts to default paths, type
python setup.py install
For details, see http://docs.python.org/install/
Documentation
-------------
User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions on how
to build the documentation can be found in docs/README.
The documentation is also available online at http://morfessor.readthedocs.org/
Details of the implemented algorithms and methods and a set of
experiments are described in the following technical report:
Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko
Kurimo. Morfessor 2.0: Python Implementation and Extensions for
Morfessor Baseline. Aalto University publication series SCIENCE +
TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN
978-952-60-5501-5.
The report is available online at
http://urn.fi/URN:ISBN:978-952-60-5501-5
Contact
-------
Questions or feedback? Email: morpho@aalto.fi
================================================
FILE: docs/Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/Morfessor.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/Morfessor.qhc"
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/Morfessor"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/Morfessor"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: docs/README
================================================
Generating Documentation
------------------------
The user instructions for Morfessor 2.0 are available as Sphinx source
files (see http://sphinx-doc.org/). To build the documentation you need
both the 'sphinx' and the 'sphinxcontrib-napoleon' packages. With a recent
version of pip you can run::
pip install -e .[docs]
to automatically install the required dependencies for making the docs.
After installing Sphinx, you can generate the documentation in different
formats using the Makefile or make.bat in the directory "docs". For
example, to generate a PDF file, type "make latexpdf", and to generate
a single HTML file, type "make singlehtml". Type "make help" to see
all available formats.
The documentation can also be read online at http://morfessor.readthedocs.org/
================================================
FILE: docs/build_requirements.txt
================================================
sphinx
sphinxcontrib-napoleon
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set BUILDDIR=build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
set I18NSPHINXOPTS=%SPHINXOPTS% source
if NOT "%PAPER%" == "" (
set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)
if "%1" == "" goto help
if "%1" == "help" (
:help
echo.Please use `make ^<target^>` where ^<target^> is one of
echo. html to make standalone HTML files
echo. dirhtml to make HTML files named index.html in directories
echo. singlehtml to make a single large HTML file
echo. pickle to make pickle files
echo. json to make JSON files
echo. htmlhelp to make HTML files and a HTML help project
echo. qthelp to make HTML files and a qthelp project
echo. devhelp to make HTML files and a Devhelp project
echo. epub to make an epub
echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
echo. text to make text files
echo. man to make manual pages
echo. texinfo to make Texinfo files
echo. gettext to make PO message catalogs
echo. changes to make an overview over all changed/added/deprecated items
echo. xml to make Docutils-native XML files
echo. pseudoxml to make pseudoxml-XML files for display purposes
echo. linkcheck to check all external links for integrity
echo. doctest to run all doctests embedded in the documentation if enabled
goto end
)
if "%1" == "clean" (
for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
del /q /s %BUILDDIR%\*
goto end
)
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
if "%1" == "html" (
%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/html.
goto end
)
if "%1" == "dirhtml" (
%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
goto end
)
if "%1" == "singlehtml" (
%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
goto end
)
if "%1" == "pickle" (
%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the pickle files.
goto end
)
if "%1" == "json" (
%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the JSON files.
goto end
)
if "%1" == "htmlhelp" (
%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
goto end
)
if "%1" == "qthelp" (
%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
echo.^> qcollectiongenerator %BUILDDIR%\qthelp\Morfessor.qhcp
echo.To view the help file:
echo.^> assistant -collectionFile %BUILDDIR%\qthelp\Morfessor.qhc
goto end
)
if "%1" == "devhelp" (
%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished.
goto end
)
if "%1" == "epub" (
%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The epub file is in %BUILDDIR%/epub.
goto end
)
if "%1" == "latex" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
if errorlevel 1 exit /b 1
echo.
echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdf" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf
cd %BUILDDIR%/..
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdfja" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf-ja
cd %BUILDDIR%/..
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "text" (
%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The text files are in %BUILDDIR%/text.
goto end
)
if "%1" == "man" (
%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The manual pages are in %BUILDDIR%/man.
goto end
)
if "%1" == "texinfo" (
%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
goto end
)
if "%1" == "gettext" (
%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
goto end
)
if "%1" == "changes" (
%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
if errorlevel 1 exit /b 1
echo.
echo.The overview file is in %BUILDDIR%/changes.
goto end
)
if "%1" == "linkcheck" (
%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
if errorlevel 1 exit /b 1
echo.
echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
goto end
)
if "%1" == "doctest" (
%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
if errorlevel 1 exit /b 1
echo.
echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
goto end
)
if "%1" == "xml" (
%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The XML files are in %BUILDDIR%/xml.
goto end
)
if "%1" == "pseudoxml" (
%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
goto end
)
:end
================================================
FILE: docs/source/cmdtools.rst
================================================
Command line tools
==================
The installation process installs four scripts into the appropriate PATH.
morfessor
---------
The morfessor command is a full-featured script for training, updating models
and segmenting test data.
Loading existing model
~~~~~~~~~~~~~~~~~~~~~~
``-l <file>``
load :ref:`binary-model-def`
``-L <file>``
load :ref:`morfessor1-model-def`
Loading data
~~~~~~~~~~~~
``-t <file>, --traindata <file>``
Input corpus file(s) for training (text or bz2/gzipped text; use '-'
for standard input; add several times in order to append multiple files).
By default, all lines are split on whitespace and the tokens are used as
compounds. The ``--traindata-list`` option can be used to read all input
files as a list of compounds, one compound per line optionally prefixed by
a count. See :ref:`data-format-options` for changing the delimiters used for
separating compounds and atoms.
``--traindata-list``
Interpret all training files as list files instead of corpus files. A list
file contains one compound per line with optionally a count as prefix.
``-T <file>, --testdata <file>``
Input corpus file(s) to analyze (text or bz2/gzipped text; use '-' for
standard input; add several times in order to append multiple files). The
file is read in the same manner as an input corpus file. See
:ref:`data-format-options` for changing the delimiters used for
separating compounds and atoms.
Training model options
~~~~~~~~~~~~~~~~~~~~~~
``-m <mode>, --mode <mode>``
Morfessor can run in different modes, each doing different actions on the
model. The modes are:
none
Do not initialize or train a model. Can be used when just loading a model
for segmenting new data
init
Create a new model and load input data. Does not train the model
batch
Load an existing model (which is already initialized with training
data) and run :ref:`batch-training`
init+batch
Create a new model, load input data and run :ref:`batch-training`.
**Default**
online
Create a new model, read and train the model concurrently as described
in :ref:`online-training`
online+batch
First read and train the model concurrently as described in
:ref:`online-training` and after that retrain the model using
:ref:`batch-training`
``-a <algorithm>, --algorithm <algorithm>``
Algorithm to use for training:
recursive
Recursive, as described in :ref:`recursive-training` **Default**
viterbi
Viterbi as described in :ref:`viterbi-training`
``-d <type>, --dampening <type>``
Method for changing the compound counts in the input data. Options:
none
Do not alter the counts of compounds (token based training)
log
Change the count :math:`x` of a compound to :math:`\log(x)` (log-token
based training)
ones
Treat all compounds as if they only occurred once (type based training)
``-f <list>, --forcesplit <list>``
A list of atoms that always force the compound to be split. By
default only the hyphen (``-``) forces a split. Note the notation of the
argument list. To have no forced-split characters, use an empty string as the
argument (``-f ""``). To split on, for example, both the hyphen (``-``) and the
apostrophe (``'``), use ``-f "-'"``
``-F <float>, --finish-threshold <float>``
Stopping threshold. Training stops when the decrease in model cost of the
last iteration is smaller than finish_threshold * #boundaries (default
'0.005')
``-r <seed>, --randseed <seed>``
Seed for random number generator
``-R <float>, --randsplit <float>``
Initialize new words by random splitting using the
given split probability (default no splitting). See :ref:`rand-init`
``--skips``
Use random skips for frequently seen compounds to
speed up training. See :ref:`rand-skips`
``--batch-minfreq <int>``
Compound frequency threshold for batch training
(default 1)
``--max-epochs <int>``
Hard maximum of epochs in training
``--nosplit-re <regexp>``
If the expression matches the two surrounding
characters, do not allow splitting (default None)
``--online-epochint <int>``
Epoch interval for online training (default 10000)
``--viterbi-smoothing <float>``
Additive smoothing parameter for Viterbi training and
segmentation (default 0).
``--viterbi-maxlen <int>``
Maximum construction length in Viterbi training and
segmentation (default 30)
Saving model
~~~~~~~~~~~~
``-s <file>``
save :ref:`binary-model-def`
``-S <file>``
save :ref:`morfessor1-model-def`
``--save-reduced``
save :ref:`binary-reduced-model-def`
Examples
~~~~~~~~
Training a model from inputdata.txt, saving a :ref:`morfessor1-model-def` and
segmenting the test.txt set: ::
morfessor -t inputdata.txt -S model.segm -T test.txt
morfessor-train
---------------
The morfessor-train command is a convenience command that enables easier
training of Morfessor models.
The basic command structure is: ::
morfessor-train [arguments] traindata-file [traindata-file ...]
The arguments are identical to the ones for the `morfessor`_ command. The most
relevant are:
``-s <file>``
save binary model
``-S <file>``
save Morfessor 1.0 style model
``--save-reduced``
save reduced binary model
Examples
~~~~~~~~
Train a Morfessor model from a word-count list in ISO_8859-15, doing type-based
training, writing the log to log.log, and saving the model as model.bin: ::
morfessor-train --encoding=ISO_8859-15 --traindata-list --logfile=log.log -s model.bin -d ones traindata.txt
morfessor-segment
-----------------
The morfessor-segment command is a convenience command that enables easier
segmentation of test data with a Morfessor model.
The basic command structure is: ::
morfessor-segment [arguments] testcorpus-file [testcorpus-file ...]
The arguments are identical to the ones for the `morfessor`_ command. The most
relevant are:
``-l <file>``
load binary model (normal or reduced)
``-L <file>``
load Morfessor 1.0 style model
Examples
~~~~~~~~
Loading a binary model and segmenting the words in testdata.txt: ::
morfessor-segment -l model.bin testdata.txt
morfessor-evaluate
------------------
The morfessor-evaluate command is used for evaluating a Morfessor model against
a gold standard. If multiple models are evaluated, it reports statistically
significant differences between them.
The basic command structure is: ::
morfessor-evaluate [arguments] <goldstandard> <model> [<model> ...]
Positional arguments
~~~~~~~~~~~~~~~~~~~~
``<goldstandard>``
gold standard file in standard annotation format
``<model>``
model files to segment (either binary or Morfessor 1.0 style segmentation
models).
Optional arguments
~~~~~~~~~~~~~~~~~~
``-t TEST_SEGMENTATIONS, --testsegmentation TEST_SEGMENTATIONS``
Segmentation of the test set. Note that all words in the gold standard must
be segmented
``--num-samples <int>``
number of samples to take for testing
``--sample-size <int>``
size of each testing sample
``--format-string <format>``
Python new-style format string used to report evaluation results. The
available variables consist of a value and an action separated by an
underscore, e.g. fscore_avg for the average f-score. The available
values are "precision", "recall", "fscore", "samplesize" and the available
actions are "avg", "max", "min", "values", "count". A final meta-data variable
(without an action) is "name", the filename of the model. See also the
format-template option for predefined strings.
``--format-template <template>``
Uses a template string for the format-string options. Available templates
are: default, table and latex. If format-string is defined this option is
ignored.
Examples
~~~~~~~~
Evaluating three different models against a gold standard, outputting the
results in LaTeX table format: ::
morfessor-evaluate --format-template=latex goldstd.txt model1.bin model2.segm model3.bin
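Reporting only the model name and the average f-score with a custom format
string (file names as in the previous example): ::
morfessor-evaluate --format-string '{name} {fscore_avg}' goldstd.txt model1.bin model2.segm model3.bin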
.. _data-format-options:
Data format command line options
--------------------------------
``--encoding <encoding>``
Encoding of input and output files (if none is given, both the local
encoding and UTF-8 are tried).
``--lowercase``
lowercase input data
``--traindata-list``
input file(s) for batch training are lists (one compound per line,
optionally count as a prefix)
``--atom-separator <regexp>``
atom separator regexp (default None)
``--compound-separator <regexp>``
compound separator regexp (default '\s+')
``--analysis-separator <str>``
separator for different analyses in an annotation file. Use NONE for only
allowing one analysis per line
``--output-format <format>``
format string for --output file (default: '{analysis}\\n'). Valid keywords
are: ``{analysis}`` = constructions of the compound, ``{compound}`` =
compound string, ``{count}`` = count of the compound (currently always 1),
``{logprob}`` = log-probability of the analysis, and ``{clogprob}`` =
log-probability of the compound. Valid escape sequences are ``\n`` (newline)
and ``\t`` (tab). An example invocation is shown after this list.
``--output-format-separator <str>``
construction separator for analysis in --output file (default: ' ')
``--output-newlines``
for each newline in input, print newline in --output file (default: 'False')
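For example, a hypothetical invocation that combines the options above to
segment a test file, writing each compound with its analysis and
log-probability separated by tabs: ::
morfessor -l model.bin -T test.txt --output out.txt --output-format '{compound}\t{analysis}\t{logprob}\n'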
Universal command line options
------------------------------
``-v <int>, --verbose <int>``
verbose level; controls what is written to the standard error stream or log file (default 1)
``--logfile <file>``
write log messages to file in addition to standard error stream
``--progressbar``
Force the progressbar to be displayed (possibly lowers the log level for the standard error stream)
``-h, --help``
show this help message and exit
``--version``
show version number and exit
Morfessor features
==================
All features below are described briefly, mainly to guide the choice of the
right value for a certain parameter. These features are explained in detail in
the :ref:`morfessor-tech-report`.
.. _`batch-training`:
Batch training
--------------
In batch training, each epoch consists of an iteration over the full training
data. Epochs are repeated until the model cost converges. All training data
must be loaded before the training starts.
.. _`online-training`:
Online training
---------------
In online training the model is updated while the data is being added. This
allows for rapid testing and prototyping. All data is processed only once,
hence it is advisable to run :ref:`batch-training` afterwards. The size of an
epoch is a fixed, predefined number of compounds processed. The only use of an
epoch for online training is to select the best annotations in semi-supervised
training.
.. _`recursive-training`:
Recursive training
------------------
In recursive training, each compound is processed in the following manner. The
current split for the compound is removed from the model and its constructions
are updated accordingly. After this, all possible splits are tried, by choosing
one split and running the algorithm recursively on the created constructions.
In the end, the best split is selected and the training continues with the next
compound.
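The following minimal Python sketch illustrates the recursive search. It is an
illustration only, not Morfessor's actual implementation: the real algorithm
updates the model counts and evaluates the full model cost while searching,
whereas here ``cost`` is assumed to be a user-supplied function that scores a
tuple of constructions (lower is better): ::
def recursive_split(construction, cost, cache=None):
    """Return the lowest-cost segmentation of construction as a tuple."""
    if cache is None:
        cache = {}
    if construction in cache:
        return cache[construction]
    # Start from the unsplit construction ...
    best = (construction,)
    best_cost = cost(best)
    # ... then try each split point, recursing on both created parts
    for i in range(1, len(construction)):
        parts = (recursive_split(construction[:i], cost, cache)
                 + recursive_split(construction[i:], cost, cache))
        if cost(parts) < best_cost:
            best, best_cost = parts, cost(parts)
    cache[construction] = best
    return best
# Example: with a suitable cost function, recursive_split('kahvikakku', cost)
# could return ('kahvi', 'kakku')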
.. _`viterbi-training`:
Local Viterbi training
----------------------
In Local Viterbi training the compounds are processed sequentially. Each
compound is removed from the corpus and afterwards segmented using Viterbi
segmentation. The result is put back into the model.
In order to allow new constructions to be created, the smoothing parameter
must be given some non-zero value.
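As an illustration, the Viterbi search for the best segmentation of a single
compound can be sketched as below. This is a simplified sketch, not
Morfessor's implementation; ``logprob`` is assumed to be a user-supplied
function returning the (smoothed) log-probability of a construction, and the
default ``maxlen`` mirrors the ``--viterbi-maxlen`` default: ::
def viterbi_segment(word, logprob, maxlen=30):
    """Return the most probable segmentation of word as a list."""
    # best[i] = (cost, j): the best segmentation of the prefix word[:i]
    # has this cost and its last construction starts at index j
    best = [(0.0, 0)]
    for i in range(1, len(word) + 1):
        best.append(min(
            (best[j][0] - logprob(word[j:i]), j)
            for j in range(max(0, i - maxlen), i)))
    # Trace the split points backwards
    parts, i = [], len(word)
    while i > 0:
        j = best[i][1]
        parts.append(word[j:i])
        i = j
    return parts[::-1]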
.. _`rand-skips`:
Random skips
------------
In Random skips, frequently seen compounds are skipped in training with a
random probability. As shown in the :ref:`morfessor-tech-report` this speeds
up the training considerably with only a minor loss in model performance.
.. _`rand-init`:
Random initialization
---------------------
In random initialization all compounds are split randomly. Each possible
boundary is made a split with the given probability.
Selecting a good random initialization parameter helps in finding better local
optima, as long as the split probability is high enough.
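A minimal sketch of this initialization (an illustration, not the actual
implementation): ::
import random

def random_split(compound, threshold):
    """Split compound at each possible boundary with probability threshold."""
    parts, start = [], 0
    for i in range(1, len(compound)):
        if random.random() < threshold:
            parts.append(compound[start:i])
            start = i
    parts.append(compound[start:])
    return parts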
.. _`corpusweight`:
Corpusweight (alpha) tuning
---------------------------
An important parameter of the Morfessor Baseline model is the corpusweight
(:math:`\alpha`), which balances the cost of the lexicon and the corpus. There
are different options available for tuning this weight:
Fixed weight (``--corpusweight``)
The weight is fixed at the beginning of the training and does not change
Development set (``--develset``)
A development set is used to balance the corpusweight so that the precision
and recall of segmenting the development set will be equal
Morph length (``--morph-length``)
The corpusweight is tuned so that the average length of morphs in the
lexicon will be as desired
Num morph types (``--num-morph-types``)
The corpusweight is tuned so that there will be approximately the desired
number of morph types in the lexicon
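For example, to tune the corpus weight on a development set while training
(the file names here are hypothetical): ::
morfessor -t traindata.txt -S model.segm --develset develdata.txt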
================================================
FILE: docs/source/conf.py
================================================
# -*- coding: utf-8 -*-
#
# Morfessor documentation build configuration file, created by
# sphinx-quickstart on Wed Dec 4 13:41:43 2013.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sys
import os
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.mathjax',
'sphinxcontrib.napoleon',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'Morfessor'
copyright = u'2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '2.0'
# The full version, including alpha/beta/rc tags.
release = '2.0.6'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'Morfessordoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'Morfessor.tex', u'Morfessor Documentation',
u'Sami Virpioja and Peter Smit', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'morfessor', u'Morfessor Documentation',
[u'Sami Virpioja and Peter Smit'], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'Morfessor', u'Morfessor Documentation',
u'Sami Virpioja and Peter Smit', 'Morfessor',
'Tool for unsupervised and semi-supervised morphological segmentation.',
'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
================================================
FILE: docs/source/filetypes.rst
================================================
Morfessor file types
====================
.. _binary-model-def:
Binary model
------------
.. warning::
Pickled models are sensitive to bitrot. Sometimes incompatibilities exist
between Python versions that prevent loading a model stored by a different
version. Also, future versions of Morfessor are not guaranteed to be able to
load models saved by older versions.
The standard format for Morfessor 2.0 is a binary model, generated by pickling
the :ref:`BaselineModel <baseline-model-label>` object. This ensures that all
training data, annotation data, and weights are exactly the same as when the
model was saved.
.. _binary-reduced-model-def:
Reduced Binary model
--------------------
A reduced Morfessor model contains only the information that is necessary for
segmenting new words using (n-best) Viterbi segmentation. Reduced binary models
are much smaller than the full models, but no model-modifying actions can be
performed.
.. _morfessor1-model-def:
Morfessor 1.0 style text model
------------------------------
Morfessor 2.0 also supports the text model files that are used in Morfessor
1.0. These files consist of one segmentation per line, preceded by a count,
where the constructions are separated by ' + '.
Specification: ::
<int><space><CONSTRUCTION>[<space>+<space><CONSTRUCTION>]*
Example: ::
10 kahvi + kakku
5 kahvi + kilo + n
24 kahvi + kone + emme
Text corpus file
----------------
A text corpus file is a free-format text file. All lines are split into
compounds using the compound separator (default <space>). The compounds are
then split into atoms using the atom separator. Compounds can occur multiple
times and will be counted as such.
Example: ::
kahvikakku kahvikilon kahvikilon
kahvikoneemme kahvikakku
Word list file
--------------
A word list corpus file contains one compound per line, possibly preceded by a
count. If multiple entries of the same word occur, their counts are summed. If
no count is given, a count of one is assumed (per entry).
Specification: ::
[<int><space>]<COMPOUND>
Example 1: ::
10 kahvikakku
5 kahvikilon
24 kahvikoneemme
Example 2: ::
kahvikakku
kahvikilon
kahvikoneemme
Annotation file
---------------
An annotation file contains, on each line, one compound followed by one or
more annotations. The separators between the annotations (default ', ')
and between the constructions (default ' ') are configurable.
Specification: ::
<compound> <analysis1construction1>[ <analysis1constructionN>][, <analysis2construction1> [<analysis2constructionN>]*]*
Example: ::
kahvikakku kahvi kakku, kahvi kak ku
kahvikilon kahvi kilon
kahvikoneemme kahvi konee mme, kah vi ko nee mme
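Annotation files can be read through the library interface. A short sketch,
assuming the example above is saved as annotations.txt and that
``read_annotations_file`` returns a mapping from each compound to its list of
analyses: ::
import morfessor

io = morfessor.MorfessorIO()
annotations = io.read_annotations_file('annotations.txt')
for compound, analyses in annotations.items():
    print(compound, analyses)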
================================================
FILE: docs/source/general.rst
================================================
General
=======
.. _morfessor-tech-report:
Morfessor 2.0 Technical Report
------------------------------
The work done in Morfessor 2.0 is described in detail in the Morfessor 2.0
Technical Report [TechRep]_. The report is available for download from
http://urn.fi/URN:ISBN:978-952-60-5501-5.
Terminology
-----------
Unlike previous Morfessor implementations, Morfessor 2.0 is, in
principle, applicable to any string segmentation task. Thus we use
terms that are not specific to the morphological segmentation task.
The task of the algorithm is to find a set of *constructions* that
describe the provided training corpus efficiently and accurately. The
training corpus contains a collection of *compounds*, which are the
largest sequences that a single construction can hold. The smallest
pieces of constructions and compounds are called *atoms*.
For example, in morphological segmentation, compounds are word forms,
constructions are morphs, and atoms are characters. In chunking,
compounds are sentences, constructions are phrases, and atoms are
words.
Citing
------
The authors kindly ask that you cite the Morfessor 2.0 technical report
[TechRep]_ when using this tool in academic publications.
In addition, when you refer to the Morfessor algorithms, you should cite the
respective publications where they have been introduced. For example, the first
Morfessor algorithm was published in [Creutz2002]_ and the semi-supervised
extension in [Kohonen2010]_. See [TechRep]_ for further information on the
relevant publications.
.. [TechRep] Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Aalto University publication series SCIENCE + TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN 978-952-60-5501-5.
.. [Creutz2002] Mathias Creutz and Krista Lagus. Unsupervised discovery of morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pages 21-30, Philadelphia, Pennsylvania, 11 July, 2002.
.. [Kohonen2010] Oskar Kohonen, Sami Virpioja and Krista Lagus. Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 78-86, Uppsala, Sweden, July 2010. Association for Computational Linguistics.
================================================
FILE: docs/source/index.rst
================================================
.. Morfessor documentation master file, created by
sphinx-quickstart on Wed Dec 4 13:41:43 2013.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Morfessor 2.0 documentation
=====================================
.. note:: The Morfessor 2.0 documentation is still a work in progress and
contains some unfinished parts
Contents:
.. toctree::
:maxdepth: 2
license
general
installation
filetypes
cmdtools
libinterface
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/source/installation.rst
================================================
Installation instructions
=========================
Morfessor 2.0 is installed using the setuptools library for Python. Morfessor can
be installed from the packages available on the
`Morpho project homepage`_ and the `Morfessor Github page`_, or can be
directly installed from the `Python Package Index (PyPI)`_.
The Morfessor packages are created using the current Python packaging
standards, as described on http://docs.python.org/install/. Morfessor packages
are fully compatible with, and recommended to run in, virtual environments as
described on http://virtualenv.org.
Installation from tarball or zip file
-------------------------------------
The Morfessor 2.0 tarball and zip files can be downloaded from the
`Morpho project homepage`_ (latest stable version) or from the
`Morfessor Github page`_ (all versions).
The tarball can be installed in two different ways. The first is to unpack the
tarball or zip file and run::
python setup.py install
A second method is to use the tool pip on the tarball or zip file directly::
pip install morfessor-VERSION.tar.gz
Installation from PyPI
----------------------
Morfessor 2.0 is also distributed through the `Python Package Index (PyPI)`_.
This means that tools like pip and easy_install can automatically download and
install the latest version of Morfessor.
Simply type::
pip install morfessor
or::
easy_install morfessor
to install the Morfessor library and tools.
.. _Morpho project homepage: http://morpho.aalto.fi
.. _Morfessor Github page: https://github.com/aalto-speech/morfessor/releases
.. _Python Package Index (PyPI): https://pypi.python.org/pypi/Morfessor
================================================
FILE: docs/source/libinterface.rst
================================================
Python library interface to Morfessor
=====================================
Morfessor 2.0 contains a library interface so that it can be integrated into
other Python applications. The public members are documented below and should
remain relatively stable across Morfessor versions. Private members are
documented in the code and may change at any time between releases.
IO class
--------
.. automodule:: morfessor.io
:members:
.. _baseline-model-label:
Model classes
-------------
.. automodule:: morfessor.baseline
:members:
Evaluation classes
------------------
.. automodule:: morfessor.evaluation
:members:
Code Examples for using library interface
=========================================
Segmenting new data using an existing model
-------------------------------------------
::
import morfessor
io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')
words = ['words', 'segmenting', 'morfessor', 'unsupervised']
for word in words:
print(model.viterbi_segment(word))
Testing type vs token models
----------------------------
::
import math
import morfessor
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('training_data'))
model_types = morfessor.BaselineModel()
model_logtokens = morfessor.BaselineModel()
model_tokens = morfessor.BaselineModel()
model_types.load_data(train_data, count_modifier=lambda x: 1)
def log_func(x):
return int(round(math.log(x + 1, 2)))
model_logtokens.load_data(train_data, count_modifier=log_func)
model_tokens.load_data(train_data)
models = [model_types, model_logtokens, model_tokens]
for model in models:
model.train_batch()
goldstd_data = io.read_annotations_file('gold_std')
ev = morfessor.MorfessorEvaluation(goldstd_data)
results = [ev.evaluate_model(m) for m in models]
wsr = morfessor.WilcoxonSignedRank()
r = wsr.significance_test(results)
WilcoxonSignedRank.print_table(r)
The equivalent of this on the command line would be: ::
morfessor-train -s model_types -d ones training_data
morfessor-train -s model_logtokens -d log training_data
morfessor-train -s model_tokens training_data
morfessor-evaluate gold_std model_types model_logtokens model_tokens
Testing different amounts of supervision data
---------------------------------------------
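A sketch of such an experiment is given below. The file names are
hypothetical, and it is assumed that annotations are attached to a model with
``set_annotations``; consult the model class documentation for the exact
interface: ::
import morfessor

io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('training_data'))
annotations = io.read_annotations_file('annotation_data')

models = []
for amount in [100, 1000, 10000]:
    # Use an increasing amount of the available supervision data
    subset = dict(list(annotations.items())[:amount])
    model = morfessor.BaselineModel()
    model.load_data(train_data)
    model.set_annotations(subset)
    model.train_batch()
    models.append(model)

goldstd_data = io.read_annotations_file('gold_std')
ev = morfessor.MorfessorEvaluation(goldstd_data)
results = [ev.evaluate_model(m) for m in models]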
================================================
FILE: docs/source/license.rst
================================================
License
=======
Copyright (c) 2012-2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: ez_setup.py
================================================
#!python
"""Bootstrap setuptools installation
If you want to use setuptools in your package's setup.py, just include this
file in the same directory with it, and add this to the top of your setup.py::
from ez_setup import use_setuptools
use_setuptools()
If you want to require a specific version of setuptools, set a download
mirror, or use an alternate download directory, you can do so by supplying
the appropriate options to ``use_setuptools()``.
This file can also be run as a script to install or upgrade setuptools.
"""
import os
import shutil
import sys
import tempfile
import tarfile
import optparse
import subprocess
from distutils import log
try:
from site import USER_SITE
except ImportError:
USER_SITE = None
DEFAULT_VERSION = "0.9.6"
DEFAULT_URL = "https://pypi.python.org/packages/source/s/setuptools/"
def _python_cmd(*args):
args = (sys.executable,) + args
return subprocess.call(args) == 0
def _install(tarball, install_args=()):
# extracting the tarball
tmpdir = tempfile.mkdtemp()
log.warn('Extracting in %s', tmpdir)
old_wd = os.getcwd()
try:
os.chdir(tmpdir)
tar = tarfile.open(tarball)
_extractall(tar)
tar.close()
# going in the directory
subdir = os.path.join(tmpdir, os.listdir(tmpdir)[0])
os.chdir(subdir)
log.warn('Now working in %s', subdir)
# installing
log.warn('Installing Setuptools')
if not _python_cmd('setup.py', 'install', *install_args):
log.warn('Something went wrong during the installation.')
log.warn('See the error message above.')
# exitcode will be 2
return 2
finally:
os.chdir(old_wd)
shutil.rmtree(tmpdir)
def _build_egg(egg, tarball, to_dir):
# extracting the tarball
tmpdir = tempfile.mkdtemp()
log.warn('Extracting in %s', tmpdir)
old_wd = os.getcwd()
try:
os.chdir(tmpdir)
tar = tarfile.open(tarball)
_extractall(tar)
tar.close()
# going in the directory
subdir = os.path.join(tmpdir, os.listdir(tmpdir)[0])
os.chdir(subdir)
log.warn('Now working in %s', subdir)
# building an egg
log.warn('Building a Setuptools egg in %s', to_dir)
_python_cmd('setup.py', '-q', 'bdist_egg', '--dist-dir', to_dir)
finally:
os.chdir(old_wd)
shutil.rmtree(tmpdir)
# returning the result
log.warn(egg)
if not os.path.exists(egg):
raise IOError('Could not build the egg.')
def _do_download(version, download_base, to_dir, download_delay):
egg = os.path.join(to_dir, 'setuptools-%s-py%d.%d.egg'
% (version, sys.version_info[0], sys.version_info[1]))
if not os.path.exists(egg):
tarball = download_setuptools(version, download_base,
to_dir, download_delay)
_build_egg(egg, tarball, to_dir)
sys.path.insert(0, egg)
import setuptools
setuptools.bootstrap_install_from = egg
def use_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
to_dir=os.curdir, download_delay=15):
# making sure we use the absolute path
to_dir = os.path.abspath(to_dir)
was_imported = 'pkg_resources' in sys.modules or \
'setuptools' in sys.modules
try:
import pkg_resources
except ImportError:
return _do_download(version, download_base, to_dir, download_delay)
try:
pkg_resources.require("setuptools>=" + version)
return
except pkg_resources.VersionConflict:
e = sys.exc_info()[1]
if was_imported:
sys.stderr.write(
"The required version of setuptools (>=%s) is not available,\n"
"and can't be installed while this script is running. Please\n"
"install a more recent version first, using\n"
"'easy_install -U setuptools'."
"\n\n(Currently using %r)\n" % (version, e.args[0]))
sys.exit(2)
else:
del pkg_resources, sys.modules['pkg_resources'] # reload ok
return _do_download(version, download_base, to_dir,
download_delay)
except pkg_resources.DistributionNotFound:
return _do_download(version, download_base, to_dir,
download_delay)
def download_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
to_dir=os.curdir, delay=15):
"""Download setuptools from a specified location and return its filename
`version` should be a valid setuptools version number that is available
as an egg for download under the `download_base` URL (which should end
with a '/'). `to_dir` is the directory where the egg will be downloaded.
`delay` is the number of seconds to pause before an actual download
attempt.
"""
# making sure we use the absolute path
to_dir = os.path.abspath(to_dir)
try:
from urllib.request import urlopen
except ImportError:
from urllib2 import urlopen
tgz_name = "setuptools-%s.tar.gz" % version
url = download_base + tgz_name
saveto = os.path.join(to_dir, tgz_name)
src = dst = None
if not os.path.exists(saveto): # Avoid repeated downloads
try:
log.warn("Downloading %s", url)
src = urlopen(url)
# Read/write all in one block, so we don't create a corrupt file
# if the download is interrupted.
data = src.read()
dst = open(saveto, "wb")
dst.write(data)
finally:
if src:
src.close()
if dst:
dst.close()
return os.path.realpath(saveto)
def _extractall(self, path=".", members=None):
"""Extract all members from the archive to the current working
directory and set owner, modification time and permissions on
directories afterwards. `path' specifies a different directory
to extract to. `members' is optional and must be a subset of the
list returned by getmembers().
"""
import copy
import operator
from tarfile import ExtractError
directories = []
if members is None:
members = self
for tarinfo in members:
if tarinfo.isdir():
# Extract directories with a safe mode.
directories.append(tarinfo)
tarinfo = copy.copy(tarinfo)
tarinfo.mode = 448 # decimal for oct 0700
self.extract(tarinfo, path)
# Reverse sort directories.
if sys.version_info < (2, 4):
def sorter(dir1, dir2):
return cmp(dir1.name, dir2.name)
directories.sort(sorter)
directories.reverse()
else:
directories.sort(key=operator.attrgetter('name'), reverse=True)
# Set correct owner, mtime and filemode on directories.
for tarinfo in directories:
dirpath = os.path.join(path, tarinfo.name)
try:
self.chown(tarinfo, dirpath)
self.utime(tarinfo, dirpath)
self.chmod(tarinfo, dirpath)
except ExtractError:
e = sys.exc_info()[1]
if self.errorlevel > 1:
raise
else:
self._dbg(1, "tarfile: %s" % e)
def _build_install_args(options):
"""
Build the arguments to 'python setup.py install' on the setuptools package
"""
install_args = []
if options.user_install:
if sys.version_info < (2, 6):
log.warn("--user requires Python 2.6 or later")
raise SystemExit(1)
install_args.append('--user')
return install_args
def _parse_args():
"""
Parse the command line for options
"""
parser = optparse.OptionParser()
parser.add_option(
'--user', dest='user_install', action='store_true', default=False,
help='install in user site package (requires Python 2.6 or later)')
parser.add_option(
'--download-base', dest='download_base', metavar="URL",
default=DEFAULT_URL,
help='alternative URL from where to download the setuptools package')
options, args = parser.parse_args()
# positional arguments are ignored
return options
def main(version=DEFAULT_VERSION):
"""Install or upgrade setuptools and EasyInstall"""
options = _parse_args()
tarball = download_setuptools(download_base=options.download_base)
return _install(tarball, _build_install_args(options))
if __name__ == '__main__':
sys.exit(main())
================================================
FILE: morfessor/__init__.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Morfessor 2.0 - Python implementation of the Morfessor method
"""
import logging
__all__ = ['MorfessorException', 'ArgumentException', 'MorfessorIO',
'BaselineModel', 'main', 'get_default_argparser', 'main_evaluation',
'get_evaluation_argparser']
__version__ = '2.0.6'
__author__ = 'Sami Virpioja, Peter Smit, Stig-Arne Grönroos'
__author_email__ = "morpho@aalto.fi"
_logger = logging.getLogger(__name__)
def get_version():
return __version__
# The public api imports need to be at the end of the file,
# so that the package global names are available to the modules
# when they are imported.
from .baseline import BaselineModel, FixedCorpusWeight, AnnotationCorpusWeight, \
NumMorphCorpusWeight, MorphLengthCorpusWeight
from .cmd import main, get_default_argparser, main_evaluation, \
get_evaluation_argparser
from .exception import MorfessorException, ArgumentException
from .io import MorfessorIO
from .utils import _progress
from .evaluation import MorfessorEvaluation, MorfessorEvaluationResult
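# A minimal sketch of using the public API exposed above (the corpus file
# name is hypothetical; the segmentation output depends on the trained
# model):
#
#     import morfessor
#     io = morfessor.MorfessorIO()
#     model = morfessor.BaselineModel()
#     model.load_data(io.read_corpus_files(['corpus.txt']))
#     model.train_batch()
#     print(model.viterbi_segment('uninvited')[0])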
================================================
FILE: morfessor/baseline.py
================================================
import collections
import heapq
import logging
import math
import numbers
import random
import re
from .utils import _progress, _is_string
from .exception import MorfessorException, SegmentOnlyModelException
_logger = logging.getLogger(__name__)
def _constructions_to_str(constructions):
"""Return a readable string for a list of constructions."""
if _is_string(constructions[0]):
# Constructions are strings
return ' + '.join(constructions)
else:
# Constructions are not strings (should be tuples of strings)
return ' + '.join(map(lambda x: ' '.join(x), constructions))
# rcount = root count (from corpus)
# count = total count of the node
# splitloc = integer or tuple. Location(s) of the possible splits for virtual
# constructions; empty tuple or 0 if real construction
ConstrNode = collections.namedtuple('ConstrNode',
['rcount', 'count', 'splitloc'])
class BaselineModel(object):
"""Morfessor Baseline model class.
Implements training of and segmenting with a Morfessor model. The model
is complete agnostic to whether it is used with lists of strings (finding
phrases in sentences) or strings of characters (finding morphs in words).
"""
penalty = -9999.9
def __init__(self, forcesplit_list=None, corpusweight=None,
use_skips=False, nosplit_re=None):
"""Initialize a new model instance.
Arguments:
forcesplit_list: force segmentations on the characters in
the given list
corpusweight: weight for the corpus cost
use_skips: randomly skip frequently occurring constructions
to speed up training
nosplit_re: regular expression string for preventing splitting
in certain contexts
"""
# A ConstrNode is stored in the analyses for each construction. All
# training data has an rcount (real count) > 0. All real morphemes
# have no split locations.
self._analyses = {}
# Flag to indicate the model is only useful for segmentation
self._segment_only = False
# Cost variables
self._lexicon_coding = LexiconEncoding()
self._corpus_coding = CorpusEncoding(self._lexicon_coding)
self._annot_coding = None
#Set corpus weight updater
self.set_corpus_weight_updater(corpusweight)
# Configuration variables
self._use_skips = use_skips # Random skips for frequent constructions
self._supervised = False
# Counter for random skipping
self._counter = collections.Counter()
if forcesplit_list is None:
self.forcesplit_list = []
else:
self.forcesplit_list = forcesplit_list
if nosplit_re is None:
self.nosplit_re = None
else:
self.nosplit_re = re.compile(nosplit_re, re.UNICODE)
# Used only for (semi-)supervised learning
self.annotations = None
def set_corpus_weight_updater(self, corpus_weight):
if corpus_weight is None:
self._corpus_weight_updater = FixedCorpusWeight(1.0)
elif isinstance(corpus_weight, numbers.Number):
self._corpus_weight_updater = FixedCorpusWeight(corpus_weight)
else:
self._corpus_weight_updater = corpus_weight
self._corpus_weight_updater.update(self, 0)
def _check_segment_only(self):
if self._segment_only:
raise SegmentOnlyModelException()
@property
def tokens(self):
"""Return the number of construction tokens."""
return self._corpus_coding.tokens
@property
def types(self):
"""Return the number of construction types."""
return self._corpus_coding.types - 1 # do not include boundary
def _add_compound(self, compound, c):
"""Add compound with count c to data."""
self._corpus_coding.boundaries += c
self._modify_construction_count(compound, c)
oldrc = self._analyses[compound].rcount
self._analyses[compound] = \
self._analyses[compound]._replace(rcount=oldrc + c)
def _remove(self, construction):
"""Remove construction from model."""
rcount, count, splitloc = self._analyses[construction]
self._modify_construction_count(construction, -count)
return rcount, count
def _random_split(self, compound, threshold):
"""Return a random split for compound.
Arguments:
compound: compound to split
threshold: probability of splitting at each position
"""
splitloc = tuple(i for i in range(1, len(compound))
if random.random() < threshold)
return self._splitloc_to_segmentation(compound, splitloc)
def _set_compound_analysis(self, compound, parts, ptype='rbranch'):
"""Set analysis of compound to according to given segmentation.
Arguments:
compound: compound to split
parts: desired constructions of the compound
ptype: type of the parse tree to use
If ptype is 'rbranch', the analysis is stored internally as a
right-branching tree. If ptype is 'flat', the analysis is stored
directly to the compound's node.
"""
if len(parts) == 1:
rcount, count = self._remove(compound)
self._analyses[compound] = ConstrNode(rcount, 0, tuple())
self._modify_construction_count(compound, count)
elif ptype == 'flat':
rcount, count = self._remove(compound)
splitloc = self.segmentation_to_splitloc(parts)
self._analyses[compound] = ConstrNode(rcount, count, splitloc)
for constr in parts:
self._modify_construction_count(constr, count)
elif ptype == 'rbranch':
construction = compound
for p in range(len(parts)):
rcount, count = self._remove(construction)
prefix = parts[p]
if p == len(parts) - 1:
self._analyses[construction] = ConstrNode(rcount, 0,
0)
self._modify_construction_count(construction, count)
else:
suffix = self._join_constructions(parts[p + 1:])
self._analyses[construction] = ConstrNode(rcount, count,
len(prefix))
self._modify_construction_count(prefix, count)
self._modify_construction_count(suffix, count)
construction = suffix
else:
raise MorfessorException("Unknown parse type '%s'" % ptype)
def _update_annotation_choices(self):
"""Update the selection of alternative analyses in annotations.
For semi-supervised models, select the most likely alternative
analyses included in the annotations of the compounds.
"""
if not self._supervised:
return
# Collect constructions from the most probable segmentations
# and add missing compounds also to the unannotated data
constructions = collections.Counter()
for compound, alternatives in self.annotations.items():
if compound not in self._analyses:
self._add_compound(compound, 1)
analysis, cost = self._best_analysis(alternatives)
for m in analysis:
constructions[m] += self._analyses[compound].rcount
# Apply the selected constructions in annotated corpus coding
self._annot_coding.set_constructions(constructions)
for m, f in constructions.items():
count = 0
if m in self._analyses and not self._analyses[m].splitloc:
count = self._analyses[m].count
self._annot_coding.set_count(m, count)
def _best_analysis(self, choices):
"""Select the best analysis out of the given choices."""
bestcost = None
bestanalysis = None
for analysis in choices:
cost = 0.0
for m in analysis:
if m in self._analyses and not self._analyses[m].splitloc:
cost += (math.log(self._corpus_coding.tokens) -
math.log(self._analyses[m].count))
else:
cost -= self.penalty # penalty is negative
if bestcost is None or cost < bestcost:
bestcost = cost
bestanalysis = analysis
return bestanalysis, bestcost
def _force_split(self, compound):
"""Return forced split of the compound."""
if len(self.forcesplit_list) == 0:
return [compound]
clen = len(compound)
j = 0
parts = []
for i in range(0, clen):
if compound[i] in self.forcesplit_list:
if len(compound[j:i]) > 0:
parts.append(compound[j:i])
parts.append(compound[i:i + 1])
j = i + 1
if j < clen:
parts.append(compound[j:])
return [p for p in parts if len(p) > 0]
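# For example, with the default force-split character '-' a hyphenated
# compound is pre-split around the hyphen (illustrative doctest):
#
#     >>> BaselineModel(forcesplit_list=['-'])._force_split('ab-cd')
#     ['ab', '-', 'cd']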
def _test_skip(self, construction):
"""Return true if construction should be skipped."""
if construction in self._counter:
t = self._counter[construction]
if random.random() > 1.0 / max(1, t):
return True
self._counter[construction] += 1
return False
def _viterbi_optimize(self, compound, addcount=0, maxlen=30):
"""Optimize segmentation of the compound using the Viterbi algorithm.
Arguments:
compound: compound to optimize
addcount: constant for additive smoothing of Viterbi probs
maxlen: maximum length for a construction
Returns list of segments.
"""
clen = len(compound)
if clen == 1: # Single atom
return [compound]
if self._use_skips and self._test_skip(compound):
return self.segment(compound)
# Collect forced subsegments
parts = self._force_split(compound)
# Use Viterbi algorithm to optimize the subsegments
constructions = []
for part in parts:
constructions += self.viterbi_segment(part, addcount=addcount,
maxlen=maxlen)[0]
self._set_compound_analysis(compound, constructions, ptype='flat')
return constructions
def _recursive_optimize(self, compound):
"""Optimize segmentation of the compound using recursive splitting.
Returns list of segments.
"""
if len(compound) == 1: # Single atom
return [compound]
if self._use_skips and self._test_skip(compound):
return self.segment(compound)
# Collect forced subsegments
parts = self._force_split(compound)
if len(parts) == 1:
# just one part
return self._recursive_split(compound)
self._set_compound_analysis(compound, parts)
# Use recursive algorithm to optimize the subsegments
constructions = []
for part in parts:
constructions += self._recursive_split(part)
return constructions
def _recursive_split(self, construction):
"""Optimize segmentation of the construction by recursive splitting.
Returns list of segments.
"""
if len(construction) == 1: # Single atom
return [construction]
if self._use_skips and self._test_skip(construction):
return self.segment(construction)
rcount, count = self._remove(construction)
# Check all binary splits and no split
self._modify_construction_count(construction, count)
mincost = self.get_cost()
self._modify_construction_count(construction, -count)
splitloc = 0
for i in range(1, len(construction)):
if (self.nosplit_re and
self.nosplit_re.match(construction[(i - 1):(i + 1)])):
continue
prefix = construction[:i]
suffix = construction[i:]
self._modify_construction_count(prefix, count)
self._modify_construction_count(suffix, count)
cost = self.get_cost()
self._modify_construction_count(prefix, -count)
self._modify_construction_count(suffix, -count)
if cost <= mincost:
mincost = cost
splitloc = i
if splitloc:
# Virtual construction
self._analyses[construction] = ConstrNode(rcount, count,
splitloc)
prefix = construction[:splitloc]
suffix = construction[splitloc:]
self._modify_construction_count(prefix, count)
self._modify_construction_count(suffix, count)
lp = self._recursive_split(prefix)
if suffix != prefix:
return lp + self._recursive_split(suffix)
else:
return lp + lp
else:
# Real construction
self._analyses[construction] = ConstrNode(rcount, 0, tuple())
self._modify_construction_count(construction, count)
return [construction]
def _modify_construction_count(self, construction, dcount):
"""Modify the count of construction by dcount.
For virtual constructions, recurses to child nodes in the
tree. For real constructions, adds/removes construction
to/from the lexicon whenever necessary.
"""
if construction in self._analyses:
rcount, count, splitloc = self._analyses[construction]
else:
rcount, count, splitloc = 0, 0, 0
newcount = count + dcount
if newcount == 0:
del self._analyses[construction]
else:
self._analyses[construction] = ConstrNode(rcount, newcount,
splitloc)
if splitloc:
# Virtual construction
children = self._splitloc_to_segmentation(construction, splitloc)
for child in children:
self._modify_construction_count(child, dcount)
else:
# Real construction
self._corpus_coding.update_count(construction, count, newcount)
if self._supervised:
self._annot_coding.update_count(construction, count, newcount)
if count == 0 and newcount > 0:
self._lexicon_coding.add(construction)
elif count > 0 and newcount == 0:
self._lexicon_coding.remove(construction)
def _epoch_update(self, epoch_num):
"""Do model updates that are necessary between training epochs.
The argument is the number of training epochs finished.
In practice, this does two things:
- If random skipping is in use, reset construction counters.
- If semi-supervised learning is in use and there are alternative
analyses in the annotated data, select the annotations that are
most likely given the model parameters. If not hand-set, update
the weight of the annotated corpus.
This method should also be run prior to training (with the
epoch number argument as 0).
"""
forced_epochs = 0
if self._corpus_weight_updater.update(self, epoch_num):
forced_epochs += 2
if self._use_skips:
self._counter = collections.Counter()
if self._supervised:
self._update_annotation_choices()
self._annot_coding.update_weight()
return forced_epochs
@staticmethod
def segmentation_to_splitloc(constructions):
"""Return a list of split locations for a segmented compound."""
splitloc = []
i = 0
for c in constructions:
i += len(c)
splitloc.append(i)
return tuple(splitloc[:-1])
@staticmethod
def _splitloc_to_segmentation(compound, splitloc):
"""Return segmentation corresponding to the list of split locations."""
if isinstance(splitloc, numbers.Number):
return [compound[:splitloc], compound[splitloc:]]
parts = []
startpos = 0
endpos = 0
for i in range(len(splitloc)):
endpos = splitloc[i]
parts.append(compound[startpos:endpos])
startpos = endpos
parts.append(compound[endpos:])
return parts
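# The two conversions above are inverses of each other; for example:
#
#     >>> BaselineModel.segmentation_to_splitloc(['un', 'do', 'ing'])
#     (2, 4)
#     >>> BaselineModel._splitloc_to_segmentation('undoing', (2, 4))
#     ['un', 'do', 'ing']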
@staticmethod
def _join_constructions(constructions):
"""Append the constructions after each other by addition. Works for
both lists and strings """
result = type(constructions[0])()
for c in constructions:
result += c
return result
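# For example, joining works for atoms stored as characters of a string
# as well as for atoms stored as items of a tuple:
#
#     >>> BaselineModel._join_constructions(['un', 'do', 'ing'])
#     'undoing'
#     >>> BaselineModel._join_constructions([('New',), ('York', 'City')])
#     ('New', 'York', 'City')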
def get_compounds(self):
"""Return the compound types stored by the model."""
self._check_segment_only()
return [w for w, node in self._analyses.items()
if node.rcount > 0]
def get_constructions(self):
"""Return a list of the present constructions and their counts."""
return sorted((c, node.count) for c, node in self._analyses.items()
if not node.splitloc)
def get_cost(self):
"""Return current model encoding cost."""
cost = self._corpus_coding.get_cost() + self._lexicon_coding.get_cost()
if self._supervised:
return cost + self._annot_coding.get_cost()
else:
return cost
def get_segmentations(self):
"""Retrieve segmentations for all compounds encoded by the model."""
self._check_segment_only()
for w in sorted(self._analyses.keys()):
c = self._analyses[w].rcount
if c > 0:
yield c, w, self.segment(w)
def load_data(self, data, freqthreshold=1, count_modifier=None,
init_rand_split=None):
"""Load data to initialize the model for batch training.
Arguments:
data: iterator of (count, compound_atoms) tuples
freqthreshold: discard compounds that occur fewer than
the given number of times in the corpus (default 1)
count_modifier: function for adjusting the counts of each
compound
init_rand_split: if given, randomly split each compound with
init_rand_split as the probability for each
split position
Adds the compounds in the corpus to the model lexicon. Returns
the total cost.
"""
self._check_segment_only()
totalcount = collections.Counter()
for count, atoms in data:
if len(atoms) > 0:
totalcount[atoms] += count
for atoms, count in totalcount.items():
if count < freqthreshold:
continue
if count_modifier is not None:
self._add_compound(atoms, count_modifier(count))
else:
self._add_compound(atoms, count)
if init_rand_split is not None and init_rand_split > 0:
parts = self._random_split(atoms, init_rand_split)
self._set_compound_analysis(atoms, parts)
return self.get_cost()
def load_segmentations(self, segmentations):
"""Load model from existing segmentations.
The argument should be an iterator providing a count, a
compound, and its segmentation.
"""
self._check_segment_only()
for count, compound, segmentation in segmentations:
self._add_compound(compound, count)
self._set_compound_analysis(compound, segmentation)
def set_annotations(self, annotations, annotatedcorpusweight=None):
"""Prepare model for semi-supervised learning with given
annotations.
"""
self._check_segment_only()
self._supervised = True
self.annotations = annotations
self._annot_coding = AnnotatedCorpusEncoding(self._corpus_coding,
weight=
annotatedcorpusweight)
self._annot_coding.boundaries = len(self.annotations)
def segment(self, compound):
"""Segment the compound by looking it up in the model analyses.
Raises KeyError if compound is not present in the training
data. For segmenting new words, use viterbi_segment(compound).
"""
self._check_segment_only()
rcount, count, splitloc = self._analyses[compound]
constructions = []
if splitloc:
for child in self._splitloc_to_segmentation(compound,
splitloc):
constructions += self.segment(child)
else:
constructions.append(compound)
return constructions
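# A hypothetical example of the difference between the two segmentation
# methods (actual outputs depend on the trained model):
#
#     >>> model.segment('words')               # seen in training data
#     ['word', 's']
#     >>> model.viterbi_segment('swords')[0]   # works for unseen words
#     ['sword', 's']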
def train_batch(self, algorithm='recursive', algorithm_params=(),
finish_threshold=0.005, max_epochs=None):
"""Train the model in batch fashion.
The model is trained with the data already loaded into the model (by
using an existing model or calling one of the load_ methods).
In each iteration (epoch) all compounds in the training data are
optimized once, in a random order. If applicable, corpus weight,
annotation cost, and random split counters are recalculated after
each iteration.
Arguments:
algorithm: string in ('recursive', 'viterbi') that indicates
the splitting algorithm used.
algorithm_params: parameters passed to the splitting algorithm.
finish_threshold: the stopping threshold. Training stops when
the improvement of the last iteration is
smaller than finish_threshold * #boundaries
max_epochs: maximum number of epochs to train
"""
epochs = 0
forced_epochs = max(1, self._epoch_update(epochs))
newcost = self.get_cost()
compounds = list(self.get_compounds())
_logger.info("Compounds in training data: %s types / %s tokens",
len(compounds), self._corpus_coding.boundaries)
_logger.info("Starting batch training")
_logger.info("Epochs: %s\tCost: %s", epochs, newcost)
while True:
# One epoch
random.shuffle(compounds)
for w in _progress(compounds):
if algorithm == 'recursive':
segments = self._recursive_optimize(w, *algorithm_params)
elif algorithm == 'viterbi':
segments = self._viterbi_optimize(w, *algorithm_params)
else:
raise MorfessorException("unknown algorithm '%s'" %
algorithm)
_logger.debug("#%s -> %s", w, _constructions_to_str(segments))
epochs += 1
_logger.debug("Cost before epoch update: %s", self.get_cost())
forced_epochs = max(forced_epochs, self._epoch_update(epochs))
oldcost = newcost
newcost = self.get_cost()
_logger.info("Epochs: %s\tCost: %s", epochs, newcost)
if (forced_epochs == 0 and
newcost >= oldcost - finish_threshold *
self._corpus_coding.boundaries):
break
if forced_epochs > 0:
forced_epochs -= 1
if max_epochs is not None and epochs >= max_epochs:
_logger.info("Max number of epochs reached, stop training")
break
_logger.info("Done.")
return epochs, newcost
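# A minimal batch-training sketch (the corpus file name is hypothetical;
# MorfessorIO is defined in morfessor.io):
#
#     from morfessor.io import MorfessorIO
#     model = BaselineModel()
#     model.load_data(MorfessorIO().read_corpus_files(['corpus.txt']))
#     epochs, cost = model.train_batch(max_epochs=10)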
def train_online(self, data, count_modifier=None, epoch_interval=10000,
algorithm='recursive', algorithm_params=(),
init_rand_split=None, max_epochs=None):
"""Train the model in online fashion.
The model is trained with the data provided in the data argument.
For example, the data could come from a generator linked to standard
input, for live monitoring of the splitting.
All compounds from data are only optimized once. After online
training, batch training could be used for further optimization.
Epochs are defined as a fixed number of compounds. After each epoch
(like in batch training), the annotation cost and random split counters
are recalculated if applicable.
Arguments:
data: iterator of (_, compound_atoms) tuples. The first
argument is ignored, as every occurrence of the
compound is taken with count 1
count_modifier: function for adjusting the counts of each
compound
epoch_interval: number of compounds to process before starting
a new epoch
algorithm: string in ('recursive', 'viterbi') that indicates
the splitting algorithm used.
algorithm_params: parameters passed to the splitting algorithm.
init_rand_split: probability of randomly splitting a compound
at any position when initializing the model. None
or 0 means no random splitting.
max_epochs: maximum number of epochs to train
"""
self._check_segment_only()
if count_modifier is not None:
counts = {}
_logger.info("Starting online training")
epochs = 0
i = 0
more_tokens = True
while more_tokens:
self._epoch_update(epochs)
newcost = self.get_cost()
_logger.info("Tokens processed: %s\tCost: %s", i, newcost)
for _ in _progress(range(epoch_interval)):
try:
_, w = next(data)
except StopIteration:
more_tokens = False
break
if len(w) == 0:
# Newline in corpus
continue
if count_modifier is not None:
if w not in counts:
c = 0
counts[w] = 1
addc = 1
else:
c = counts[w]
counts[w] = c + 1
addc = count_modifier(c + 1) - count_modifier(c)
if addc > 0:
self._add_compound(w, addc)
else:
self._add_compound(w, 1)
if init_rand_split is not None and init_rand_split > 0:
parts = self._random_split(w, init_rand_split)
self._set_compound_analysis(w, parts)
if algorithm == 'recursive':
segments = self._recursive_optimize(w, *algorithm_params)
elif algorithm == 'viterbi':
segments = self._viterbi_optimize(w, *algorithm_params)
else:
raise MorfessorException("unknown algorithm '%s'" %
algorithm)
_logger.debug("#%s: %s -> %s", i, w, _constructions_to_str(segments))
i += 1
epochs += 1
if max_epochs is not None and epochs >= max_epochs:
_logger.info("Max number of epochs reached, stop training")
break
self._epoch_update(epochs)
newcost = self.get_cost()
_logger.info("Tokens processed: %s\tCost: %s", i, newcost)
return epochs, newcost
def viterbi_segment(self, compound, addcount=1.0, maxlen=30):
"""Find optimal segmentation using the Viterbi algorithm.
Arguments:
compound: compound to be segmented
addcount: constant for additive smoothing (0 = no smoothing)
maxlen: maximum length for the constructions
If additive smoothing is applied, new complex construction types can
be selected during the search. Without smoothing, only new
single-atom constructions can be selected.
Returns the most probable segmentation and its log-probability.
"""
clen = len(compound)
grid = [(0.0, None)]
if self._corpus_coding.tokens + self._corpus_coding.boundaries + \
addcount > 0:
logtokens = math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries + addcount)
else:
logtokens = 0
badlikelihood = clen * logtokens + 1.0
# Viterbi main loop
for t in range(1, clen + 1):
# Select the best path to current node.
# Note that we can come from any node in history.
bestpath = None
bestcost = None
if self.nosplit_re and t < clen and \
self.nosplit_re.match(compound[(t-1):(t+1)]):
grid.append((clen*badlikelihood, t-1))
continue
for pt in range(max(0, t - maxlen), t):
if grid[pt][0] is None:
continue
cost = grid[pt][0]
construction = compound[pt:t]
if (construction in self._analyses and
not self._analyses[construction].splitloc):
if self._analyses[construction].count <= 0:
raise MorfessorException(
"Construction count of '%s' is %s" %
(construction,
self._analyses[construction].count))
cost += (logtokens -
math.log(self._analyses[construction].count +
addcount))
elif addcount > 0:
if self._corpus_coding.tokens == 0:
cost += (addcount * math.log(addcount) +
self._lexicon_coding.get_codelength(
construction)
/ self._corpus_coding.weight)
else:
cost += (logtokens - math.log(addcount) +
(((self._lexicon_coding.boundaries +
addcount) *
math.log(self._lexicon_coding.boundaries
+ addcount))
- (self._lexicon_coding.boundaries
* math.log(self._lexicon_coding.boundaries))
+ self._lexicon_coding.get_codelength(
construction))
/ self._corpus_coding.weight)
elif len(construction) == 1:
cost += badlikelihood
elif self.nosplit_re:
# Some splits are forbidden, so longer unknown
# constructions have to be allowed
cost += len(construction) * badlikelihood
else:
continue
if bestcost is None or cost < bestcost:
bestcost = cost
bestpath = pt
grid.append((bestcost, bestpath))
constructions = []
cost, path = grid[-1]
lt = clen + 1
while path is not None:
t = path
constructions.append(compound[t:lt])
path = grid[t][1]
lt = t
constructions.reverse()
# Add boundary cost
cost += (math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries) -
math.log(self._corpus_coding.boundaries))
return constructions, cost
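# Example of segmenting a word that was not in the training data
# (the output shown is hypothetical):
#
#     >>> segments, cost = model.viterbi_segment('resegmented')
#     >>> segments
#     ['re', 'segment', 'ed']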
def forward_logprob(self, compound):
"""Find log-probability of a compound using the forward algorithm.
Arguments:
compound: compound to process
Returns the (negative) log-probability of the compound. If the
probability is zero, returns a number that is larger than the
value defined by the penalty attribute of the model object.
"""
clen = len(compound)
grid = [0.0]
if self._corpus_coding.tokens + self._corpus_coding.boundaries > 0:
logtokens = math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries)
else:
logtokens = 0
# Forward main loop
for t in range(1, clen + 1):
# Sum probabilities from all paths to the current node.
# Note that we can come from any node in history.
psum = 0.0
for pt in range(0, t):
cost = grid[pt]
construction = compound[pt:t]
if (construction in self._analyses and
not self._analyses[construction].splitloc):
if self._analyses[construction].count <= 0:
raise MorfessorException(
"Construction count of '%s' is %s" %
(construction,
self._analyses[construction].count))
cost += (logtokens -
math.log(self._analyses[construction].count))
else:
continue
psum += math.exp(-cost)
if psum > 0:
grid.append(-math.log(psum))
else:
grid.append(-self.penalty)
cost = grid[-1]
# Add boundary cost
cost += (math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries) -
math.log(self._corpus_coding.boundaries))
return cost
def viterbi_nbest(self, compound, n, addcount=1.0, maxlen=30):
"""Find top-n optimal segmentations using the Viterbi algorithm.
Arguments:
compound: compound to be segmented
n: how many segmentations to return
addcount: constant for additive smoothing (0 = no smoothing)
maxlen: maximum length for the constructions
If additive smoothing is applied, new complex construction types can
be selected during the search. Without smoothing, only new
single-atom constructions can be selected.
Returns the n most probable segmentations and their
log-probabilities.
"""
clen = len(compound)
grid = [[(0.0, None, None)]]
if self._corpus_coding.tokens + self._corpus_coding.boundaries + \
addcount > 0:
logtokens = math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries + addcount)
else:
logtokens = 0
badlikelihood = clen * logtokens + 1.0
# Viterbi main loop
for t in range(1, clen + 1):
# Select the best path to current node.
# Note that we can come from any node in history.
bestn = []
if self.nosplit_re and t < clen and \
self.nosplit_re.match(compound[(t-1):(t+1)]):
grid.append([(-clen*badlikelihood, t-1, -1)])
continue
for pt in range(max(0, t - maxlen), t):
for k in range(len(grid[pt])):
if grid[pt][k][0] is None:
continue
cost = grid[pt][k][0]
construction = compound[pt:t]
if (construction in self._analyses and
not self._analyses[construction].splitloc):
if self._analyses[construction].count <= 0:
raise MorfessorException(
"Construction count of '%s' is %s" %
(construction,
self._analyses[construction].count))
cost -= (logtokens -
math.log(self._analyses[construction].count +
addcount))
elif addcount > 0:
if self._corpus_coding.tokens == 0:
cost -= (addcount * math.log(addcount) +
self._lexicon_coding.get_codelength(
construction)
/ self._corpus_coding.weight)
else:
cost -= (logtokens - math.log(addcount) +
(((self._lexicon_coding.boundaries +
addcount) *
math.log(self._lexicon_coding.boundaries
+ addcount))
- (self._lexicon_coding.boundaries
* math.log(self._lexicon_coding.
boundaries))
+ self._lexicon_coding.get_codelength(
construction))
/ self._corpus_coding.weight)
elif len(construction) == 1:
cost -= badlikelihood
elif self.nosplit_re:
# Some splits are forbidden, so longer unknown
# constructions have to be allowed
cost -= len(construction) * badlikelihood
else:
continue
if len(bestn) < n:
heapq.heappush(bestn, (cost, pt, k))
else:
heapq.heappushpop(bestn, (cost, pt, k))
grid.append(bestn)
results = []
for k in range(len(grid[-1])):
constructions = []
cost, path, ki = grid[-1][k]
lt = clen + 1
while path is not None:
t = path
constructions.append(compound[t:lt])
path = grid[t][ki][1]
ki = grid[t][ki][2]
lt = t
constructions.reverse()
# Add boundary cost
cost -= (math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries) -
math.log(self._corpus_coding.boundaries))
results.append((-cost, constructions))
return [(constr, cost) for cost, constr in sorted(results)]
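# Example of retrieving the two best analyses with their costs
# (the output shown is hypothetical):
#
#     >>> model.viterbi_nbest('uncarved', 2)
#     [(['un', 'carve', 'd'], 12.3), (['un', 'carved'], 13.1)]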
def get_corpus_coding_weight(self):
return self._corpus_coding.weight
def set_corpus_coding_weight(self, weight):
self._check_segment_only()
self._corpus_coding.weight = weight
def make_segment_only(self):
"""Reduce the size of this model by removing all non-morphs from the
analyses. After calling this method it is not possible anymore to call
any other method that would change the state of the model. Anyway
doing so would throw an exception.
"""
self._segment_only = True
self._analyses = {k: v for (k, v) in self._analyses.items()
if not v.splitloc}
def clear_segmentation(self):
for compound in list(self.get_compounds()):
self._set_compound_analysis(compound, [compound])
class CorpusWeight(object):
@classmethod
def move_direction(cls, model, direction, epoch):
if direction != 0:
weight = model.get_corpus_coding_weight()
if direction > 0:
weight *= 1 + 2.0 / epoch
else:
weight *= 1.0 / (1 + 2.0 / epoch)
model.set_corpus_coding_weight(weight)
_logger.info("Corpus weight set to %s", weight)
return True
return False
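# A worked example of the step size: at epoch 2 a positive direction
# multiplies the weight by 1 + 2.0/2 = 2.0 and a negative direction
# divides by the same factor; the factor shrinks as training progresses
# (2.0 at epoch 2, 1.5 at epoch 4, 1.2 at epoch 10).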
class FixedCorpusWeight(CorpusWeight):
def __init__(self, weight):
self.weight = weight
def update(self, model, _):
model.set_corpus_coding_weight(self.weight)
return False
class AnnotationCorpusWeight(CorpusWeight):
"""Class for using development annotations to update the corpus weight
during batch training
"""
def __init__(self, devel_set, threshold=0.01):
self.data = devel_set
self.threshold = threshold
def update(self, model, epoch):
"""Tune model corpus weight based on the precision and
recall of the development data, trying to keep them equal"""
if epoch < 1:
return False
tmp = self.data.items()
wlist, annotations = zip(*tmp)
segments = [model.viterbi_segment(w)[0] for w in wlist]
d = self._estimate_segmentation_dir(segments, annotations)
return self.move_direction(model, d, epoch)
@classmethod
def _boundary_recall(cls, prediction, reference):
"""Calculate average boundary recall for given segmentations."""
rec_total = 0
rec_sum = 0.0
for pre_list, ref_list in zip(prediction, reference):
best = -1
for ref in ref_list:
# list of internal boundary positions
ref_b = set(BaselineModel.segmentation_to_splitloc(ref))
if len(ref_b) == 0:
best = 1.0
break
for pre in pre_list:
pre_b = set(BaselineModel.segmentation_to_splitloc(pre))
r = len(ref_b.intersection(pre_b)) / float(len(ref_b))
if r > best:
best = r
if best >= 0:
rec_sum += best
rec_total += 1
return rec_sum, rec_total
@classmethod
def _bpr_evaluation(cls, prediction, reference):
"""Return boundary precision, recall, and F-score for segmentations."""
rec_s, rec_t = cls._boundary_recall(prediction, reference)
pre_s, pre_t = cls._boundary_recall(reference, prediction)
rec = rec_s / rec_t
pre = pre_s / pre_t
f = 2.0 * pre * rec / (pre + rec)
return pre, rec, f
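# A worked example: if the prediction is [['seg', 'ment']] and the
# reference alternatives are [['seg', 'ment'], ['segm', 'ent']], the
# predicted boundary set {3} matches the first alternative exactly,
# giving recall 1.0 (and, via the swapped call above, precision 1.0).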
def _estimate_segmentation_dir(self, segments, annotations):
"""Estimate if the given compounds are under- or oversegmented.
The decision is based on the difference between boundary precision
and recall values for the given sample of segmented data.
Arguments:
segments: list of predicted segmentations
annotations: list of reference segmentations
Return 1 in the case of oversegmentation, -1 in the case of
undersegmentation, and 0 if no changes are required.
"""
pre, rec, f = self._bpr_evaluation([[x] for x in segments], annotations)
_logger.info("Boundary evaluation: precision %.4f; recall %.4f", pre, rec)
if abs(pre - rec) < self.threshold:
return 0
elif rec > pre:
return 1
else:
return -1
class MorphLengthCorpusWeight(CorpusWeight):
def __init__(self, morph_length, threshold=0.01):
self.morph_length = morph_length
self.threshold = threshold
def update(self, model, epoch):
if epoch < 1:
return False
cur_length = self.calc_morph_length(model)
_logger.info("Current morph-length: %s", cur_length)
if (abs(self.morph_length - cur_length) / self.morph_length >
self.threshold):
d = abs(self.morph_length - cur_length) / (self.morph_length
- cur_length)
return self.move_direction(model, d, epoch)
return False
@classmethod
def calc_morph_length(cls, model):
total_constructions = 0
total_atoms = 0
for compound in model.get_compounds():
constructions = model.segment(compound)
for construction in constructions:
total_constructions += 1
total_atoms += len(construction)
if total_constructions > 0:
return float(total_atoms) / total_constructions
else:
return 0.0
class NumMorphCorpusWeight(CorpusWeight):
def __init__(self, num_morph_types, threshold=0.01):
self.num_morph_types = num_morph_types
self.threshold = threshold
def update(self, model, epoch):
if epoch < 1:
return False
cur_morph_types = model._lexicon_coding.boundaries
_logger.info("Number of morph types: %s", cur_morph_types)
if (abs(self.num_morph_types - cur_morph_types) / self.num_morph_types
> self.threshold):
d = (abs(self.num_morph_types - cur_morph_types) /
(self.num_morph_types - cur_morph_types))
return self.move_direction(model, d, epoch)
return False
class Encoding(object):
"""Base class for calculating the entropy (encoding length) of a corpus
or lexicon.
Commonly subclassed to redefine specific methods.
"""
def __init__(self, weight=1.0):
"""Initizalize class
Arguments:
weight: weight used for this encoding
"""
self.logtokensum = 0.0
self.tokens = 0
self.boundaries = 0
self.weight = weight
# constant used for speeding up logfactorial calculations with Stirling's
# approximation
_log2pi = math.log(2 * math.pi)
@property
def types(self):
"""Define number of types as 0. types is made a property method to
ensure easy redefinition in subclasses
"""
return 0
@classmethod
def _logfactorial(cls, n):
"""Calculate logarithm of n!.
For large n (n > 20), use Stirling's approximation.
"""
if n < 2:
return 0.0
if n < 20:
return math.log(math.factorial(n))
logn = math.log(n)
return n * logn - n + 0.5 * (logn + cls._log2pi)
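# A quick accuracy check of the approximation (digits truncated):
#
#     >>> import math
#     >>> math.log(math.factorial(25))   # exact
#     58.0036...
#     >>> Encoding._logfactorial(25)     # Stirling's approximation
#     58.0002...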
def frequency_distribution_cost(self):
"""Calculate -log[(u - 1)! (v - u)! / (v - 1)!]
v is the number of tokens+boundaries and u the number of types
"""
if self.types < 2:
return 0.0
tokens = self.tokens + self.boundaries
return (self._logfactorial(tokens - 1) -
self._logfactorial(self.types - 1) -
self._logfactorial(tokens - self.types))
def permutations_cost(self):
"""The permutations cost for the encoding."""
return -self._logfactorial(self.boundaries)
def update_count(self, construction, old_count, new_count):
"""Update the counts in the encoding."""
self.tokens += new_count - old_count
if old_count > 1:
self.logtokensum -= old_count * math.log(old_count)
if new_count > 1:
self.logtokensum += new_count * math.log(new_count)
def get_cost(self):
"""Calculate the cost for encoding the corpus/lexicon"""
if self.boundaries == 0:
return 0.0
n = self.tokens + self.boundaries
return ((n * math.log(n)
- self.boundaries * math.log(self.boundaries)
- self.logtokensum
+ self.permutations_cost()) * self.weight
+ self.frequency_distribution_cost())
class CorpusEncoding(Encoding):
"""Encoding the corpus class
The basic difference to a normal encoding is that the number of types is
not stored directly but fetched from the lexicon encoding. Also does the
cost function not contain any permutation cost.
"""
def __init__(self, lexicon_encoding, weight=1.0):
super(CorpusEncoding, self).__init__(weight)
self.lexicon_encoding = lexicon_encoding
@property
def types(self):
"""Return the number of types of the corpus, which is the same as the
number of boundaries in the lexicon + 1
"""
return self.lexicon_encoding.boundaries + 1
def frequency_distribution_cost(self):
"""Calculate -log[(M - 1)! (N - M)! / (N - 1)!] for M types and N
tokens.
"""
if self.types < 2:
return 0.0
tokens = self.tokens
return (self._logfactorial(tokens - 1) -
self._logfactorial(self.types - 2) -
self._logfactorial(tokens - self.types + 1))
def get_cost(self):
"""Override for the Encoding get_cost function. A corpus does not
have a permutation cost
"""
if self.boundaries == 0:
return 0.0
n = self.tokens + self.boundaries
return ((n * math.log(n)
- self.boundaries * math.log(self.boundaries)
- self.logtokensum) * self.weight
+ self.frequency_distribution_cost())
class AnnotatedCorpusEncoding(Encoding):
"""Encoding the cost of an Annotated Corpus.
In this encoding constructions that are missing are penalized.
"""
def __init__(self, corpus_coding, weight=None, penalty=-9999.9):
"""
Initialize encoding with appropriate meta data
Arguments:
corpus_coding: CorpusEncoding instance used for retrieving the
number of tokens and boundaries in the corpus
weight: The weight of this encoding. If the weight is None,
it is updated automatically to be in balance with the
corpus
penalty: log penalty used for missing constructions
"""
super(AnnotatedCorpusEncoding, self).__init__()
self.do_update_weight = True
self.weight = 1.0
if weight is not None:
self.do_update_weight = False
self.weight = weight
self.corpus_coding = corpus_coding
self.penalty = penalty
self.constructions = collections.Counter()
def set_constructions(self, constructions):
"""Method for re-initializing the constructions. The count of the
constructions must still be set with a call to set_count
"""
self.constructions = constructions
self.tokens = sum(constructions.values())
self.logtokensum = 0.0
def set_count(self, construction, count):
"""Set an initial count for each construction. Missing constructions
are penalized
"""
annot_count = self.constructions[construction]
if count > 0:
self.logtokensum += annot_count * math.log(count)
else:
self.logtokensum += annot_count * self.penalty
def update_count(self, construction, old_count, new_count):
"""Update the counts in the Encoding, setting (or removing) a penalty
for missing constructions
"""
if construction in self.constructions:
annot_count = self.constructions[construction]
if old_count > 0:
self.logtokensum -= annot_count * math.log(old_count)
else:
self.logtokensum -= annot_count * self.penalty
if new_count > 0:
self.logtokensum += annot_count * math.log(new_count)
else:
self.logtokensum += annot_count * self.penalty
def update_weight(self):
"""Update the weight of the Encoding by taking the ratio of the
corpus boundaries and annotated boundaries
"""
if not self.do_update_weight:
return
old = self.weight
self.weight = (self.corpus_coding.weight *
float(self.corpus_coding.boundaries) / self.boundaries)
if self.weight != old:
_logger.info("Corpus weight of annotated data set to %s", self.weight)
def get_cost(self):
"""Return the cost of the Annotation Corpus."""
if self.boundaries == 0:
return 0.0
n = self.tokens + self.boundaries
return ((n * math.log(self.corpus_coding.tokens +
self.corpus_coding.boundaries)
- self.boundaries * math.log(self.corpus_coding.boundaries)
- self.logtokensum) * self.weight)
class LexiconEncoding(Encoding):
"""Class for calculating the encoding cost for the Lexicon"""
def __init__(self):
"""Initialize Lexcion Encoding"""
super(LexiconEncoding, self).__init__()
self.atoms = collections.Counter()
@property
def types(self):
"""Return the number of different atoms in the lexicon + 1 for the
compound-end-token
"""
return len(self.atoms) + 1
def add(self, construction):
"""Add a construction to the lexicon, updating automatically the
count for its atoms
"""
self.boundaries += 1
for atom in construction:
c = self.atoms[atom]
self.atoms[atom] = c + 1
self.update_count(atom, c, c + 1)
def remove(self, construction):
"""Remove construction from the lexicon, updating automatically the
count for its atoms
"""
self.boundaries -= 1
for atom in construction:
c = self.atoms[atom]
self.atoms[atom] = c - 1
self.update_count(atom, c, c - 1)
def get_codelength(self, construction):
"""Return an approximate codelength for new construction."""
l = len(construction) + 1
cost = l * math.log(self.tokens + l)
cost -= math.log(self.boundaries + 1)
for atom in construction:
if atom in self.atoms:
c = self.atoms[atom]
else:
c = 1
cost -= math.log(c)
return cost
================================================
FILE: morfessor/cmd.py
================================================
# -*- coding: utf-8 -*-
import locale
import logging
import math
import random
import os.path
import sys
import time
import string
from . import get_version
from . import utils
from .baseline import BaselineModel, AnnotationCorpusWeight, \
MorphLengthCorpusWeight, NumMorphCorpusWeight, FixedCorpusWeight
from .exception import ArgumentException
from .io import MorfessorIO
from .evaluation import MorfessorEvaluation, EvaluationConfig, \
WilcoxonSignedRank, FORMAT_STRINGS
PY3 = sys.version_info[0] == 3
# _str is used to convert command line arguments to the right type (str for PY3, unicode for PY2)
if PY3:
_str = str
else:
_str = lambda x: unicode(x, encoding=locale.getpreferredencoding())
_logger = logging.getLogger(__name__)
LRU_MAX_SIZE = 1000000
def get_default_argparser():
import argparse
parser = argparse.ArgumentParser(
prog='morfessor.py',
description="""
Morfessor %s
Copyright (c) 2012-2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Command-line arguments:
""" % get_version(),
epilog="""
Simple usage examples (training and testing):
%(prog)s -t training_corpus.txt -s model.pickled
%(prog)s -l model.pickled -T test_corpus.txt -o test_corpus.segmented
Interactive use (read corpus from user):
%(prog)s -m online -v 2 -t -
""",
formatter_class=argparse.RawDescriptionHelpFormatter,
add_help=False)
# Options for input data files
add_arg = parser.add_argument_group('input data files').add_argument
add_arg('-l', '--load', dest="loadfile", default=None, metavar='<file>',
help="load existing model from file (pickled model object)")
add_arg('-L', '--load-segmentation', dest="loadsegfile", default=None,
metavar='<file>',
help="load existing model from segmentation "
"file (Morfessor 1.0 format)")
add_arg('-t', '--traindata', dest='trainfiles', action='append',
default=[], metavar='<file>',
help="input corpus file(s) for training (text or bz2/gzipped text;"
" use '-' for standard input; add several times in order to "
"append multiple files)")
add_arg('-T', '--testdata', dest='testfiles', action='append',
default=[], metavar='<file>',
help="input corpus file(s) to analyze (text or bz2/gzipped text; "
"use '-' for standard input; add several times in order to "
"append multiple files)")
# Options for output data files
add_arg = parser.add_argument_group('output data files').add_argument
add_arg('-o', '--output', dest="outfile", default='-', metavar='<file>',
help="output file for test data results (for standard output, "
"use '-'; default '%(default)s')")
add_arg('-s', '--save', dest="savefile", default=None, metavar='<file>',
help="save final model to file (pickled model object)")
add_arg('-S', '--save-segmentation', dest="savesegfile", default=None,
metavar='<file>',
help="save model segmentations to file (Morfessor 1.0 format)")
add_arg('--save-reduced', dest="savereduced", default=None,
metavar='<file>',
help="save final model to file in reduced form (pickled model "
"object). A model in reduced form can only be used for "
"segmentation of new words.")
add_arg('-x', '--lexicon', dest="lexfile", default=None, metavar='<file>',
help="output final lexicon to given file")
add_arg('--nbest', dest="nbest", default=1, type=int, metavar='<int>',
help="output n-best viterbi results")
# Options for data formats
add_arg = parser.add_argument_group(
'data format options').add_argument
add_arg('-e', '--encoding', dest='encoding', metavar='<encoding>',
help="encoding of input and output files (if none is given, "
"both the local encoding and UTF-8 are tried)")
add_arg('--lowercase', dest="lowercase", default=False,
action='store_true',
help="lowercase input data")
add_arg('--traindata-list', dest="list", default=False,
action='store_true',
help="input file(s) for batch training are lists "
"(one compound per line, optionally count as a prefix)")
add_arg('--atom-separator', dest="separator", type=_str, default=None,
metavar='<regexp>',
help="atom separator regexp (default %(default)s)")
add_arg('--compound-separator', dest="cseparator", type=_str, default=r'\s+',
metavar='<regexp>',
help="compound separator regexp (default '%(default)s')")
add_arg('--analysis-separator', dest='analysisseparator', type=_str,
default=',', metavar='<str>',
help="separator for different analyses in an annotation file. Use"
" NONE for only allowing one analysis per line")
add_arg('--output-format', dest='outputformat', type=_str,
default=r'{analysis}\n', metavar='<format>',
help="format string for --output file (default: '%(default)s'). "
"Valid keywords are: "
"{analysis} = constructions of the compound, "
"{compound} = compound string, "
"{count} = count of the compound (currently always 1), "
"{logprob} = log-probability of the analysis, and "
"{clogprob} = log-probability of the compound. Valid escape "
"sequences are '\\n' (newline) and '\\t' (tabular)")
add_arg('--output-format-separator', dest='outputformatseparator',
type=_str, default=' ', metavar='<str>',
help="construction separator for analysis in --output file "
"(default: '%(default)s')")
add_arg('--output-newlines', dest='outputnewlines', default=False,
action='store_true',
help="for each newline in input, print newline in --output file "
"(default: '%(default)s')")
# Options for model training
add_arg = parser.add_argument_group(
'training and segmentation options').add_argument
add_arg('-m', '--mode', dest="trainmode", default='init+batch',
metavar='<mode>',
choices=['none', 'batch', 'init', 'init+batch', 'online',
'online+batch'],
help="training mode ('none', 'init', 'batch', 'init+batch', "
"'online', or 'online+batch'; default '%(default)s')")
add_arg('-a', '--algorithm', dest="algorithm", default='recursive',
metavar='<algorithm>', choices=['recursive', 'viterbi'],
help="algorithm type ('recursive', 'viterbi'; default "
"'%(default)s')")
add_arg('-d', '--dampening', dest="dampening", type=_str, default='ones',
metavar='<type>', choices=['none', 'log', 'ones'],
help="frequency dampening for training data ('none', 'log', or "
"'ones'; default '%(default)s')")
add_arg('-f', '--forcesplit', dest="forcesplit", type=list, default=['-'],
metavar='<list>',
help="force split on given atoms (default '-'). The list argument "
"is a string of characthers, use '' for no forced splits.")
add_arg('-F', '--finish-threshold', dest='finish_threshold', type=float,
default=0.005, metavar='<float>',
help="Stopping threshold. Training stops when "
"the improvement of the last iteration is"
"smaller then finish_threshold * #boundaries; "
"(default '%(default)s')")
add_arg('-r', '--randseed', dest="randseed", default=None,
metavar='<seed>',
help="seed for random number generator")
add_arg('-R', '--randsplit', dest="splitprob", default=None, type=float,
metavar='<float>',
help="initialize new words by random splitting using the given "
"split probability (default no splitting)")
add_arg('--skips', dest="skips", default=False, action='store_true',
help="use random skips for frequently seen compounds to speed up "
"training")
add_arg('--batch-minfreq', dest="freqthreshold", type=int, default=1,
metavar='<int>',
help="compound frequency threshold for batch training (default "
"%(default)s)")
add_arg('--max-epochs', dest='maxepochs', type=int, default=None,
metavar='<int>',
help='hard maximum of epochs in training')
add_arg('--nosplit-re', dest="nosplit", type=_str, default=None,
metavar='<regexp>',
help="if the expression matches the two surrounding characters, "
"do not allow splitting (default %(default)s)")
add_arg('--online-epochint', dest="epochinterval", type=int,
default=10000, metavar='<int>',
help="epoch interval for online training (default %(default)s)")
add_arg('--viterbi-smoothing', dest="viterbismooth", default=0,
type=float, metavar='<float>',
help="additive smoothing parameter for Viterbi training "
"and segmentation (default %(default)s)")
add_arg('--viterbi-maxlen', dest="viterbimaxlen", default=30,
type=int, metavar='<int>',
help="maximum construction length in Viterbi training "
"and segmentation (default %(default)s)")
# Options for corpusweight tuning
add_arg = parser.add_mutually_exclusive_group().add_argument
add_arg('-D', '--develset', dest="develfile", default=None,
metavar='<file>',
help="load annotated data for tuning the corpus weight parameter")
add_arg('--morph-length', dest='morphlength', default=None, type=float,
metavar='<float>',
help="tune the corpusweight to obtain the desired average morph "
"length")
add_arg('--num-morph-types', dest='morphtypes', default=None, type=float,
metavar='<float>',
help="tune the corpusweight to obtain the desired number of morph "
"types")
# Options for semi-supervised model training
add_arg = parser.add_argument_group(
'semi-supervised training options').add_argument
add_arg('-w', '--corpusweight', dest="corpusweight", type=float,
default=1.0, metavar='<float>',
help="corpus weight parameter (default %(default)s); "
"sets the initial value if other tuning options are used")
add_arg('--weight-threshold', dest='threshold', default=0.01,
metavar='<float>', type=float,
help='percentual stopping threshold for corpusweight updaters')
add_arg('--full-retrain', dest='fullretrain', action='store_true',
default=False,
help='do a full retrain after any weights have converged')
add_arg('-A', '--annotations', dest="annofile", default=None,
metavar='<file>',
help="load annotated data for semi-supervised learning")
add_arg('-W', '--annotationweight', dest="annotationweight",
type=float, default=None, metavar='<float>',
help="corpus weight parameter for annotated data (if unset, the "
"weight is set to balance the number of tokens in annotated "
"and unannotated data sets)")
# Options for evaluation
add_arg = parser.add_argument_group('Evaluation options').add_argument
add_arg('-G', '--goldstandard', dest='goldstandard', default=None,
metavar='<file>',
help='If provided, evaluate the model against the gold standard')
# Options for logging
add_arg = parser.add_argument_group('logging options').add_argument
add_arg('-v', '--verbose', dest="verbose", type=int, default=1,
metavar='<int>',
help="verbose level; controls what is written to the standard "
"error stream or log file (default %(default)s)")
add_arg('--logfile', dest='log_file', metavar='<file>',
help="write log messages to file in addition to standard "
"error stream")
add_arg('--progressbar', dest='progress', default=False,
action='store_true',
help="Force the progressbar to be displayed (possibly lowers the "
"log level for the standard error stream)")
add_arg = parser.add_argument_group('other options').add_argument
add_arg('-h', '--help', action='help',
help="show this help message and exit")
add_arg('--version', action='version',
version='%(prog)s ' + get_version(),
help="show version number and exit")
return parser
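# A minimal sketch of driving training through the same code path as the
# command line scripts (file names are hypothetical):
#
#     parser = get_default_argparser()
#     args = parser.parse_args(['-t', 'corpus.txt', '-S', 'model.segm'])
#     main(args)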
def initialize_logging(args):
"""Initialize loggers based on command line args"""
if args.verbose >= 2:
loglevel = logging.DEBUG
elif args.verbose >= 1:
loglevel = logging.INFO
else:
loglevel = logging.WARNING
rootlogger = logging.getLogger()
rootlogger.setLevel(logging.DEBUG)
logfile_format = '%(asctime)s %(levelname)s:%(message)s'
date_format = '%Y-%m-%d %H:%M:%S'
console_format = '%(message)s'
console_level = loglevel
if args.log_file is not None or (hasattr(args, 'progress') and args.progress):
# If logging to a file or progress bar is forced, make INFO
# the highest level for the error stream
console_level = max(loglevel, logging.INFO)
# Console handler
ch = logging.StreamHandler()
ch.setLevel(console_level)
ch.setFormatter(logging.Formatter(console_format))
rootlogger.addHandler(ch)
# FileHandler for log_file
if args.log_file is not None:
fh = logging.FileHandler(args.log_file, 'w')
fh.setLevel(loglevel)
fh.setFormatter(logging.Formatter(logfile_format, date_format))
rootlogger.addHandler(fh)
return console_level
@utils.lru_cache(maxsize=LRU_MAX_SIZE)
def _viterbi_segment(model, atoms, smooth, maxlen):
return model.viterbi_segment(atoms, smooth, maxlen)
@utils.lru_cache(maxsize=LRU_MAX_SIZE)
def _viterbi_nbest(model, atoms, nbest, smooth, maxlen):
return model.viterbi_nbest(atoms, nbest, smooth,maxlen)
def main(args):
console_level = initialize_logging(args)
# Don't show the progress bar if the console level is not INFO (e.g.
# debug messages are printed to screen) or if stderr is not a tty
# (but a pipe or a file)
if (console_level != logging.INFO or
(hasattr(sys.stderr, 'isatty') and not sys.stderr.isatty())):
utils.show_progress_bar = False
# Force progress bar
if args.progress:
utils.show_progress_bar = True
if (args.loadfile is None and
args.loadsegfile is None and
len(args.trainfiles) == 0):
raise ArgumentException("either model file or training data should "
"be defined")
if args.randseed is not None:
random.seed(args.randseed)
io = MorfessorIO(encoding=args.encoding,
compound_separator=args.cseparator,
atom_separator=args.separator,
lowercase=args.lowercase)
# Load existing model or create a new one
if args.loadfile is not None:
model = io.read_binary_model_file(args.loadfile)
else:
model = BaselineModel(forcesplit_list=args.forcesplit,
corpusweight=args.corpusweight,
use_skips=args.skips,
nosplit_re=args.nosplit)
if args.loadsegfile is not None:
model.load_segmentations(io.read_segmentation_file(args.loadsegfile))
analysis_sep = (args.analysisseparator
if args.analysisseparator != 'NONE' else None)
if args.annofile is not None:
annotations = io.read_annotations_file(args.annofile,
analysis_sep=analysis_sep)
model.set_annotations(annotations, args.annotationweight)
if args.develfile is not None:
develannots = io.read_annotations_file(args.develfile,
analysis_sep=analysis_sep)
updater = AnnotationCorpusWeight(develannots, args.threshold)
model.set_corpus_weight_updater(updater)
if args.morphlength is not None:
updater = MorphLengthCorpusWeight(args.morphlength, args.threshold)
model.set_corpus_weight_updater(updater)
if args.morphtypes is not None:
updater = NumMorphCorpusWeight(args.morphtypes, args.threshold)
model.set_corpus_weight_updater(updater)
start_corpus_weight = model.get_corpus_coding_weight()
# Set frequency dampening function
if args.dampening == 'none':
dampfunc = None
elif args.dampening == 'log':
dampfunc = lambda x: int(round(math.log(x + 1, 2)))
elif args.dampening == 'ones':
dampfunc = lambda x: 1
else:
raise ArgumentException("unknown dampening type '%s'" % args.dampening)
# Set algorithm parameters
if args.algorithm == 'viterbi':
algparams = (args.viterbismooth, args.viterbimaxlen)
else:
algparams = ()
# Train model
if args.trainmode == 'none':
pass
elif args.trainmode == 'batch':
if len(model.get_compounds()) == 0:
_logger.warning("Model contains no compounds for batch training."
" Use 'init+batch' mode to add new data.")
else:
if len(args.trainfiles) > 0:
_logger.warning("Training mode 'batch' ignores new data "
"files. Use 'init+batch' or 'online' to "
"add new compounds.")
ts = time.time()
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold, args.maxepochs)
te = time.time()
_logger.info("Epochs: %s", e)
_logger.info("Final cost: %s", c)
_logger.info("Training time: %.3fs", (te - ts))
elif len(args.trainfiles) > 0:
ts = time.time()
if args.trainmode == 'init':
if args.list:
data = io.read_corpus_list_files(args.trainfiles)
else:
data = io.read_corpus_files(args.trainfiles)
c = model.load_data(data, args.freqthreshold, dampfunc,
args.splitprob)
elif args.trainmode == 'init+batch':
if args.list:
data = io.read_corpus_list_files(args.trainfiles)
else:
data = io.read_corpus_files(args.trainfiles)
c = model.load_data(data, args.freqthreshold, dampfunc,
args.splitprob)
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold, args.maxepochs)
_logger.info("Epochs: %s", e)
if args.fullretrain:
if abs(model.get_corpus_coding_weight() - start_corpus_weight) > 0.1:
model.set_corpus_weight_updater(
FixedCorpusWeight(model.get_corpus_coding_weight()))
model.clear_segmentation()
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold,
args.maxepochs)
_logger.info("Retrain Epochs: %s", e)
elif args.trainmode == 'online':
data = io.read_corpus_files(args.trainfiles)
e, c = model.train_online(data, dampfunc, args.epochinterval,
args.algorithm, algparams,
args.splitprob, args.maxepochs)
_logger.info("Epochs: %s", e)
elif args.trainmode == 'online+batch':
data = io.read_corpus_files(args.trainfiles)
e, c = model.train_online(data, dampfunc, args.epochinterval,
args.algorithm, algparams,
args.splitprob, args.maxepochs)
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold, args.maxepochs - e)
_logger.info("Epochs: %s", e)
if args.fullretrain:
if abs(model.get_corpus_coding_weight() - start_corpus_weight) > 0.1:
model.clear_segmentation()
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold,
args.maxepochs)
_logger.info("Retrain Epochs: %s", e)
else:
raise ArgumentException("unknown training mode '%s'", args.trainmode)
te = time.time()
_logger.info("Final cost: %s", c)
_logger.info("Training time: %.3fs", (te - ts))
else:
_logger.warning("No training data files specified.")
# Save model
if args.savefile is not None:
io.write_binary_model_file(args.savefile, model)
if args.savesegfile is not None:
io.write_segmentation_file(args.savesegfile, model.get_segmentations())
# Output lexicon
if args.lexfile is not None:
io.write_lexicon_file(args.lexfile, model.get_constructions())
if args.savereduced is not None:
model.make_segment_only()
io.write_binary_model_file(args.savereduced, model)
# Segment test data
if len(args.testfiles) > 0:
_logger.info("Segmenting test data...")
outformat = args.outputformat
csep = args.outputformatseparator
outformat = outformat.replace(r"\n", "\n")
outformat = outformat.replace(r"\t", "\t")
keywords = [x[1] for x in string.Formatter().parse(outformat)]
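        # string.Formatter().parse yields (literal_text, field_name,
        # format_spec, conversion) tuples; collecting the field names
        # tells us which optional values (e.g. 'clogprob') the output
        # format actually uses.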
with io._open_text_file_write(args.outfile) as fobj:
testdata = io.read_corpus_files(args.testfiles)
i = 0
for count, atoms in testdata:
if io.atom_separator is None:
compound = "".join(atoms)
else:
compound = io.atom_separator.join(atoms)
if len(atoms) == 0:
# Newline in corpus
if args.outputnewlines:
fobj.write("\n")
continue
if "clogprob" in keywords:
clogprob = model.forward_logprob(atoms)
else:
clogprob = 0
if args.nbest > 1:
nbestlist = _viterbi_nbest(
model, atoms, args.nbest, args.viterbismooth,
args.viterbimaxlen)
for constructions, logp in nbestlist:
analysis = io.format_constructions(constructions,
csep=csep)
fobj.write(outformat.format(analysis=analysis,
compound=compound,
count=count, logprob=logp,
clogprob=clogprob))
else:
constructions, logp = _viterbi_segment(
model, atoms, args.viterbismooth, args.viterbimaxlen)
analysis = io.format_constructions(constructions, csep=csep)
fobj.write(outformat.format(analysis=analysis,
compound=compound,
count=count, logprob=logp,
clogprob=clogprob))
i += 1
if i % 10000 == 0:
sys.stderr.write(".")
sys.stderr.write("\n")
_logger.info("Done.")
if args.goldstandard is not None:
_logger.info("Evaluating Model")
e = MorfessorEvaluation(io.read_annotations_file(args.goldstandard))
result = e.evaluate_model(model, meta_data={'name': 'MODEL'})
print(result.format(FORMAT_STRINGS['default']))
_logger.info("Done")
def get_evaluation_argparser():
import argparse
#TODO factor out redundancies with get_default_argparser()
standard_parser = get_default_argparser()
parser = argparse.ArgumentParser(
prog="morfessor-evaluate",
epilog="""Simple usage example:
%(prog)s gold_standard model1 model2
""",
description=standard_parser.description,
formatter_class=argparse.RawDescriptionHelpFormatter,
add_help=False
)
add_arg = parser.add_argument_group('evaluation options').add_argument
add_arg('--num-samples', dest='numsamples', type=int, metavar='<int>',
default=10, help='number of samples to take for testing')
add_arg('--sample-size', dest='samplesize', type=int, metavar='<int>',
            default=1000, help='size of each test sample')
add_arg = parser.add_argument_group('formatting options').add_argument
add_arg('--format-string', dest='formatstring', metavar='<format>',
            help='Python new-style format string used to report evaluation '
                 'results. The available variables are a value and an '
                 'action separated by an underscore, e.g. fscore_avg for '
                 'the average f-score. The available values are '
                 '"precision", "recall", "fscore" and "samplesize", and '
                 'the available actions: "avg", "max", "min", "values" '
                 'and "count". A final meta-data variable (without an '
                 'action) is "name", the filename of the model. See also '
                 'the format-template option for predefined strings')
add_arg('--format-template', dest='template', metavar='<template>',
default='default',
help='Uses a template string for the format-string options. '
'Available templates are: default, table and latex. '
'If format-string is defined this option is ignored')
add_arg = parser.add_argument_group('file options').add_argument
add_arg('--construction-separator', dest="cseparator", type=_str,
default=' ', metavar='<regexp>',
help="construction separator for test segmentation files"
" (default '%(default)s')")
add_arg('-e', '--encoding', dest='encoding', metavar='<encoding>',
help="encoding of input and output files (if none is given, "
"both the local encoding and UTF-8 are tried)")
add_arg = parser.add_argument_group('logging options').add_argument
add_arg('-v', '--verbose', dest="verbose", type=int, default=1,
metavar='<int>',
help="verbose level; controls what is written to the standard "
"error stream or log file (default %(default)s)")
add_arg('--logfile', dest='log_file', metavar='<file>',
help="write log messages to file in addition to standard "
"error stream")
add_arg = parser.add_argument_group('other options').add_argument
add_arg('-h', '--help', action='help',
help="show this help message and exit")
add_arg('--version', action='version',
version='%(prog)s ' + get_version(),
help="show version number and exit")
add_arg = parser.add_argument
add_arg('goldstandard', metavar='<goldstandard>', nargs=1,
help='gold standard file in standard annotation format')
add_arg('models', metavar='<model>', nargs='+',
help='model files to segment (either binary or Morfessor 1.0 style'
' segmentation models).')
add_arg('-t', '--testsegmentation', dest='test_segmentations', default=[],
action='append',
help='Segmentation of the test set. Note that all words in the '
'gold-standard must be segmented')
return parser
def main_evaluation(args):
""" Separate main for running evaluation and statistical significance
testing. Takes as argument the results of an get_evaluation_argparser()
"""
initialize_logging(args)
io = MorfessorIO(encoding=args.encoding)
ev = MorfessorEvaluation(io.read_annotations_file(args.goldstandard[0]))
results = []
sample_size = args.samplesize
num_samples = args.numsamples
f_string = args.formatstring
if f_string is None:
f_string = FORMAT_STRINGS[args.template]
for f in args.models:
result = ev.evaluate_model(io.read_any_model(f),
configuration=EvaluationConfig(num_samples,
sample_size),
meta_data={'name': os.path.basename(f)})
results.append(result)
print(result.format(f_string))
io.construction_separator = args.cseparator
for f in args.test_segmentations:
segmentation = io.read_segmentation_file(f, False)
result = ev.evaluate_segmentation(segmentation,
configuration=
EvaluationConfig(num_samples,
sample_size),
meta_data={'name':
os.path.basename(f)})
results.append(result)
print(result.format(f_string))
if len(results) > 1 and num_samples > 1:
wsr = WilcoxonSignedRank()
r = wsr.significance_test(results)
WilcoxonSignedRank.print_table(r)
================================================
FILE: morfessor/evaluation.py
================================================
from __future__ import print_function
import collections
import logging
from itertools import chain, product
import math
import random
_logger = logging.getLogger(__name__)
EvaluationConfig = collections.namedtuple('EvaluationConfig',
['num_samples', 'sample_size'])
FORMAT_STRINGS = {
'default': """Filename : {name}
Num samples: {samplesize_count}
Sample size: {samplesize_avg}
F-score : {fscore_avg:.3}
Precision : {precision_avg:.3}
Recall : {recall_avg:.3}""",
'table': "{name:10} {precision_avg:6.3} {recall_avg:6.3} {fscore_avg:6.3}",
'latex': "{name} & {precision_avg:.3} &"
" {recall_avg:.3} & {fscore_avg:.3} \\\\"}
def _sample(compound_list, size, seed):
"""Create a specific size sample from the compound list using a specific
seed"""
return random.Random(seed).sample(compound_list, size)
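# The fixed seed makes sampling reproducible: _create_samples below uses
# the sample index as the seed, so sample i is identical for every
# evaluated model, as required for pairwise significance testing.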
class MorfessorEvaluationResult(object):
"""A MorfessorEvaluationResult is returned by a MorfessorEvaluation
    object. Its purpose is to store the evaluation data and provide nice
formatting options.
Each MorfessorEvaluationResult contains the data of 1 evaluation
(which can have multiple samples).
"""
print_functions = {'avg': lambda x: sum(x) / len(x),
'min': min,
'max': max,
'values': list,
'count': len}
    #TODO maybe add std as a print function?
def __init__(self, meta_data=None):
self.meta_data = meta_data
self.precision = []
self.recall = []
self.fscore = []
self.samplesize = []
self._cache = None
def __getitem__(self, item):
"""Provide dict style interface for all values (standard values and
metadata)"""
if self._cache is None:
self._fill_cache()
return self._cache[item]
def add_data_point(self, precision, recall, f_score, sample_size):
"""Method used by MorfessorEvaluation to add the results of a single
sample to the object"""
self.precision.append(precision)
self.recall.append(recall)
self.fscore.append(f_score)
self.samplesize.append(sample_size)
#clear cache
self._cache = None
def __str__(self):
"""Method for default visualization"""
return self.format(FORMAT_STRINGS['default'])
def _fill_cache(self):
""" Pre calculate all variable / function combinations and put them in
cache"""
self._cache = {'{}_{}'.format(val, func_name): func(getattr(self, val))
for val in ('precision', 'recall', 'fscore',
'samplesize')
for func_name, func in self.print_functions.items()}
self._cache.update(self.meta_data)
def _get_cache(self):
""" Fill the cache (if necessary) and return it"""
if self._cache is None:
self._fill_cache()
return self._cache
def format(self, format_string):
""" Format this object. The format string can contain all variables,
e.g. fscore_avg, precision_values or any item from metadata"""
return format_string.format(**self._get_cache())
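# Illustrative sketch of the <value>_<action> keys described above; the
# numbers and the format string are made up for demonstration.
def _example_result_formatting():
    mer = MorfessorEvaluationResult({'name': 'demo'})
    mer.add_data_point(0.80, 0.70, 0.747, 1000)
    mer.add_data_point(0.82, 0.72, 0.767, 1000)
    # Returns "demo: P=0.81 R=0.71 F=0.757 (2 samples)"
    return mer.format('{name}: P={precision_avg:.3} R={recall_avg:.3} '
                      'F={fscore_avg:.3} ({samplesize_count} samples)')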
class MorfessorEvaluation(object):
""" Do the evaluation of one model, on one testset. The basic procedure is
to create, in a stable manner, a number of samples and evaluate them
independently. The stable selection of samples makes it possible to use
the resulting values for Pair-wise statistical significance testing.
reference_annotations is a standard annotation dictionary:
{compound => ([annoation1],.. ) }
"""
def __init__(self, reference_annotations):
self.reference = {}
for compound, analyses in reference_annotations.items():
self.reference[compound] = list(
tuple(self._segmentation_indices(a)) for a in analyses)
self._samples = {}
def _create_samples(self, configuration=EvaluationConfig(10, 1000)):
"""Create, in a stable manner, n testsets of size x as defined in
test_configuration
"""
#TODO: What is a reasonable limit to warn about a too small testset?
if len(self.reference) < (configuration.num_samples *
configuration.sample_size):
_logger.warning("The test set is too small for this sample size")
compound_list = sorted(self.reference.keys())
self._samples[configuration] = [
_sample(compound_list, configuration.sample_size, i) for i in
range(configuration.num_samples)]
def get_samples(self, configuration=EvaluationConfig(10, 1000)):
"""Get a list of samples. A sample is a list of compounds.
This method is stable, so each time it is called with a specific
test_set and configuration it will return the same samples. Also this
method caches the samples in the _samples variable.
"""
if configuration not in self._samples:
self._create_samples(configuration)
return self._samples[configuration]
def _evaluate(self, prediction):
"""Helper method to get the precision and recall of 1 sample"""
def calc_prop_distance(ref, pred):
if len(ref) == 0:
return 1.0
diff = len(set(ref) - set(pred))
return (len(ref) - diff) / float(len(ref))
wordlist = sorted(set(prediction.keys()) & set(self.reference.keys()))
recall_sum = 0.0
precis_sum = 0.0
for word in wordlist:
if len(word) < 2:
continue
recall_sum += max(calc_prop_distance(r, p)
for p, r in product(prediction[word],
self.reference[word]))
precis_sum += max(calc_prop_distance(p, r)
for p, r in product(prediction[word],
self.reference[word]))
precision = precis_sum / len(wordlist)
recall = recall_sum / len(wordlist)
f_score = 2.0 / (1.0 / precision + 1.0 / recall)
return precision, recall, f_score, len(wordlist)
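    # Worked example: with reference boundaries (3, 7) for 'segmentation'
    # and a prediction with the single boundary (3,), the recall
    # contribution is (2 - 1) / 2 = 0.5 and the precision contribution
    # (1 - 0) / 1 = 1.0; for a one-word sample this gives
    # F = 2 / (1/1.0 + 1/0.5) = 2/3.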
@staticmethod
def _segmentation_indices(annotation):
"""Method to transform a annotation into a tuple of split indices"""
cur_len = 0
for a in annotation[:-1]:
cur_len += len(a)
yield cur_len
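    # For example, the annotation ['seg', 'ment', 'ation'] yields the
    # split indices 3 and 7, the morph boundaries within 'segmentation'.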
def evaluate_model(self, model, configuration=EvaluationConfig(10, 1000),
meta_data=None):
"""Get the prediction of the test samples from the model and do the
        evaluation.
        The meta_data object should preferably contain at least the key
        'name'.
"""
if meta_data is None:
meta_data = {'name': 'UNKNOWN'}
mer = MorfessorEvaluationResult(meta_data)
for i, sample in enumerate(self.get_samples(configuration)):
_logger.debug("Evaluating sample %s", i)
prediction = {}
for compound in sample:
prediction[compound] = [tuple(self._segmentation_indices(
model.viterbi_segment(compound)[0]))]
mer.add_data_point(*self._evaluate(prediction))
return mer
def evaluate_segmentation(self, segmentation,
configuration=EvaluationConfig(10, 1000),
meta_data=None):
"""Method for evaluating an existing segmentation"""
def merge_constructions(constructions):
compound = constructions[0]
for i in range(1, len(constructions)):
compound = compound + constructions[i]
return compound
segmentation = {merge_constructions(x[1]):
[tuple(self._segmentation_indices(x[1]))]
for x in segmentation}
if meta_data is None:
meta_data = {'name': 'UNKNOWN'}
mer = MorfessorEvaluationResult(meta_data)
for i, sample in enumerate(self.get_samples(configuration)):
_logger.debug("Evaluating sample %s", i)
prediction = {k: v for k, v in segmentation.items() if k in sample}
mer.add_data_point(*self._evaluate(prediction))
return mer
class WilcoxonSignedRank(object):
"""Class for doing statistical signficance testing with the Wilcoxon
Signed-Rank test
It implements the Pratt method for handling zero-differences and
applies a 0.5 continuity correction for the z-statistic.
"""
@staticmethod
def _wilcoxon(d, method='pratt', correction=True):
if method not in ('wilcox', 'pratt'):
raise ValueError
if method == 'wilcox':
d = list(filter(lambda a: a != 0, d))
count = len(d)
ranks = WilcoxonSignedRank._rankdata([abs(v) for v in d])
rank_sum_pos = sum(r for r, v in zip(ranks, d) if v > 0)
rank_sum_neg = sum(r for r, v in zip(ranks, d) if v < 0)
test = min(rank_sum_neg, rank_sum_pos)
mean = count * (count + 1) * 0.25
stdev = (count*(count + 1) * (2 * count + 1))
# compensate for duplicate ranks
no_zero_ranks = [r for i, r in enumerate(ranks) if d[i] != 0]
stdev -= 0.5 * sum(x * (x*x-1) for x in
collections.Counter(no_zero_ranks).values())
stdev = math.sqrt(stdev / 24.0)
if correction:
correction = +0.5 if test > mean else -0.5
else:
correction = 0
z = (test - mean - correction) / stdev
return 2 * WilcoxonSignedRank._norm_cum_pdf(abs(z))
@staticmethod
def _rankdata(d):
od = collections.Counter()
for v in d:
od[v] += 1
rank_dict = {}
cur_rank = 1
for val, count in sorted(od.items(), key=lambda x: x[0]):
            # float division: average ranks must not be floored on Python 2
            rank_dict[val] = (cur_rank + (cur_rank + count - 1)) / 2.0
cur_rank += count
return [rank_dict[v] for v in d]
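    # Tied values get their average rank, e.g.
    # _rankdata([1, 2, 2, 3]) == [1.0, 2.5, 2.5, 4.0].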
@staticmethod
def _norm_cum_pdf(z):
"""Pure python implementation of the normal cumulative pdf function"""
return 0.5 - 0.5 * math.erf(z / math.sqrt(2))
def significance_test(self, evaluations, val_property='fscore_values',
name_property='name'):
"""Takes a set of evaluations (which should have the same
test-configuration) and calculates the p-value for the Wilcoxon signed
rank test
Returns a dictionary with (name1,name2) keys and p-values as values.
"""
results = {r[name_property]: r[val_property] for r in evaluations}
if any(len(x) < 10 for x in results.values()):
_logger.error("Too small number of samples for the Wilcoxon test")
return {}
p = {}
for r1, r2 in product(results.keys(), results.keys()):
p[(r1, r2)] = self._wilcoxon([v1-v2
for v1, v2 in zip(results[r1],
results[r2])])
return p
@staticmethod
def print_table(results):
"""Nicely format a results table as returned by significance_test"""
names = sorted(set(r[0] for r in results.keys()))
col_width = max(max(len(n) for n in names), 5)
for h in chain([""], names):
print('{:{width}}'.format(h, width=col_width), end='|')
print()
for name in names:
print('{:{width}}'.format(name, width=col_width), end='|')
for name2 in names:
print('{:{width}.5}'.format(results[(name, name2)],
width=col_width), end='|')
print()
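# Illustrative sketch of running the significance test on two
# hypothetical result sets. Plain dicts stand in for
# MorfessorEvaluationResult objects here, since significance_test only
# needs item access; the scores are made up.
def _example_significance_test():
    a = {'name': 'model_a',
         'fscore_values': [0.61, 0.63, 0.62, 0.60, 0.64,
                           0.62, 0.61, 0.63, 0.62, 0.63]}
    b = {'name': 'model_b',
         'fscore_values': [0.58, 0.60, 0.59, 0.57, 0.61,
                           0.59, 0.58, 0.60, 0.59, 0.60]}
    p_values = WilcoxonSignedRank().significance_test([a, b])
    WilcoxonSignedRank.print_table(p_values)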
================================================
FILE: morfessor/exception.py
================================================
from __future__ import unicode_literals
class MorfessorException(Exception):
"""Base class for exceptions in this module."""
pass
class ArgumentException(Exception):
"""Exception in command line argument parsing."""
pass
class InvalidCategoryError(MorfessorException):
    """Attempt to load data using a different categorization scheme."""
    def __init__(self, category):
        super(InvalidCategoryError, self).__init__(
            'This model does not recognize the category {}'.format(
                category))
class InvalidOperationError(MorfessorException):
    def __init__(self, operation, function_name):
        super(InvalidOperationError, self).__init__(
            ('This model does not have a method {}, and therefore cannot'
             ' perform operation "{}"').format(function_name, operation))
class UnsupportedConfigurationError(MorfessorException):
    def __init__(self, reason):
        super(UnsupportedConfigurationError, self).__init__(
            ('This operation is not supported in this program ' +
             'configuration. Reason: {}.').format(reason))
class SegmentOnlyModelException(MorfessorException):
    def __init__(self):
        super(SegmentOnlyModelException, self).__init__(
            'This model has been reduced to a segment-only model')
================================================
FILE: morfessor/io.py
================================================
import bz2
import codecs
import datetime
import gzip
import locale
import logging
import re
import sys
from . import get_version
from . import utils
try:
# In Python2 import cPickle for better performance
import cPickle as pickle
except ImportError:
import pickle
PY3 = sys.version_info[0] == 3
_logger = logging.getLogger(__name__)
class MorfessorIO(object):
"""Definition for all input and output files. Also handles all
encoding issues.
The only state this class has is the separators used in the data.
Therefore, the same class instance can be used for initializing multiple
files.
"""
def __init__(self, encoding=None, construction_separator=' + ',
comment_start='#', compound_separator=r'\s+',
atom_separator=None, lowercase=False):
self.encoding = encoding
self.construction_separator = construction_separator
self.comment_start = comment_start
self.compound_sep_re = re.compile(compound_separator, re.UNICODE)
self.atom_separator = atom_separator
if atom_separator is not None:
self._atom_sep_re = re.compile(atom_separator, re.UNICODE)
self.lowercase = lowercase
self._version = get_version()
def read_segmentation_file(self, file_name, has_counts=True, **kwargs):
"""Read segmentation file.
File format:
<count> <construction1><sep><construction2><sep>...<constructionN>
"""
_logger.info("Reading segmentations from '%s'...", file_name)
for line in self._read_text_file(file_name):
if has_counts:
count, compound_str = line.split(' ', 1)
else:
count, compound_str = 1, line
constructions = tuple(
self._split_atoms(constr)
for constr in compound_str.split(self.construction_separator))
if self.atom_separator is None:
compound = "".join(constructions)
else:
compound = tuple(atom for constr in constructions
for atom in constr)
yield int(count), compound, constructions
_logger.info("Done.")
def write_segmentation_file(self, file_name, segmentations, **kwargs):
"""Write segmentation file.
File format:
<count> <construction1><sep><construction2><sep>...<constructionN>
"""
_logger.info("Saving segmentations to '%s'...", file_name)
with self._open_text_file_write(file_name) as file_obj:
d = datetime.datetime.now().replace(microsecond=0)
file_obj.write("# Output from Morfessor Baseline %s, %s\n" %
(self._version, d))
for count, _, segmentation in segmentations:
if self.atom_separator is None:
s = self.construction_separator.join(segmentation)
else:
s = self.construction_separator.join(
(self.atom_separator.join(constr)
for constr in segmentation))
file_obj.write("%d %s\n" % (count, s))
_logger.info("Done.")
def read_corpus_files(self, file_names):
"""Read one or more corpus files.
Yield for each compound found (1, compound_atoms).
"""
for file_name in file_names:
for item in self.read_corpus_file(file_name):
yield item
def read_corpus_list_files(self, file_names):
"""Read one or more corpus list files.
Yield for each compound found (count, compound_atoms).
"""
for file_name in file_names:
for item in self.read_corpus_list_file(file_name):
yield item
def read_corpus_file(self, file_name):
"""Read one corpus file.
For each compound, yield (1, compound_atoms).
After each line, yield (0, ()).
"""
_logger.info("Reading corpus from '%s'...", file_name)
for line in self._read_text_file(file_name, raw=True):
for compound in self.compound_sep_re.split(line):
if len(compound) > 0:
yield 1, self._split_atoms(compound)
yield 0, ()
_logger.info("Done.")
def read_corpus_list_file(self, file_name):
"""Read a corpus list file.
Each line has the format:
<count> <compound>
Yield tuples (count, compound_atoms) for each compound.
"""
_logger.info("Reading corpus from list '%s'...", file_name)
for line in self._read_text_file(file_name):
try:
count, compound = line.split(None, 1)
yield int(count), self._split_atoms(compound)
except ValueError:
yield 1, self._split_atoms(line)
_logger.info("Done.")
def read_annotations_file(self, file_name, construction_separator=' ',
analysis_sep=','):
"""Read a annotations file.
Each line has the format:
<compound> <constr1> <constr2>... <constrN>, <constr1>...<constrN>, ...
Yield tuples (compound, list(analyses)).
"""
annotations = {}
_logger.info("Reading annotations from '%s'...", file_name)
for line in self._read_text_file(file_name):
compound, analyses_line = line.split(None, 1)
if compound not in annotations:
annotations[compound] = []
if analysis_sep is not None:
for analysis in analyses_line.split(analysis_sep):
analysis = analysis.strip()
annotations[compound].append(
analysis.strip().split(construction_separator))
else:
annotations[compound].append(
analyses_line.split(construction_separator))
_logger.info("Done.")
return annotations
def write_lexicon_file(self, file_name, lexicon):
"""Write to a Lexicon file all constructions and their counts."""
_logger.info("Saving model lexicon to '%s'...", file_name)
with self._open_text_file_write(file_name) as file_obj:
for construction, count in lexicon:
file_obj.write("%d %s\n" % (count, construction))
_logger.info("Done.")
def read_binary_model_file(self, file_name):
"""Read a pickled model from file."""
_logger.info("Loading model from '%s'...", file_name)
model = self.read_binary_file(file_name)
_logger.info("Done.")
return model
@staticmethod
def read_binary_file(file_name):
"""Read a pickled object from a file."""
with open(file_name, 'rb') as fobj:
obj = pickle.load(fobj)
return obj
def write_binary_model_file(self, file_name, model):
"""Pickle a model to a file."""
_logger.info("Saving model to '%s'...", file_name)
self.write_binary_file(file_name, model)
_logger.info("Done.")
@staticmethod
def write_binary_file(file_name, obj):
"""Pickle an object into a file."""
with open(file_name, 'wb') as fobj:
pickle.dump(obj, fobj, pickle.HIGHEST_PROTOCOL)
def write_parameter_file(self, file_name, params):
"""Write learned or estimated parameters to a file"""
with self._open_text_file_write(file_name) as file_obj:
d = datetime.datetime.now().replace(microsecond=0)
file_obj.write(
'# Parameters for Morfessor {}, {}\n'.format(
self._version, d))
for (key, val) in params.items():
file_obj.write('{}:\t{}\n'.format(key, val))
def read_parameter_file(self, file_name):
"""Read learned or estimated parameters from a file"""
params = {}
line_re = re.compile(r'^(.*)\s*:\s*(.*)$')
for line in self._read_text_file(file_name):
m = line_re.match(line.rstrip())
if m:
key = m.group(1)
val = m.group(2)
try:
val = float(val)
except ValueError:
pass
params[key] = val
return params
def read_any_model(self, file_name):
"""Read a file that is either a binary model or a Morfessor 1.0 style
        model segmentation. This method cannot be used on standard input,
        as the data might need to be read multiple times"""
try:
model = self.read_binary_model_file(file_name)
_logger.info("%s was read as a binary model", file_name)
return model
except BaseException:
pass
from morfessor import BaselineModel
model = BaselineModel()
model.load_segmentations(self.read_segmentation_file(file_name))
_logger.info("%s was read as a segmentation", file_name)
return model
def format_constructions(self, constructions, csep=None, atom_sep=None):
"""Return a formatted string for a list of constructions."""
if csep is None:
csep = self.construction_separator
if atom_sep is None:
atom_sep = self.atom_separator
if utils._is_string(constructions[0]):
# Constructions are strings
return csep.join(constructions)
else:
# Constructions are not strings (should be tuples of strings)
return csep.join(map(lambda x: atom_sep.join(x), constructions))
def _split_atoms(self, construction):
"""Split construction to its atoms."""
if self.atom_separator is None:
return construction
else:
return tuple(self._atom_sep_re.split(construction))
def _open_text_file_write(self, file_name):
"""Open a file for writing with the appropriate compression/encoding"""
if file_name == '-':
file_obj = sys.stdout
if PY3:
return file_obj
elif file_name.endswith('.gz'):
file_obj = gzip.open(file_name, 'wb')
elif file_name.endswith('.bz2'):
file_obj = bz2.BZ2File(file_name, 'wb')
else:
file_obj = open(file_name, 'wb')
if self.encoding is None:
# Take encoding from locale if not set so far
self.encoding = locale.getpreferredencoding()
return codecs.getwriter(self.encoding)(file_obj)
def _open_text_file_read(self, file_name):
"""Open a file for reading with the appropriate compression/encoding"""
if file_name == '-':
if PY3:
inp = sys.stdin
else:
class StdinUnicodeReader:
def __init__(self, encoding):
self.encoding = encoding
if self.encoding is None:
self.encoding = locale.getpreferredencoding()
def __iter__(self):
return self
def next(self):
l = sys.stdin.readline()
if not l:
raise StopIteration()
return l.decode(self.encoding)
inp = StdinUnicodeReader(self.encoding)
else:
if file_name.endswith('.gz'):
file_obj = gzip.open(file_name, 'rb')
elif file_name.endswith('.bz2'):
file_obj = bz2.BZ2File(file_name, 'rb')
else:
file_obj = open(file_name, 'rb')
if self.encoding is None:
# Try to determine encoding if not set so far
self.encoding = self._find_encoding(file_name)
inp = codecs.getreader(self.encoding)(file_obj)
return inp
def _read_text_file(self, file_name, raw=False):
"""Read a text file with the appropriate compression and encoding.
Comments and empty lines are skipped unless raw is True.
"""
inp = self._open_text_file_read(file_name)
try:
for line in inp:
line = line.rstrip()
if not raw and \
(len(line) == 0 or line.startswith(self.comment_start)):
continue
if self.lowercase:
yield line.lower()
else:
yield line
except KeyboardInterrupt:
if file_name == '-':
_logger.info("Finished reading from stdin")
return
else:
raise
@staticmethod
def _find_encoding(*files):
"""Test default encodings on reading files.
If no encoding is given, this method can be used to test which
of the default encodings would work.
"""
test_encodings = ['utf-8', locale.getpreferredencoding()]
for encoding in test_encodings:
ok = True
for f in files:
if f == '-':
continue
try:
if f.endswith('.gz'):
file_obj = gzip.open(f, 'rb')
elif f.endswith('.bz2'):
file_obj = bz2.BZ2File(f, 'rb')
else:
file_obj = open(f, 'rb')
for _ in codecs.getreader(encoding)(file_obj):
pass
except UnicodeDecodeError:
ok = False
break
if ok:
_logger.info("Detected %s encoding", encoding)
return encoding
raise UnicodeError("Can not determine encoding of input files")
================================================
FILE: morfessor/test/__init__.py
================================================
__author__ = 'psmit'
================================================
FILE: morfessor/test/evaluation.py
================================================
import unittest
import itertools
from morfessor.evaluation import WilcoxonSignedRank
class TestWilcoxon(unittest.TestCase):
def setUp(self):
self.obj = WilcoxonSignedRank()
def test_norm_cum_pdf(self):
self.assertAlmostEqual(self.obj._norm_cum_pdf(1.9599639845400), 0.025)
def test_accuracy_wilcoxon(self):
#Same tests as used for scipy.stats.morestats
freq = [1, 4, 16, 15, 8, 4, 5, 1, 2]
nums = range(-4, 5)
x = list(itertools.chain(*[[u] * v for u, v in zip(nums, freq)]))
self.assertEqual(len(x), 56)
p = self.obj._wilcoxon(x, correction=False)
self.assertAlmostEqual(p, 0.00197547303533107)
p = self.obj._wilcoxon(x, "wilcox", correction=False)
self.assertAlmostEqual(p, 0.00641346115861)
x = [120, 114, 181, 188, 180, 146, 121, 191, 132, 113, 127, 112]
y = [133, 143, 119, 189, 112, 199, 198, 113, 115, 121, 142, 187]
p = self.obj._wilcoxon([(a - b) for a, b in zip(x, y)])
self.assertAlmostEqual(p, 0.7240817)
p = self.obj._wilcoxon([(a - b) for a, b in zip(x, y)], correction=False)
self.assertAlmostEqual(p, 0.6948866)
def test_wilcoxon_tie(self):
#Same tests as used for scipy.stats.morestats
p = self.obj._wilcoxon([0.1] * 10, correction=False)
self.assertAlmostEqual(p, 0.001565402)
p = self.obj._wilcoxon([0.1] * 10)
self.assertAlmostEqual(p, 0.001904195)
if __name__ == '__main__':
unittest.main()
================================================
FILE: morfessor/utils.py
================================================
"""Data structures and functions of general utility,
shared between different modules and variants of the software.
"""
import logging
import math
import random
import sys
import types
PY3 = sys.version_info[0] == 3
def _dummy_lru_cache(*args, **kwargs):
return lambda func: func
if PY3:
from functools import lru_cache
else:
try:
# Backport for lru_cache
from backports.functools_lru_cache import lru_cache
except ImportError:
logging.warning(
"LRU cache disabled, install backports.functools_lru_cache to enable.")
lru_cache = _dummy_lru_cache
LOGPROB_ZERO = 1000000
# Progress bar for generators (length unknown):
# Print a dot for every GENERATOR_DOT_FREQ:th dot.
# Set to <= 0 to disable progress bar.
GENERATOR_DOT_FREQ = 500
show_progress_bar = True
def _progress(iter_func):
"""Decorator/function for displaying a progress bar when iterating
through a list.
    iter_func can be either a function providing an iterator (for
    decorator-style use) or an iterator itself.
No progressbar is displayed when the show_progress_bar variable is set to
false.
If the progressbar module is available a fancy percentage style
progressbar is displayed. Otherwise 60 dots are printed as indicator.
"""
if not show_progress_bar:
return iter_func
    # Check whether the progressbar module is available; else use our own
try:
from progressbar import ProgressBar
except ImportError:
class SimpleProgressBar:
"""Create a simple progress bar that prints 60 dots on a single
line, proportional to the progress """
NUM_DOTS = 60
def __init__(self):
self.it = None
self.dotfreq = 100
self.i = 0
def __call__(self, it):
self.it = iter(it)
self.i = 0
# Dot frequency is determined as ceil(len(it) / NUM_DOTS)
self.dotfreq = (len(it) + self.NUM_DOTS - 1) // self.NUM_DOTS
if self.dotfreq < 1:
self.dotfreq = 1
return self
def __iter__(self):
return self
def __next__(self):
self.i += 1
if self.i % self.dotfreq == 0:
sys.stderr.write('.')
sys.stderr.flush()
try:
return next(self.it)
except StopIteration:
sys.stderr.write('\n')
raise
#Needed to be compatible with both Python2 and 3
next = __next__
ProgressBar = SimpleProgressBar
# In case of a decorator (argument is a function),
# wrap the functions result in a ProgressBar and return the new function
if isinstance(iter_func, types.FunctionType):
def i(*args, **kwargs):
if logging.getLogger(__name__).isEnabledFor(logging.INFO):
return ProgressBar()(iter_func(*args, **kwargs))
else:
return iter_func(*args, **kwargs)
return i
#In case of an iterator, wrap it in a ProgressBar and return it.
elif hasattr(iter_func, '__iter__'):
return ProgressBar()(iter_func)
#If all else fails, just return the original.
return iter_func
class Sparse(dict):
"""A defaultdict-like data structure, which tries to remain as sparse
    as possible. If a value becomes equal to the default value, it and
    its associated key are transparently removed.
Only supports immutable values, e.g. namedtuples.
"""
def __init__(self, *pargs, **kwargs):
"""Create a new Sparse datastructure.
Keyword arguments:
default: Default value. Unlike defaultdict this should be a
prototype immutable, not a factory.
"""
self._default = kwargs.pop('default')
dict.__init__(self, *pargs, **kwargs)
def __getitem__(self, key):
try:
return dict.__getitem__(self, key)
except KeyError:
return self._default
def __setitem__(self, key, value):
# attribute check is necessary for unpickling
        if '_default' in self.__dict__ and value == self._default:
if key in self:
del self[key]
else:
dict.__setitem__(self, key, value)
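# Illustrative sketch of the Sparse semantics described above: writing
# the default value removes the key again.
def _example_sparse_usage():
    counts = Sparse(default=0)
    counts['a'] = 2          # stored normally
    counts['a'] = 0          # equal to the default: the key is removed
    assert 'a' not in counts
    assert counts['a'] == 0  # missing keys still read as the default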
def ngrams(sequence, n=2):
"""Returns all ngram tokens in an input sequence, for a specified n.
E.g. ngrams(['A', 'B', 'A', 'B', 'D'], n=2) yields
('A', 'B'), ('B', 'A'), ('A', 'B'), ('B', 'D')
"""
window = []
for item in sequence:
window.append(item)
if len(window) > n:
# trim back to size
window = window[-n:]
if len(window) == n:
yield tuple(window)
def minargmin(sequence):
"""Returns the minimum value and the first index at which it can be
found in the input sequence."""
best = (None, None)
for (i, value) in enumerate(sequence):
if best[0] is None or value < best[0]:
best = (value, i)
return best
def zlog(x):
"""Logarithm which uses constant value for log(0) instead of -inf"""
assert x >= 0.0
if x == 0:
return LOGPROB_ZERO
return -math.log(x)
def _nt_zeros(constructor, zero=0):
"""Convenience function to return a namedtuple initialized to zeros,
without needing to know the number of fields."""
zeros = [zero] * len(constructor._fields)
return constructor(*zeros)
def weighted_sample(data, num_samples):
"""Samples with replacement from the data set so that the probability
of each data point being selected is proportional to the occurrence count.
Arguments:
data: A list of tuples (weight, ...)
num_samples: The number of samples to return
Returns:
a sorted list of indices to data
"""
tokens = sum(x[0] for x in data)
token_indices = sorted([random.randint(0, tokens - 1)
for _ in range(num_samples)])
data_indices = []
d = enumerate(x[0] for x in data)
di = 0
ti = -1
for sample_token_index in token_indices:
while ti < sample_token_index:
(di, weight) = next(d)
ti += weight
data_indices.append(di)
return data_indices
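# For example, weighted_sample([(3, 'a'), (1, 'b')], 2) returns two
# indices into data, each equal to 0 with probability 3/4 and to 1 with
# probability 1/4; sampling is with replacement.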
def _generator_progress(generator):
"""Prints a progress bar for visualizing flow through a generator.
The length of a generator is not known in advance, so the bar has
no fixed length. GENERATOR_DOT_FREQ controls the frequency of dots.
This function wraps the argument generator, returning a new generator.
"""
if GENERATOR_DOT_FREQ <= 0:
return generator
def _progress_wrapper(generator):
for (i, x) in enumerate(generator):
if i % GENERATOR_DOT_FREQ == 0:
sys.stderr.write('.')
sys.stderr.flush()
yield x
sys.stderr.write('\n')
return _progress_wrapper(generator)
def _is_string(obj):
try:
# Python 2
return isinstance(obj, basestring)
except NameError:
# Python 3
return isinstance(obj, str)
================================================
FILE: scripts/morfessor
================================================
#!/usr/bin/env python
import sys
import morfessor
from morfessor import _logger
def main(argv):
parser = morfessor.get_default_argparser()
try:
args = parser.parse_args(argv)
morfessor.main(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error %s %s" % (type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/morfessor-evaluate
================================================
#!/usr/bin/env python
import sys
import morfessor
import morfessor.evaluation
from morfessor import _logger
def main(argv):
parser = morfessor.get_evaluation_argparser()
try:
args = parser.parse_args(argv)
morfessor.main_evaluation(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error %s %s" % (type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/morfessor-segment
================================================
#!/usr/bin/env python
import argparse
import sys
import morfessor
from morfessor import _logger
def main(argv):
parser = morfessor.get_default_argparser()
parser.prog = "morfessor-segment"
parser.epilog = """
Simple usage example (load model.pickled and use it to segment test corpus):
%(prog)s -l model.pickled -o test_corpus.segmented test_corpus.txt
Interactive use (read corpus from user):
%(prog)s -l model.pickled -
"""
keep_options = ['encoding', 'loadfile', 'loadsegfile', 'outfile']
# FIXME Disabled to work around an argparse bug
#'help', 'version']
for action_group in parser._action_groups:
for arg in action_group._group_actions:
if arg.dest not in keep_options:
arg.help = argparse.SUPPRESS
parser.add_argument('testfiles', metavar='<file>', nargs='+',
help='corpus files to segment')
try:
args = parser.parse_args(argv)
morfessor.main(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error %s %s" % (type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/morfessor-train
================================================
#!/usr/bin/env python
import argparse
import sys
import morfessor
from morfessor import _logger
def main(argv):
parser = morfessor.get_default_argparser()
parser.prog = "morfessor-train"
parser.epilog = """
Simple usage example (train a model and save it to model.pickled):
%(prog)s -s model.pickled training_corpus.txt
Interactive use (read corpus from user):
%(prog)s -m online -v 2 -
"""
keep_options = ['savesegfile', 'savefile', 'trainmode', 'dampening',
'encoding', 'list', 'skips', 'annofile', 'develfile',
'fullretrain', 'threshold', 'morphtypes', 'morphlength',
'corpusweight', 'annotationweight', 'help', 'version']
for action_group in parser._action_groups:
for arg in action_group._group_actions:
if arg.dest not in keep_options:
arg.help = argparse.SUPPRESS
parser.add_argument('trainfiles', metavar='<file>', nargs='+',
help='training data files')
try:
args = parser.parse_args(argv)
morfessor.main(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error {} {}".format(type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/tools/morphlength_from_annotations.py
================================================
from __future__ import division
import fileinput
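# Each input line is expected to be a gold-standard annotation such as
# "segmentation seg ment ation, segment ation". The morph count of each
# word is averaged over its alternative analyses, and the script prints
# total word length divided by total morph count, i.e. the average morph
# length in characters.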
def main():
tot_morph_count = 0
tot_length = 0
for line in fileinput.input():
word, segm = line.strip().split(None, 1)
segmentations = segm.split(',')
        num_morphs = [len([x for x in s.split(None)
                           if x.strip().strip("~") != ""])
                      for s in segmentations]
tot_morph_count += sum(num_morphs) / len(num_morphs)
tot_length += len(word)
print(tot_length / tot_morph_count)
if __name__ == "__main__":
main()
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
from codecs import open
from ez_setup import use_setuptools
use_setuptools()
from setuptools import setup
import re
main_py = open('morfessor/__init__.py', encoding='utf-8').read()
metadata = dict(re.findall("__([a-z]+)__ = '([^']+)'", main_py))
requires = [
# 'progressbar',
]
setup(name='Morfessor',
version=metadata['version'],
author=metadata['author'],
author_email='morpho@aalto.fi',
url='http://morpho.aalto.fi/projects/morpho/morfessor2.html',
description='A tool for unsupervised and semi-supervised morphological segmentation',
packages=['morfessor', 'morfessor.test'],
classifiers=[
'Development Status :: 4 - Beta',
'Intended Audience :: Science/Research',
'License :: OSI Approved :: BSD License',
'Operating System :: OS Independent',
'Programming Language :: Python',
'Topic :: Scientific/Engineering',
],
license="BSD",
scripts=['scripts/morfessor',
'scripts/morfessor-train',
'scripts/morfessor-segment',
'scripts/morfessor-evaluate',
],
install_requires=requires,
extras_require={
'docs': [l.strip() for l in open('docs/build_requirements.txt')]
}
)
SYMBOL INDEX (177 symbols across 10 files)
FILE: ez_setup.py
function _python_cmd (line 34) | def _python_cmd(*args):
function _install (line 38) | def _install(tarball, install_args=()):
function _build_egg (line 66) | def _build_egg(egg, tarball, to_dir):
function _do_download (line 95) | def _do_download(version, download_base, to_dir, download_delay):
function use_setuptools (line 107) | def use_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
function download_setuptools (line 139) | def download_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
function _extractall (line 176) | def _extractall(self, path=".", members=None):
function _build_install_args (line 223) | def _build_install_args(options):
function _parse_args (line 235) | def _parse_args():
function main (line 251) | def main(version=DEFAULT_VERSION):
FILE: morfessor/__init__.py
function get_version (line 20) | def get_version():
FILE: morfessor/baseline.py
function _constructions_to_str (line 15) | def _constructions_to_str(constructions):
class BaselineModel (line 33) | class BaselineModel(object):
method __init__ (line 44) | def __init__(self, forcesplit_list=None, corpusweight=None,
method set_corpus_weight_updater (line 93) | def set_corpus_weight_updater(self, corpus_weight):
method _check_segment_only (line 103) | def _check_segment_only(self):
method tokens (line 108) | def tokens(self):
method types (line 113) | def types(self):
method _add_compound (line 117) | def _add_compound(self, compound, c):
method _remove (line 125) | def _remove(self, construction):
method _random_split (line 131) | def _random_split(self, compound, threshold):
method _set_compound_analysis (line 143) | def _set_compound_analysis(self, compound, parts, ptype='rbranch'):
method _update_annotation_choices (line 185) | def _update_annotation_choices(self):
method _best_analysis (line 214) | def _best_analysis(self, choices):
method _force_split (line 231) | def _force_split(self, compound):
method _test_skip (line 248) | def _test_skip(self, construction):
method _viterbi_optimize (line 257) | def _viterbi_optimize(self, compound, addcount=0, maxlen=30):
method _recursive_optimize (line 283) | def _recursive_optimize(self, compound):
method _recursive_split (line 305) | def _recursive_split(self, construction):
method _modify_construction_count (line 356) | def _modify_construction_count(self, construction, dcount):
method _epoch_update (line 390) | def _epoch_update(self, epoch_num):
method segmentation_to_splitloc (line 419) | def segmentation_to_splitloc(constructions):
method _splitloc_to_segmentation (line 429) | def _splitloc_to_segmentation(compound, splitloc):
method _join_constructions (line 444) | def _join_constructions(constructions):
method get_compounds (line 452) | def get_compounds(self):
method get_constructions (line 458) | def get_constructions(self):
method get_cost (line 463) | def get_cost(self):
method get_segmentations (line 471) | def get_segmentations(self):
method load_data (line 479) | def load_data(self, data, freqthreshold=1, count_modifier=None,
method load_segmentations (line 517) | def load_segmentations(self, segmentations):
method set_annotations (line 529) | def set_annotations(self, annotations, annotatedcorpusweight=None):
method segment (line 542) | def segment(self, compound):
method train_batch (line 560) | def train_batch(self, algorithm='recursive', algorithm_params=(),
method train_online (line 625) | def train_online(self, data, count_modifier=None, epoch_interval=10000,
method viterbi_segment (line 719) | def viterbi_segment(self, compound, addcount=1.0, maxlen=30):
method forward_logprob (line 812) | def forward_logprob(self, compound):
method viterbi_nbest (line 861) | def viterbi_nbest(self, compound, n, addcount=1.0, maxlen=30):
method get_corpus_coding_weight (line 962) | def get_corpus_coding_weight(self):
method set_corpus_coding_weight (line 965) | def set_corpus_coding_weight(self, weight):
method make_segment_only (line 969) | def make_segment_only(self):
method clear_segmentation (line 980) | def clear_segmentation(self):
class CorpusWeight (line 985) | class CorpusWeight(object):
method move_direction (line 987) | def move_direction(cls, model, direction, epoch):
class FixedCorpusWeight (line 1000) | class FixedCorpusWeight(CorpusWeight):
method __init__ (line 1001) | def __init__(self, weight):
method update (line 1004) | def update(self, model, _):
class AnnotationCorpusWeight (line 1009) | class AnnotationCorpusWeight(CorpusWeight):
method __init__ (line 1015) | def __init__(self, devel_set, threshold=0.01):
method update (line 1019) | def update(self, model, epoch):
method _boundary_recall (line 1032) | def _boundary_recall(cls, prediction, reference):
method _bpr_evaluation (line 1055) | def _bpr_evaluation(cls, prediction, reference):
method _estimate_segmentation_dir (line 1064) | def _estimate_segmentation_dir(self, segments, annotations):
class MorphLengthCorpusWeight (line 1088) | class MorphLengthCorpusWeight(CorpusWeight):
method __init__ (line 1089) | def __init__(self, morph_lenght, threshold=0.01):
method update (line 1093) | def update(self, model, epoch):
method calc_morph_length (line 1108) | def calc_morph_length(cls, model):
class NumMorphCorpusWeight (line 1122) | class NumMorphCorpusWeight(CorpusWeight):
method __init__ (line 1123) | def __init__(self, num_morph_types, threshold=0.01):
method update (line 1127) | def update(self, model, epoch):
class Encoding (line 1142) | class Encoding(object):
method __init__ (line 1149) | def __init__(self, weight=1.0):
method types (line 1165) | def types(self):
method _logfactorial (line 1173) | def _logfactorial(cls, n):
method frequency_distribution_cost (line 1186) | def frequency_distribution_cost(self):
method permutations_cost (line 1199) | def permutations_cost(self):
method update_count (line 1203) | def update_count(self, construction, old_count, new_count):
method get_cost (line 1211) | def get_cost(self):
class CorpusEncoding (line 1224) | class CorpusEncoding(Encoding):
method __init__ (line 1231) | def __init__(self, lexicon_encoding, weight=1.0):
method types (line 1236) | def types(self):
method frequency_distribution_cost (line 1243) | def frequency_distribution_cost(self):
method get_cost (line 1255) | def get_cost(self):
class AnnotatedCorpusEncoding (line 1270) | class AnnotatedCorpusEncoding(Encoding):
method __init__ (line 1276) | def __init__(self, corpus_coding, weight=None, penalty=-9999.9):
method set_constructions (line 1299) | def set_constructions(self, constructions):
method set_count (line 1308) | def set_count(self, construction, count):
method update_count (line 1318) | def update_count(self, construction, old_count, new_count):
method update_weight (line 1334) | def update_weight(self):
method get_cost (line 1346) | def get_cost(self):
class LexiconEncoding (line 1357) | class LexiconEncoding(Encoding):
method __init__ (line 1360) | def __init__(self):
method types (line 1366) | def types(self):
method add (line 1373) | def add(self, construction):
method remove (line 1384) | def remove(self, construction):
method get_codelength (line 1395) | def get_codelength(self, construction):
FILE: morfessor/cmd.py
function get_default_argparser (line 33) | def get_default_argparser():
function initialize_logging (line 292) | def initialize_logging(args):
function _viterbi_segment (line 331) | def _viterbi_segment(model, atoms, smooth, maxlen):
function _viterbi_nbest (line 336) | def _viterbi_nbest(model, atoms, nbest, smooth, maxlen):
function main (line 340) | def main(args):
function get_evaluation_argparser (line 571) | def get_evaluation_argparser():
function main_evaluation (line 648) | def main_evaluation(args):
FILE: morfessor/evaluation.py
function _sample (line 26) | def _sample(compound_list, size, seed):
class MorfessorEvaluationResult (line 32) | class MorfessorEvaluationResult(object):
method __init__ (line 49) | def __init__(self, meta_data=None):
method __getitem__ (line 59) | def __getitem__(self, item):
method add_data_point (line 67) | def add_data_point(self, precision, recall, f_score, sample_size):
method __str__ (line 78) | def __str__(self):
method _fill_cache (line 82) | def _fill_cache(self):
method _get_cache (line 91) | def _get_cache(self):
method format (line 97) | def format(self, format_string):
class MorfessorEvaluation (line 103) | class MorfessorEvaluation(object):
method __init__ (line 112) | def __init__(self, reference_annotations):
method _create_samples (line 121) | def _create_samples(self, configuration=EvaluationConfig(10, 1000)):
method get_samples (line 136) | def get_samples(self, configuration=EvaluationConfig(10, 1000)):
method _evaluate (line 148) | def _evaluate(self, prediction):
method _segmentation_indices (line 180) | def _segmentation_indices(annotation):
method evaluate_model (line 187) | def evaluate_model(self, model, configuration=EvaluationConfig(10, 1000),
method evaluate_segmentation (line 211) | def evaluate_segmentation(self, segmentation,
class WilcoxonSignedRank (line 240) | class WilcoxonSignedRank(object):
method _wilcoxon (line 250) | def _wilcoxon(d, method='pratt', correction=True):
method _rankdata (line 282) | def _rankdata(d):
method _norm_cum_pdf (line 296) | def _norm_cum_pdf(z):
method significance_test (line 300) | def significance_test(self, evaluations, val_property='fscore_values',
method print_table (line 321) | def print_table(results):
FILE: morfessor/exception.py
class MorfessorException (line 4) | class MorfessorException(Exception):
class ArgumentException (line 9) | class ArgumentException(Exception):
class InvalidCategoryError (line 14) | class InvalidCategoryError(MorfessorException):
method __init__ (line 16) | def __init__(self, category):
class InvalidOperationError (line 22) | class InvalidOperationError(MorfessorException):
method __init__ (line 23) | def __init__(self, operation, function_name):
class UnsupportedConfigurationError (line 29) | class UnsupportedConfigurationError(MorfessorException):
method __init__ (line 30) | def __init__(self, reason):
class SegmentOnlyModelException (line 36) | class SegmentOnlyModelException(MorfessorException):
method __init__ (line 37) | def __init__(self):
FILE: morfessor/io.py
class MorfessorIO (line 24) | class MorfessorIO(object):
method __init__ (line 34) | def __init__(self, encoding=None, construction_separator=' + ',
method read_segmentation_file (line 47) | def read_segmentation_file(self, file_name, has_counts=True, **kwargs):
method write_segmentation_file (line 71) | def write_segmentation_file(self, file_name, segmentations, **kwargs):
method read_corpus_files (line 93) | def read_corpus_files(self, file_names):
method read_corpus_list_files (line 103) | def read_corpus_list_files(self, file_names):
method read_corpus_file (line 113) | def read_corpus_file(self, file_name):
method read_corpus_list_file (line 128) | def read_corpus_list_file(self, file_name):
method read_annotations_file (line 146) | def read_annotations_file(self, file_name, construction_separator=' ',
method write_lexicon_file (line 176) | def write_lexicon_file(self, file_name, lexicon):
method read_binary_model_file (line 184) | def read_binary_model_file(self, file_name):
method read_binary_file (line 192) | def read_binary_file(file_name):
method write_binary_model_file (line 198) | def write_binary_model_file(self, file_name, model):
method write_binary_file (line 205) | def write_binary_file(file_name, obj):
method write_parameter_file (line 210) | def write_parameter_file(self, file_name, params):
method read_parameter_file (line 220) | def read_parameter_file(self, file_name):
method read_any_model (line 236) | def read_any_model(self, file_name):
method format_constructions (line 253) | def format_constructions(self, constructions, csep=None, atom_sep=None):
method _split_atoms (line 266) | def _split_atoms(self, construction):
method _open_text_file_write (line 273) | def _open_text_file_write(self, file_name):
method _open_text_file_read (line 290) | def _open_text_file_read(self, file_name):
method _read_text_file (line 325) | def _read_text_file(self, file_name, raw=False):
method _find_encoding (line 350) | def _find_encoding(*files):
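
MorfessorIO bundles all reading and writing: text corpora, segmentation and annotation files, parameter files, and pickled binary models. A minimal end-to-end sketch using only the public methods listed above; the file names are placeholders:

import morfessor

io = morfessor.MorfessorIO(encoding='utf-8')

# Train a baseline model on a plain-text corpus;
# read_corpus_file yields (count, compound_atoms) pairs.
model = morfessor.BaselineModel()
model.load_data(io.read_corpus_file('corpus.txt'))
model.train_batch()

# Persist the model both as a pickle and as a text segmentation file.
io.write_binary_model_file('model.bin', model)
io.write_segmentation_file('segments.txt', model.get_segmentations())

# Later: reload the model and segment a new word.
model = io.read_binary_model_file('model.bin')
print(model.viterbi_segment('uncommonly')[0])
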
FILE: morfessor/test/evaluation.py
class TestWilcoxon (line 6) | class TestWilcoxon(unittest.TestCase):
method setUp (line 7) | def setUp(self):
method test_norm_cum_pdf (line 10) | def test_norm_cum_pdf(self):
method test_accuracy_wilcoxon (line 13) | def test_accuracy_wilcoxon(self):
method test_wilcoxon_tie (line 36) | def test_wilcoxon_tie(self):
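
The test module exercises only the WilcoxonSignedRank implementation. With the package importable, it can be run directly with the standard unittest runner:

    python -m unittest morfessor.test.evaluation
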
FILE: morfessor/utils.py
function _dummy_lru_cache (line 15) | def _dummy_lru_cache(*args, **kwargs):
function _progress (line 43) | def _progress(iter_func):
class Sparse (line 123) | class Sparse(dict):
method __init__ (line 131) | def __init__(self, *pargs, **kwargs):
method __getitem__ (line 141) | def __getitem__(self, key):
method __setitem__ (line 147) | def __setitem__(self, key, value):
function ngrams (line 156) | def ngrams(sequence, n=2):
function minargmin (line 172) | def minargmin(sequence):
function zlog (line 182) | def zlog(x):
function _nt_zeros (line 190) | def _nt_zeros(constructor, zero=0):
function weighted_sample (line 197) | def weighted_sample(data, num_samples):
function _generator_progress (line 222) | def _generator_progress(generator):
function _is_string (line 244) | def _is_string(obj):
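
Most of these helpers are internal, but several signatures are self-describing. Plausible reference implementations for three of them, written from the names and signatures alone; the actual bodies in morfessor/utils.py may differ in detail:

import math

def ngrams(sequence, n=2):
    # Yield the consecutive n-grams of a sequence as tuples.
    for i in range(len(sequence) - n + 1):
        yield tuple(sequence[i:i + n])

def minargmin(sequence):
    # Return (minimum value, index of its first occurrence).
    index, value = min(enumerate(sequence), key=lambda pair: pair[1])
    return value, index

def zlog(x):
    # Negative logarithm that maps zero to infinity instead of raising,
    # convenient for probability-to-cost conversions.
    return float('inf') if x == 0 else -math.log(x)

assert list(ngrams('abcd', 2)) == [('a', 'b'), ('b', 'c'), ('c', 'd')]
assert minargmin([3, 1, 2]) == (1, 1)
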
FILE: scripts/tools/morphlength_from_annotations.py
function main (line 5) | def main():
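
This helper computes average morph statistics from annotation data read via fileinput (standard input or files given as arguments). A minimal sketch of such a computation, assuming whitespace-separated morphs on each input line; the actual script may differ in detail:

from __future__ import division
import fileinput

def main():
    tot_morph_count = 0
    tot_length = 0
    for line in fileinput.input():
        # Assumed format: one analysis per line, morphs separated
        # by whitespace.
        morphs = line.split()
        tot_morph_count += len(morphs)
        tot_length += sum(len(m) for m in morphs)
    if tot_morph_count > 0:
        print('morphs: %d, average length: %.3f'
              % (tot_morph_count, tot_length / tot_morph_count))

if __name__ == '__main__':
    main()
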