Repository: aalto-speech/morfessor
Branch: master
Commit: 5314d3ebc1be
Files: 32
Total size: 182.3 KB
Directory structure:
gitextract_ss7q8p0b/
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README
├── docs/
│ ├── Makefile
│ ├── README
│ ├── build_requirements.txt
│ ├── make.bat
│ └── source/
│ ├── cmdtools.rst
│ ├── conf.py
│ ├── filetypes.rst
│ ├── general.rst
│ ├── index.rst
│ ├── installation.rst
│ ├── libinterface.rst
│ └── license.rst
├── ez_setup.py
├── morfessor/
│ ├── __init__.py
│ ├── baseline.py
│ ├── cmd.py
│ ├── evaluation.py
│ ├── exception.py
│ ├── io.py
│ ├── test/
│ │ ├── __init__.py
│ │ └── evaluation.py
│ └── utils.py
├── scripts/
│ ├── morfessor
│ ├── morfessor-evaluate
│ ├── morfessor-segment
│ ├── morfessor-train
│ └── tools/
│ └── morphlength_from_annotations.py
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.py[cod]
# C extensions
*.so
# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
MANIFEST
env*
# Installer logs
pip-log.txt
# Unit test / coverage reports
.coverage
.tox
nosetests.xml
# Translations
*.mo
#Idea IDE
.idea
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
================================================
FILE: LICENSE
================================================
Copyright (c) 2012-2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: MANIFEST.in
================================================
include LICENSE
include ez_setup.py
include docs/build_requirements.txt
================================================
FILE: README
================================================
Morfessor 2.0 - Quick start
===========================
Installation
------------
Morfessor 2.0 is installed using the setuptools library for Python. To
build and install the module and scripts to default paths, type
python setup.py install
For details, see http://docs.python.org/install/
Documentation
-------------
User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions on how
to build the documentation can be found in docs/README.
The documentation is also available online at http://morfessor.readthedocs.org/
Details of the implemented algorithms and methods and a set of
experiments are described in the following technical report:
Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko
Kurimo. Morfessor 2.0: Python Implementation and Extensions for
Morfessor Baseline. Aalto University publication series SCIENCE +
TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN
978-952-60-5501-5.
The report is available online at
http://urn.fi/URN:ISBN:978-952-60-5501-5
Contact
-------
Questions or feedback? Email: morpho@aalto.fi
================================================
FILE: docs/Makefile
================================================
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext
help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " xml to make Docutils-native XML files"
@echo " pseudoxml to make pseudoxml-XML files for display purposes"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
clean:
rm -rf $(BUILDDIR)/*
html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."
json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."
htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/Morfessor.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/Morfessor.qhc"
devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/Morfessor"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/Morfessor"
@echo "# devhelp"
epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through platex and dvipdfmx..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."
info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
xml:
$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
@echo
@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
@echo
@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
================================================
FILE: docs/README
================================================
Generating Documentation
------------------------
The user instructions for Morfessor 2.0 are available as Sphinx source
files (see http://sphinx-doc.org/). To build the documentation you need
both the 'sphinx' and the 'sphinxcontrib-napoleon' packages. With a recent
version of pip you can run::
pip install -e .[docs]
to automatically install the required dependencies for making the docs.
After installing Sphinx, you can generate the documentation in different
formats using the Makefile or make.bat in the directory "docs". For
example, to generate a PDF file, type "make latexpdf", and to generate
a single HTML file, type "make singlehtml". Type "make help" to see
all available formats.
The documentation can also be read online at http://morfessor.readthedocs.org/
================================================
FILE: docs/build_requirements.txt
================================================
sphinx
sphinxcontrib-napoleon
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set BUILDDIR=build
set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source
set I18NSPHINXOPTS=%SPHINXOPTS% source
if NOT "%PAPER%" == "" (
set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%
)
if "%1" == "" goto help
if "%1" == "help" (
:help
echo.Please use `make ^<target^>` where ^<target^> is one of
echo. html to make standalone HTML files
echo. dirhtml to make HTML files named index.html in directories
echo. singlehtml to make a single large HTML file
echo. pickle to make pickle files
echo. json to make JSON files
echo. htmlhelp to make HTML files and a HTML help project
echo. qthelp to make HTML files and a qthelp project
echo. devhelp to make HTML files and a Devhelp project
echo. epub to make an epub
echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
echo. text to make text files
echo. man to make manual pages
echo. texinfo to make Texinfo files
echo. gettext to make PO message catalogs
echo. changes to make an overview over all changed/added/deprecated items
echo. xml to make Docutils-native XML files
echo. pseudoxml to make pseudoxml-XML files for display purposes
echo. linkcheck to check all external links for integrity
echo. doctest to run all doctests embedded in the documentation if enabled
goto end
)
if "%1" == "clean" (
for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
del /q /s %BUILDDIR%\*
goto end
)
%SPHINXBUILD% 2> nul
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
if "%1" == "html" (
%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/html.
goto end
)
if "%1" == "dirhtml" (
%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
goto end
)
if "%1" == "singlehtml" (
%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.
goto end
)
if "%1" == "pickle" (
%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the pickle files.
goto end
)
if "%1" == "json" (
%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can process the JSON files.
goto end
)
if "%1" == "htmlhelp" (
%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run HTML Help Workshop with the ^
.hhp project file in %BUILDDIR%/htmlhelp.
goto end
)
if "%1" == "qthelp" (
%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished; now you can run "qcollectiongenerator" with the ^
.qhcp project file in %BUILDDIR%/qthelp, like this:
echo.^> qcollectiongenerator %BUILDDIR%\qthelp\Morfessor.qhcp
echo.To view the help file:
echo.^> assistant -collectionFile %BUILDDIR%\qthelp\Morfessor.qhc
goto end
)
if "%1" == "devhelp" (
%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp
if errorlevel 1 exit /b 1
echo.
echo.Build finished.
goto end
)
if "%1" == "epub" (
%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The epub file is in %BUILDDIR%/epub.
goto end
)
if "%1" == "latex" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
if errorlevel 1 exit /b 1
echo.
echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdf" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf
cd %BUILDDIR%/..
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "latexpdfja" (
%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
cd %BUILDDIR%/latex
make all-pdf-ja
cd %BUILDDIR%/..
echo.
echo.Build finished; the PDF files are in %BUILDDIR%/latex.
goto end
)
if "%1" == "text" (
%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The text files are in %BUILDDIR%/text.
goto end
)
if "%1" == "man" (
%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The manual pages are in %BUILDDIR%/man.
goto end
)
if "%1" == "texinfo" (
%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.
goto end
)
if "%1" == "gettext" (
%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The message catalogs are in %BUILDDIR%/locale.
goto end
)
if "%1" == "changes" (
%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
if errorlevel 1 exit /b 1
echo.
echo.The overview file is in %BUILDDIR%/changes.
goto end
)
if "%1" == "linkcheck" (
%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
if errorlevel 1 exit /b 1
echo.
echo.Link check complete; look for any errors in the above output ^
or in %BUILDDIR%/linkcheck/output.txt.
goto end
)
if "%1" == "doctest" (
%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
if errorlevel 1 exit /b 1
echo.
echo.Testing of doctests in the sources finished, look at the ^
results in %BUILDDIR%/doctest/output.txt.
goto end
)
if "%1" == "xml" (
%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The XML files are in %BUILDDIR%/xml.
goto end
)
if "%1" == "pseudoxml" (
%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml
if errorlevel 1 exit /b 1
echo.
echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.
goto end
)
:end
================================================
FILE: docs/source/cmdtools.rst
================================================
Command line tools
==================
The installation process installs four scripts into the appropriate PATH.
morfessor
---------
The morfessor command is a full-featured script for training, updating models
and segmenting test data.
Loading existing model
~~~~~~~~~~~~~~~~~~~~~~
``-l <file>``
load :ref:`binary-model-def`
``-L <file>``
load :ref:`morfessor1-model-def`
Loading data
~~~~~~~~~~~~
``-t <file>, --traindata <file>``
Input corpus file(s) for training (text or bz2/gzipped text; use '-'
for standard input; add several times in order to append multiple files).
By default, all lines are split on whitespace and the tokens are used as
compounds. The ``--traindata-list`` option can be used to read all input
files as a list of compounds, one compound per line optionally prefixed by
a count. See :ref:`data-format-options` for changing the delimiters used for
separating compounds and atoms.
``--traindata-list``
Interpret all training files as list files instead of corpus files. A list
file contains one compound per line with optionally a count as prefix.
``-T <file>, --testdata <file>``
Input corpus file(s) to analyze (text or bz2/gzipped text; use '-' for
standard input; add several times in order to append multiple files). The
file is read in the same manner as an input corpus file. See
:ref:`data-format-options` for changing the delimiters used for
separating compounds and atoms.
Training model options
~~~~~~~~~~~~~~~~~~~~~~
``-m <mode>, --mode <mode>``
Morfessor can run in different modes, each doing different actions on the
model. The modes are:
none
Do not initialize or train a model. Can be used when just loading a model
for segmenting new data
init
Create a new model and load input data. Does not train the model
batch
Load an existing model (which is already initialized with training
data) and run :ref:`batch-training`
init+batch
Create a new model, load input data and run :ref:`batch-training`.
**Default**
online
Create a new model, read and train the model concurrently as described
in :ref:`online-training`
online+batch
First read and train the model concurrently as described in
:ref:`online-training` and after that retrain the model using
:ref:`batch-training`
``-a <algorithm>, --algorithm <algorithm>``
Algorithm to use for training:
recursive
Recursive, as described in :ref:`recursive-training` **Default**
viterbi
Viterbi as described in :ref:`viterbi-training`
``-d <type>, --dampening <type>``
Method for changing the compound counts in the input data. Options:
none
Do not alter the counts of compounds (token based training)
log
Change the count :math:`x` of a compound to :math:`\log(x)` (log-token
based training)
ones
Treat all compounds as if they only occurred once (type based training)
``-f <list>, --forcesplit <list>``
A list of atoms that always force the compound to be split. By
default only the hyphen (``-``) forces a split. Note the notation of the
argument list. To have no forced-split characters, use an empty string as the
argument (``-f ""``). To split on, for example, both the hyphen (``-``) and the
apostrophe (``'``), use ``-f "-'"``
``-F <float>, --finish-threshold <float>``
Stopping threshold. Training stops when the decrease in model cost of the
last iteration is smaller than finish_threshold * #boundaries (default
'0.005')
``-r <seed>, --randseed <seed>``
Seed for random number generator
``-R <float>, --randsplit <float>``
Initialize new words by random splitting using the
given split probability (default no splitting). See :ref:`rand-init`
``--skips``
Use random skips for frequently seen compounds to
speed up training. See :ref:`rand-skips`
``--batch-minfreq <int>``
Compound frequency threshold for batch training
(default 1)
``--max-epochs <int>``
Hard maximum of epochs in training
``--nosplit-re <regexp>``
If the expression matches the two surrounding
characters, do not allow splitting (default None)
``--online-epochint <int>``
Epoch interval for online training (default 10000)
``--viterbi-smoothing <float>``
Additive smoothing parameter for Viterbi training and
segmentation (default 0).
``--viterbi-maxlen <int>``
Maximum construction length in Viterbi training and
segmentation (default 30)
Saving model
~~~~~~~~~~~~
``-s <file>``
save :ref:`binary-model-def`
``-S <file>``
save :ref:`morfessor1-model-def`
``--save-reduced``
save :ref:`binary-reduced-model-def`
Examples
~~~~~~~~
Training a model from inputdata.txt, saving a :ref:`morfessor1-model-def` and
segmenting the test.txt set: ::
morfessor -t inputdata.txt -S model.segm -T test.txt
morfessor-train
---------------
The morfessor-train command is a convenience command that enables easier
training of Morfessor models.
The basic command structure is: ::
morfessor-train [arguments] traindata-file [traindata-file ...]
The arguments are identical to the ones for the `morfessor`_ command. The most
relevant are:
``-s <file>``
save binary model
``-S <file>``
save Morfessor 1.0 style model
``--save-reduced``
save reduced binary model
Examples
~~~~~~~~
Train a Morfessor model from a word-count list in ISO_8859-15, doing type-based
training, writing the log to log.log, and saving the model as model.bin: ::
morfessor-train --encoding=ISO_8859-15 --traindata-list --logfile=log.log -s model.bin -d ones traindata.txt
morfessor-segment
-----------------
The morfessor-segment command is a convenience command that enables easier
segmentation of test data with a Morfessor model.
The basic command structure is: ::
morfessor-segment [arguments] testcorpus-file [testcorpus-file ...]
The arguments are identical to the ones for the `morfessor`_ command. The most
relevant are:
``-l <file>``
load binary model (normal or reduced)
``-L <file>``
load Morfessor 1.0 style model
Examples
~~~~~~~~
Loading a binary model and segmenting the words in testdata.txt: ::
morfessor-segment -l model.bin testdata.txt
morfessor-evaluate
------------------
The morfessor-evaluate command is used for evaluating a Morfessor model against
a gold standard. If multiple models are evaluated, it reports statistically
significant differences between them.
The basic command structure is: ::
morfessor-evaluate [arguments] <goldstandard> <model> [<model> ...]
Positional arguments
~~~~~~~~~~~~~~~~~~~~
``<goldstandard>``
gold standard file in standard annotation format
``<model>``
model files to segment (either binary or Morfessor 1.0 style segmentation
models).
Optional arguments
~~~~~~~~~~~~~~~~~~
``-t TEST_SEGMENTATIONS, --testsegmentation TEST_SEGMENTATIONS``
Segmentation of the test set. Note that all words in the gold standard must
be segmented
``--num-samples <int>``
number of samples to take for testing
``--sample-size <int>``
size of each testing sample
``--format-string <format>``
Python new-style format string used to report evaluation results. The
available variables consist of a value and an action separated by an
underscore, e.g. fscore_avg for the average f-score. The available
values are "precision", "recall", "fscore", "samplesize" and the available
actions are "avg", "max", "min", "values", "count". A final meta-data variable
(without an action) is "name", the filename of the model. See also the
format-template option for predefined strings.
``--format-template <template>``
Uses a template string for the format-string options. Available templates
are: default, table and latex. If format-string is defined this option is
ignored.
Examples
~~~~~~~~
Evaluating three different models against a gold standard, outputting the
results in LaTeX table format: ::
morfessor-evaluate --format-template=latex goldstd.txt model1.bin model2.segm model3.bin
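Reporting only the model name and the average f-score with a custom format
string (file names as in the previous example): ::
morfessor-evaluate --format-string '{name} {fscore_avg}' goldstd.txt model1.bin model2.segm model3.bin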
.. _data-format-options:
Data format command line options
--------------------------------
``--encoding <encoding>``
Encoding of input and output files (if none is given, both the local
encoding and UTF-8 are tried).
``--lowercase``
lowercase input data
``--traindata-list``
input file(s) for batch training are lists (one compound per line,
optionally count as a prefix)
``--atom-separator <regexp>``
atom separator regexp (default None)
``--compound-separator <regexp>``
compound separator regexp (default '\s+')
``--analysis-separator <str>``
separator for different analyses in an annotation file. Use NONE for only
allowing one analysis per line
``--output-format <format>``
format string for --output file (default: '{analysis}\\n'). Valid keywords
are: ``{analysis}`` = constructions of the compound, ``{compound}`` =
compound string, ``{count}`` = count of the compound (currently always 1),
``{logprob}`` = log-probability of the analysis, and ``{clogprob}`` =
log-probability of the compound. Valid escape sequences are ``\n`` (newline)
and ``\t`` (tab). An example invocation is shown after this list.
``--output-format-separator <str>``
construction separator for analysis in --output file (default: ' ')
``--output-newlines``
for each newline in input, print newline in --output file (default: 'False')
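For example, a hypothetical invocation that combines the options above to
segment a test file, writing each compound with its analysis and
log-probability separated by tabs: ::
morfessor -l model.bin -T test.txt --output out.txt --output-format '{compound}\t{analysis}\t{logprob}\n'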
Universal command line options
------------------------------
``-v <int>, --verbose <int>``
verbose level; controls what is written to the standard error stream or log file (default 1)
``--logfile <file>``
write log messages to file in addition to standard error stream
``--progressbar``
Force the progressbar to be displayed (possibly lowers the log level for the standard error stream)
``-h, --help``
show this help message and exit
``--version``
show version number and exit
Morfessor features
==================
All features below are described briefly, mainly to guide the choice of the
right value for a certain parameter. These features are explained in detail in
the :ref:`morfessor-tech-report`.
.. _`batch-training`:
Batch training
--------------
In batch training, each epoch consists of an iteration over the full training
data. Epochs are repeated until the model cost converges. All training data
must be loaded before the training starts.
.. _`online-training`:
Online training
---------------
In online training the model is updated while the data is being added. This
allows for rapid testing and prototyping. All data is processed only once,
hence it is advisable to run :ref:`batch-training` afterwards. The size of an
epoch is a fixed, predefined number of compounds processed. The only use of an
epoch for online training is to select the best annotations in semi-supervised
training.
.. _`recursive-training`:
Recursive training
------------------
In recursive training, each compound is processed in the following manner. The
current split for the compound is removed from the model and its constructions
are updated accordingly. After this, all possible splits are tried, by choosing
one split and running the algorithm recursively on the created constructions.
In the end, the best split is selected and the training continues with the next
compound.
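The following minimal Python sketch illustrates the recursive search. It is an
illustration only, not Morfessor's actual implementation: the real algorithm
updates the model counts and evaluates the full model cost while searching,
whereas here ``cost`` is assumed to be a user-supplied function that scores a
tuple of constructions (lower is better): ::
def recursive_split(construction, cost, cache=None):
    """Return the lowest-cost segmentation of construction as a tuple."""
    if cache is None:
        cache = {}
    if construction in cache:
        return cache[construction]
    # Start from the unsplit construction ...
    best = (construction,)
    best_cost = cost(best)
    # ... then try each split point, recursing on both created parts
    for i in range(1, len(construction)):
        parts = (recursive_split(construction[:i], cost, cache)
                 + recursive_split(construction[i:], cost, cache))
        if cost(parts) < best_cost:
            best, best_cost = parts, cost(parts)
    cache[construction] = best
    return best
# Example: with a suitable cost function, recursive_split('kahvikakku', cost)
# could return ('kahvi', 'kakku')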
.. _`viterbi-training`:
Local Viterbi training
----------------------
In Local Viterbi training the compounds are processed sequentially. Each
compound is removed from the corpus and afterwards segmented using Viterbi
segmentation. The result is put back into the model.
In order to allow new constructions to be created, the smoothing parameter
must be given some non-zero value.
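As an illustration, the Viterbi search for the best segmentation of a single
compound can be sketched as below. This is a simplified sketch, not
Morfessor's implementation; ``logprob`` is assumed to be a user-supplied
function returning the (smoothed) log-probability of a construction, and the
default ``maxlen`` mirrors the ``--viterbi-maxlen`` default: ::
def viterbi_segment(word, logprob, maxlen=30):
    """Return the most probable segmentation of word as a list."""
    # best[i] = (cost, j): the best segmentation of the prefix word[:i]
    # has this cost and its last construction starts at index j
    best = [(0.0, 0)]
    for i in range(1, len(word) + 1):
        best.append(min(
            (best[j][0] - logprob(word[j:i]), j)
            for j in range(max(0, i - maxlen), i)))
    # Trace the split points backwards
    parts, i = [], len(word)
    while i > 0:
        j = best[i][1]
        parts.append(word[j:i])
        i = j
    return parts[::-1]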
.. _`rand-skips`:
Random skips
------------
In Random skips, frequently seen compounds are skipped in training with a
random probability. As shown in the :ref:`morfessor-tech-report` this speeds
up the training considerably with only a minor loss in model performance.
.. _`rand-init`:
Random initialization
---------------------
In random initialization all compounds are split randomly. Each possible
boundary is made a split with the given probability.
Selecting a good random initialization parameter helps in finding better local
optima, as long as the split probability is high enough.
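A minimal sketch of this initialization (an illustration, not the actual
implementation): ::
import random

def random_split(compound, threshold):
    """Split compound at each possible boundary with probability threshold."""
    parts, start = [], 0
    for i in range(1, len(compound)):
        if random.random() < threshold:
            parts.append(compound[start:i])
            start = i
    parts.append(compound[start:])
    return parts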
.. _`corpusweight`:
Corpusweight (alpha) tuning
---------------------------
An important parameter of the Morfessor Baseline model is the corpusweight
(:math:`\alpha`), which balances the cost of the lexicon and the corpus. There
are different options available for tuning this weight:
Fixed weight (``--corpusweight``)
The weight is fixed at the beginning of the training and does not change
Development set (``--develset``)
A development set is used to balance the corpusweight so that the precision
and recall of segmenting the development set will be equal
Morph length (``--morph-length``)
The corpusweight is tuned so that the average length of morphs in the
lexicon will be as desired
Num morph types (``--num-morph-types``)
The corpusweight is tuned so that there will be approximately the desired
number of morph types in the lexicon
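For example, to tune the corpus weight on a development set while training
(the file names here are hypothetical): ::
morfessor -t traindata.txt -S model.segm --develset develdata.txt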
================================================
FILE: docs/source/conf.py
================================================
# -*- coding: utf-8 -*-
#
# Morfessor documentation build configuration file, created by
# sphinx-quickstart on Wed Dec 4 13:41:43 2013.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sys
import os
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.mathjax',
'sphinxcontrib.napoleon',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'Morfessor'
copyright = u'2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '2.0'
# The full version, including alpha/beta/rc tags.
release = '2.0.6'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = []
# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []
# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
#html_domain_indices = True
# If false, no index is generated.
#html_use_index = True
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None
# Output file base name for HTML help builder.
htmlhelp_basename = 'Morfessordoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'Morfessor.tex', u'Morfessor Documentation',
u'Sami Virpioja and Peter Smit', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# If true, show page references after internal links.
#latex_show_pagerefs = False
# If true, show URL addresses after external links.
#latex_show_urls = False
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'morfessor', u'Morfessor Documentation',
[u'Sami Virpioja and Peter Smit'], 1)
]
# If true, show URL addresses after external links.
#man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'Morfessor', u'Morfessor Documentation',
u'Sami Virpioja and Peter Smit', 'Morfessor',
'Tool for unsupervised and semi-supervised morphological segmentation.',
'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
#texinfo_appendices = []
# If false, no module index is generated.
#texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
================================================
FILE: docs/source/filetypes.rst
================================================
Morfessor file types
====================
.. _binary-model-def:
Binary model
------------
.. warning::
Pickled models are sensitive to bitrot. Sometimes incompatibilities exist
between Python versions that prevent loading a model stored by a different
version. Also, future versions of Morfessor are not guaranteed to be able to
load models saved by older versions.
The standard format for Morfessor 2.0 is a binary model, generated by pickling
the :ref:`BaselineModel <baseline-model-label>` object. This ensures that all
training data, annotation data, and weights are exactly the same as when the
model was saved.
.. _binary-reduced-model-def:
Reduced Binary model
--------------------
A reduced Morfessor model contains only the information that is necessary for
segmenting new words using (n-best) Viterbi segmentation. Reduced binary models
are much smaller than the full models, but no model-modifying actions can be
performed.
.. _morfessor1-model-def:
Morfessor 1.0 style text model
------------------------------
Morfessor 2.0 also supports the text model files that are used in Morfessor
1.0. These files consist of one segmentation per line, preceded by a count,
where the constructions are separated by ' + '.
Specification: ::
<int><space><CONSTRUCTION>[<space>+<space><CONSTRUCTION>]*
Example: ::
10 kahvi + kakku
5 kahvi + kilo + n
24 kahvi + kone + emme
Text corpus file
----------------
A text corpus file is a free-format text file. All lines are split into
compounds using the compound separator (default <space>). The compounds are
then split into atoms using the atom separator. Compounds can occur multiple
times and will be counted as such.
Example: ::
kahvikakku kahvikilon kahvikilon
kahvikoneemme kahvikakku
Word list file
--------------
A word list corpus file contains one compound per line, possibly preceded by a
count. If multiple entries of the same word occur, their counts are summed. If
no count is given, a count of one is assumed (per entry).
Specification: ::
[<int><space>]<COMPOUND>
Example 1: ::
10 kahvikakku
5 kahvikilon
24 kahvikoneemme
Example 2: ::
kahvikakku
kahvikilon
kahvikoneemme
Annotation file
---------------
An annotation file contains, on each line, one compound followed by one or
more annotations. The separators between the annotations (default ', ')
and between the constructions (default ' ') are configurable.
Specification: ::
<compound> <analysis1construction1>[ <analysis1constructionN>][, <analysis2construction1> [<analysis2constructionN>]*]*
Example: ::
kahvikakku kahvi kakku, kahvi kak ku
kahvikilon kahvi kilon
kahvikoneemme kahvi konee mme, kah vi ko nee mme
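Annotation files can be read through the library interface. A short sketch,
assuming the example above is saved as annotations.txt and that
``read_annotations_file`` returns a mapping from each compound to its list of
analyses: ::
import morfessor

io = morfessor.MorfessorIO()
annotations = io.read_annotations_file('annotations.txt')
for compound, analyses in annotations.items():
    print(compound, analyses)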
================================================
FILE: docs/source/general.rst
================================================
General
=======
.. _morfessor-tech-report:
Morfessor 2.0 Technical Report
------------------------------
The work done in Morfessor 2.0 is described in detail in the Morfessor 2.0
Technical Report [TechRep]_. The report is available for download from
http://urn.fi/URN:ISBN:978-952-60-5501-5.
Terminology
-----------
Unlike previous Morfessor implementations, Morfessor 2.0 is, in
principle, applicable to any string segmentation task. Thus we use
terms that are not specific to the morphological segmentation task.
The task of the algorithm is to find a set of *constructions* that
describe the provided training corpus efficiently and accurately. The
training corpus contains a collection of *compounds*, which are the
largest sequences that a single construction can hold. The smallest
pieces of constructions and compounds are called *atoms*.
For example, in morphological segmentation, compounds are word forms,
constructions are morphs, and atoms are characters. In chunking,
compounds are sentences, constructions are phrases, and atoms are
words.
Citing
------
The authors kindly ask that you cite the Morfessor 2.0 technical report
[TechRep]_ when using this tool in academic publications.
In addition, when you refer to the Morfessor algorithms, you should cite the
respective publications where they have been introduced. For example, the first
Morfessor algorithm was published in [Creutz2002]_ and the semi-supervised
extension in [Kohonen2010]_. See [TechRep]_ for further information on the
relevant publications.
.. [TechRep] Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. Aalto University publication series SCIENCE + TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN 978-952-60-5501-5.
.. [Creutz2002] Mathias Creutz and Krista Lagus. Unsupervised discovery of morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pages 21-30, Philadelphia, Pennsylvania, 11 July, 2002.
.. [Kohonen2010] Oskar Kohonen, Sami Virpioja and Krista Lagus. Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 78-86, Uppsala, Sweden, July 2010. Association for Computational Linguistics.
================================================
FILE: docs/source/index.rst
================================================
.. Morfessor documentation master file, created by
sphinx-quickstart on Wed Dec 4 13:41:43 2013.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Morfessor 2.0 documentation
=====================================
.. note:: The Morfessor 2.0 documentation is still a work in progress and
contains some unfinished parts
Contents:
.. toctree::
:maxdepth: 2
license
general
installation
filetypes
cmdtools
libinterface
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/source/installation.rst
================================================
Installation instructions
=========================
Morfessor 2.0 is installed using the setuptools library for Python. Morfessor can
be installed from the packages available on the
`Morpho project homepage`_ and the `Morfessor Github page`_, or can be
directly installed from the `Python Package Index (PyPI)`_.
The Morfessor packages are created using the current Python packaging
standards, as described on http://docs.python.org/install/. Morfessor packages
are fully compatible with, and recommended to run in, virtual environments as
described on http://virtualenv.org.
Installation from tarball or zip file
-------------------------------------
The Morfessor 2.0 tarball and zip files can be downloaded from the
`Morpho project homepage`_ (latest stable version) or from the
`Morfessor Github page`_ (all versions).
The tarball can be installed in two different ways. The first is to unpack the
tarball or zip file and run::
python setup.py install
A second method is to use the tool pip on the tarball or zip file directly::
pip install morfessor-VERSION.tar.gz
Installation from PyPI
----------------------
Morfessor 2.0 is also distributed through the `Python Package Index (PyPI)`_.
This means that tools like pip and easy_install can automatically download and
install the latest version of Morfessor.
Simply type::
pip install morfessor
or::
easy_install morfessor
to install the Morfessor library and tools.
.. _Morpho project homepage: http://morpho.aalto.fi
.. _Morfessor Github page: https://github.com/aalto-speech/morfessor/releases
.. _Python Package Index (PyPI): https://pypi.python.org/pypi/Morfessor
================================================
FILE: docs/source/libinterface.rst
================================================
Python library interface to Morfessor
=====================================
Morfessor 2.0 contains a library interface so that it can be integrated into
other Python applications. The public members are documented below and should
remain relatively stable across Morfessor versions. Private members are
documented in the code and may change at any time between releases.
IO class
--------
.. automodule:: morfessor.io
:members:
.. _baseline-model-label:
Model classes
-------------
.. automodule:: morfessor.baseline
:members:
Evaluation classes
------------------
.. automodule:: morfessor.evaluation
:members:
Code Examples for using library interface
=========================================
Segmenting new data using an existing model
-------------------------------------------
::
import morfessor
io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')
words = ['words', 'segmenting', 'morfessor', 'unsupervised']
for word in words:
print(model.viterbi_segment(word))
Testing type vs token models
----------------------------
::
import math
import morfessor
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('training_data'))
model_types = morfessor.BaselineModel()
model_logtokens = morfessor.BaselineModel()
model_tokens = morfessor.BaselineModel()
model_types.load_data(train_data, count_modifier=lambda x: 1)
def log_func(x):
return int(round(math.log(x + 1, 2)))
model_logtokens.load_data(train_data, count_modifier=log_func)
model_tokens.load_data(train_data)
models = [model_types, model_logtokens, model_tokens]
for model in models:
model.train_batch()
goldstd_data = io.read_annotations_file('gold_std')
ev = morfessor.MorfessorEvaluation(goldstd_data)
results = [ev.evaluate_model(m) for m in models]
wsr = morfessor.WilcoxonSignedRank()
r = wsr.significance_test(results)
WilcoxonSignedRank.print_table(r)
The equivalent of this on the command line would be: ::
morfessor-train -s model_types -d ones training_data
morfessor-train -s model_logtokens -d log training_data
morfessor-train -s model_tokens training_data
morfessor-evaluate gold_std model_types model_logtokens model_tokens
Testing different amounts of supervision data
---------------------------------------------
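A sketch of such an experiment is given below. The file names are
hypothetical, and it is assumed that annotations are attached to a model with
``set_annotations``; consult the model class documentation for the exact
interface: ::
import morfessor

io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('training_data'))
annotations = io.read_annotations_file('annotation_data')

models = []
for amount in [100, 1000, 10000]:
    # Use an increasing amount of the available supervision data
    subset = dict(list(annotations.items())[:amount])
    model = morfessor.BaselineModel()
    model.load_data(train_data)
    model.set_annotations(subset)
    model.train_batch()
    models.append(model)

goldstd_data = io.read_annotations_file('gold_std')
ev = morfessor.MorfessorEvaluation(goldstd_data)
results = [ev.evaluate_model(m) for m in models]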
================================================
FILE: docs/source/license.rst
================================================
License
=======
Copyright (c) 2012-2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
================================================
FILE: ez_setup.py
================================================
#!python
"""Bootstrap setuptools installation
If you want to use setuptools in your package's setup.py, just include this
file in the same directory with it, and add this to the top of your setup.py::
from ez_setup import use_setuptools
use_setuptools()
If you want to require a specific version of setuptools, set a download
mirror, or use an alternate download directory, you can do so by supplying
the appropriate options to ``use_setuptools()``.
This file can also be run as a script to install or upgrade setuptools.
"""
import os
import shutil
import sys
import tempfile
import tarfile
import optparse
import subprocess
from distutils import log
try:
from site import USER_SITE
except ImportError:
USER_SITE = None
DEFAULT_VERSION = "0.9.6"
DEFAULT_URL = "https://pypi.python.org/packages/source/s/setuptools/"
def _python_cmd(*args):
args = (sys.executable,) + args
return subprocess.call(args) == 0
def _install(tarball, install_args=()):
# extracting the tarball
tmpdir = tempfile.mkdtemp()
log.warn('Extracting in %s', tmpdir)
old_wd = os.getcwd()
try:
os.chdir(tmpdir)
tar = tarfile.open(tarball)
_extractall(tar)
tar.close()
# going in the directory
subdir = os.path.join(tmpdir, os.listdir(tmpdir)[0])
os.chdir(subdir)
log.warn('Now working in %s', subdir)
# installing
log.warn('Installing Setuptools')
if not _python_cmd('setup.py', 'install', *install_args):
log.warn('Something went wrong during the installation.')
log.warn('See the error message above.')
# exitcode will be 2
return 2
finally:
os.chdir(old_wd)
shutil.rmtree(tmpdir)
def _build_egg(egg, tarball, to_dir):
# extracting the tarball
tmpdir = tempfile.mkdtemp()
log.warn('Extracting in %s', tmpdir)
old_wd = os.getcwd()
try:
os.chdir(tmpdir)
tar = tarfile.open(tarball)
_extractall(tar)
tar.close()
# going in the directory
subdir = os.path.join(tmpdir, os.listdir(tmpdir)[0])
os.chdir(subdir)
log.warn('Now working in %s', subdir)
# building an egg
log.warn('Building a Setuptools egg in %s', to_dir)
_python_cmd('setup.py', '-q', 'bdist_egg', '--dist-dir', to_dir)
finally:
os.chdir(old_wd)
shutil.rmtree(tmpdir)
# returning the result
log.warn(egg)
if not os.path.exists(egg):
raise IOError('Could not build the egg.')
def _do_download(version, download_base, to_dir, download_delay):
egg = os.path.join(to_dir, 'setuptools-%s-py%d.%d.egg'
% (version, sys.version_info[0], sys.version_info[1]))
if not os.path.exists(egg):
tarball = download_setuptools(version, download_base,
to_dir, download_delay)
_build_egg(egg, tarball, to_dir)
sys.path.insert(0, egg)
import setuptools
setuptools.bootstrap_install_from = egg
def use_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
to_dir=os.curdir, download_delay=15):
# making sure we use the absolute path
to_dir = os.path.abspath(to_dir)
was_imported = 'pkg_resources' in sys.modules or \
'setuptools' in sys.modules
try:
import pkg_resources
except ImportError:
return _do_download(version, download_base, to_dir, download_delay)
try:
pkg_resources.require("setuptools>=" + version)
return
except pkg_resources.VersionConflict:
e = sys.exc_info()[1]
if was_imported:
sys.stderr.write(
"The required version of setuptools (>=%s) is not available,\n"
"and can't be installed while this script is running. Please\n"
"install a more recent version first, using\n"
"'easy_install -U setuptools'."
"\n\n(Currently using %r)\n" % (version, e.args[0]))
sys.exit(2)
else:
del pkg_resources, sys.modules['pkg_resources'] # reload ok
return _do_download(version, download_base, to_dir,
download_delay)
except pkg_resources.DistributionNotFound:
return _do_download(version, download_base, to_dir,
download_delay)
def download_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
to_dir=os.curdir, delay=15):
"""Download setuptools from a specified location and return its filename
`version` should be a valid setuptools version number that is available
as an egg for download under the `download_base` URL (which should end
with a '/'). `to_dir` is the directory where the egg will be downloaded.
`delay` is the number of seconds to pause before an actual download
attempt.
"""
# making sure we use the absolute path
to_dir = os.path.abspath(to_dir)
try:
from urllib.request import urlopen
except ImportError:
from urllib2 import urlopen
tgz_name = "setuptools-%s.tar.gz" % version
url = download_base + tgz_name
saveto = os.path.join(to_dir, tgz_name)
src = dst = None
if not os.path.exists(saveto): # Avoid repeated downloads
try:
log.warn("Downloading %s", url)
src = urlopen(url)
# Read/write all in one block, so we don't create a corrupt file
# if the download is interrupted.
data = src.read()
dst = open(saveto, "wb")
dst.write(data)
finally:
if src:
src.close()
if dst:
dst.close()
return os.path.realpath(saveto)
def _extractall(self, path=".", members=None):
"""Extract all members from the archive to the current working
directory and set owner, modification time and permissions on
directories afterwards. `path' specifies a different directory
to extract to. `members' is optional and must be a subset of the
list returned by getmembers().
"""
import copy
import operator
from tarfile import ExtractError
directories = []
if members is None:
members = self
for tarinfo in members:
if tarinfo.isdir():
# Extract directories with a safe mode.
directories.append(tarinfo)
tarinfo = copy.copy(tarinfo)
tarinfo.mode = 448 # decimal for oct 0700
self.extract(tarinfo, path)
# Reverse sort directories.
if sys.version_info < (2, 4):
def sorter(dir1, dir2):
return cmp(dir1.name, dir2.name)
directories.sort(sorter)
directories.reverse()
else:
directories.sort(key=operator.attrgetter('name'), reverse=True)
# Set correct owner, mtime and filemode on directories.
for tarinfo in directories:
dirpath = os.path.join(path, tarinfo.name)
try:
self.chown(tarinfo, dirpath)
self.utime(tarinfo, dirpath)
self.chmod(tarinfo, dirpath)
except ExtractError:
e = sys.exc_info()[1]
if self.errorlevel > 1:
raise
else:
self._dbg(1, "tarfile: %s" % e)
def _build_install_args(options):
"""
Build the arguments to 'python setup.py install' on the setuptools package
"""
install_args = []
if options.user_install:
if sys.version_info < (2, 6):
log.warn("--user requires Python 2.6 or later")
raise SystemExit(1)
install_args.append('--user')
return install_args
def _parse_args():
"""
Parse the command line for options
"""
parser = optparse.OptionParser()
parser.add_option(
'--user', dest='user_install', action='store_true', default=False,
help='install in user site package (requires Python 2.6 or later)')
parser.add_option(
'--download-base', dest='download_base', metavar="URL",
default=DEFAULT_URL,
help='alternative URL from where to download the setuptools package')
options, args = parser.parse_args()
# positional arguments are ignored
return options
def main(version=DEFAULT_VERSION):
"""Install or upgrade setuptools and EasyInstall"""
options = _parse_args()
tarball = download_setuptools(download_base=options.download_base)
return _install(tarball, _build_install_args(options))
if __name__ == '__main__':
sys.exit(main())
================================================
FILE: morfessor/__init__.py
================================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Morfessor 2.0 - Python implementation of the Morfessor method
"""
import logging
__all__ = ['MorfessorException', 'ArgumentException', 'MorfessorIO',
'BaselineModel', 'main', 'get_default_argparser', 'main_evaluation',
'get_evaluation_argparser']
__version__ = '2.0.6'
__author__ = 'Sami Virpioja, Peter Smit, Stig-Arne Grönroos'
__author_email__ = "morpho@aalto.fi"
_logger = logging.getLogger(__name__)
def get_version():
return __version__
# The public api imports need to be at the end of the file,
# so that the package global names are available to the modules
# when they are imported.
from .baseline import BaselineModel, FixedCorpusWeight, AnnotationCorpusWeight, \
NumMorphCorpusWeight, MorphLengthCorpusWeight
from .cmd import main, get_default_argparser, main_evaluation, \
get_evaluation_argparser
from .exception import MorfessorException, ArgumentException
from .io import MorfessorIO
from .utils import _progress
from .evaluation import MorfessorEvaluation, MorfessorEvaluationResult
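# A minimal sketch of using the public API exposed above (the corpus file
# name is hypothetical; the segmentation output depends on the trained
# model):
#
#     import morfessor
#     io = morfessor.MorfessorIO()
#     model = morfessor.BaselineModel()
#     model.load_data(io.read_corpus_files(['corpus.txt']))
#     model.train_batch()
#     print(model.viterbi_segment('uninvited')[0])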
================================================
FILE: morfessor/baseline.py
================================================
import collections
import heapq
import logging
import math
import numbers
import random
import re
from .utils import _progress, _is_string
from .exception import MorfessorException, SegmentOnlyModelException
_logger = logging.getLogger(__name__)
def _constructions_to_str(constructions):
"""Return a readable string for a list of constructions."""
if _is_string(constructions[0]):
# Constructions are strings
return ' + '.join(constructions)
else:
# Constructions are not strings (should be tuples of strings)
return ' + '.join(map(lambda x: ' '.join(x), constructions))
# rcount = root count (from corpus)
# count = total count of the node
# splitloc = integer or tuple. Location(s) of the possible splits for virtual
# constructions; empty tuple or 0 if real construction
ConstrNode = collections.namedtuple('ConstrNode',
['rcount', 'count', 'splitloc'])
class BaselineModel(object):
"""Morfessor Baseline model class.
Implements training of and segmenting with a Morfessor model. The model
is complete agnostic to whether it is used with lists of strings (finding
phrases in sentences) or strings of characters (finding morphs in words).
"""
penalty = -9999.9
def __init__(self, forcesplit_list=None, corpusweight=None,
use_skips=False, nosplit_re=None):
"""Initialize a new model instance.
Arguments:
forcesplit_list: force segmentations on the characters in
the given list
corpusweight: weight for the corpus cost
use_skips: randomly skip frequently occurring constructions
to speed up training
nosplit_re: regular expression string for preventing splitting
in certain contexts
"""
# A ConstrNode is stored in the analyses for each construction. All
# training data has an rcount (real count) > 0. All real morphemes
# have no split locations.
self._analyses = {}
# Flag to indicate the model is only useful for segmentation
self._segment_only = False
# Cost variables
self._lexicon_coding = LexiconEncoding()
self._corpus_coding = CorpusEncoding(self._lexicon_coding)
self._annot_coding = None
#Set corpus weight updater
self.set_corpus_weight_updater(corpusweight)
# Configuration variables
self._use_skips = use_skips # Random skips for frequent constructions
self._supervised = False
# Counter for random skipping
self._counter = collections.Counter()
if forcesplit_list is None:
self.forcesplit_list = []
else:
self.forcesplit_list = forcesplit_list
if nosplit_re is None:
self.nosplit_re = None
else:
self.nosplit_re = re.compile(nosplit_re, re.UNICODE)
# Used only for (semi-)supervised learning
self.annotations = None
def set_corpus_weight_updater(self, corpus_weight):
if corpus_weight is None:
self._corpus_weight_updater = FixedCorpusWeight(1.0)
elif isinstance(corpus_weight, numbers.Number):
self._corpus_weight_updater = FixedCorpusWeight(corpus_weight)
else:
self._corpus_weight_updater = corpus_weight
self._corpus_weight_updater.update(self, 0)
def _check_segment_only(self):
if self._segment_only:
raise SegmentOnlyModelException()
@property
def tokens(self):
"""Return the number of construction tokens."""
return self._corpus_coding.tokens
@property
def types(self):
"""Return the number of construction types."""
return self._corpus_coding.types - 1 # do not include boundary
def _add_compound(self, compound, c):
"""Add compound with count c to data."""
self._corpus_coding.boundaries += c
self._modify_construction_count(compound, c)
oldrc = self._analyses[compound].rcount
self._analyses[compound] = \
self._analyses[compound]._replace(rcount=oldrc + c)
def _remove(self, construction):
"""Remove construction from model."""
rcount, count, splitloc = self._analyses[construction]
self._modify_construction_count(construction, -count)
return rcount, count
def _random_split(self, compound, threshold):
"""Return a random split for compound.
Arguments:
compound: compound to split
threshold: probability of splitting at each position
"""
splitloc = tuple(i for i in range(1, len(compound))
if random.random() < threshold)
return self._splitloc_to_segmentation(compound, splitloc)
def _set_compound_analysis(self, compound, parts, ptype='rbranch'):
"""Set analysis of compound to according to given segmentation.
Arguments:
compound: compound to split
parts: desired constructions of the compound
ptype: type of the parse tree to use
If ptype is 'rbranch', the analysis is stored internally as a
right-branching tree. If ptype is 'flat', the analysis is stored
directly to the compound's node.
"""
if len(parts) == 1:
rcount, count = self._remove(compound)
self._analyses[compound] = ConstrNode(rcount, 0, tuple())
self._modify_construction_count(compound, count)
elif ptype == 'flat':
rcount, count = self._remove(compound)
splitloc = self.segmentation_to_splitloc(parts)
self._analyses[compound] = ConstrNode(rcount, count, splitloc)
for constr in parts:
self._modify_construction_count(constr, count)
elif ptype == 'rbranch':
construction = compound
for p in range(len(parts)):
rcount, count = self._remove(construction)
prefix = parts[p]
if p == len(parts) - 1:
self._analyses[construction] = ConstrNode(rcount, 0,
0)
self._modify_construction_count(construction, count)
else:
suffix = self._join_constructions(parts[p + 1:])
self._analyses[construction] = ConstrNode(rcount, count,
len(prefix))
self._modify_construction_count(prefix, count)
self._modify_construction_count(suffix, count)
construction = suffix
else:
raise MorfessorException("Unknown parse type '%s'" % ptype)
def _update_annotation_choices(self):
"""Update the selection of alternative analyses in annotations.
For semi-supervised models, select the most likely alternative
analyses included in the annotations of the compounds.
"""
if not self._supervised:
return
# Collect constructions from the most probable segmentations
# and add missing compounds also to the unannotated data
constructions = collections.Counter()
for compound, alternatives in self.annotations.items():
if compound not in self._analyses:
self._add_compound(compound, 1)
analysis, cost = self._best_analysis(alternatives)
for m in analysis:
constructions[m] += self._analyses[compound].rcount
# Apply the selected constructions in annotated corpus coding
self._annot_coding.set_constructions(constructions)
for m, f in constructions.items():
count = 0
if m in self._analyses and not self._analyses[m].splitloc:
count = self._analyses[m].count
self._annot_coding.set_count(m, count)
def _best_analysis(self, choices):
"""Select the best analysis out of the given choices."""
bestcost = None
bestanalysis = None
for analysis in choices:
cost = 0.0
for m in analysis:
if m in self._analyses and not self._analyses[m].splitloc:
cost += (math.log(self._corpus_coding.tokens) -
math.log(self._analyses[m].count))
else:
cost -= self.penalty # penalty is negative
if bestcost is None or cost < bestcost:
bestcost = cost
bestanalysis = analysis
return bestanalysis, bestcost
def _force_split(self, compound):
"""Return forced split of the compound."""
if len(self.forcesplit_list) == 0:
return [compound]
clen = len(compound)
j = 0
parts = []
for i in range(0, clen):
if compound[i] in self.forcesplit_list:
if len(compound[j:i]) > 0:
parts.append(compound[j:i])
parts.append(compound[i:i + 1])
j = i + 1
if j < clen:
parts.append(compound[j:])
return [p for p in parts if len(p) > 0]
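# For example, with the default force-split character '-' a hyphenated
# compound is pre-split around the hyphen (illustrative doctest):
#
#     >>> BaselineModel(forcesplit_list=['-'])._force_split('ab-cd')
#     ['ab', '-', 'cd']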
def _test_skip(self, construction):
"""Return true if construction should be skipped."""
if construction in self._counter:
t = self._counter[construction]
if random.random() > 1.0 / max(1, t):
return True
self._counter[construction] += 1
return False
def _viterbi_optimize(self, compound, addcount=0, maxlen=30):
"""Optimize segmentation of the compound using the Viterbi algorithm.
Arguments:
compound: compound to optimize
addcount: constant for additive smoothing of Viterbi probs
maxlen: maximum length for a construction
Returns list of segments.
"""
clen = len(compound)
if clen == 1: # Single atom
return [compound]
if self._use_skips and self._test_skip(compound):
return self.segment(compound)
# Collect forced subsegments
parts = self._force_split(compound)
# Use Viterbi algorithm to optimize the subsegments
constructions = []
for part in parts:
constructions += self.viterbi_segment(part, addcount=addcount,
maxlen=maxlen)[0]
self._set_compound_analysis(compound, constructions, ptype='flat')
return constructions
def _recursive_optimize(self, compound):
"""Optimize segmentation of the compound using recursive splitting.
Returns list of segments.
"""
if len(compound) == 1: # Single atom
return [compound]
if self._use_skips and self._test_skip(compound):
return self.segment(compound)
# Collect forced subsegments
parts = self._force_split(compound)
if len(parts) == 1:
# just one part
return self._recursive_split(compound)
self._set_compound_analysis(compound, parts)
# Use recursive algorithm to optimize the subsegments
constructions = []
for part in parts:
constructions += self._recursive_split(part)
return constructions
def _recursive_split(self, construction):
"""Optimize segmentation of the construction by recursive splitting.
Returns list of segments.
"""
if len(construction) == 1: # Single atom
return [construction]
if self._use_skips and self._test_skip(construction):
return self.segment(construction)
rcount, count = self._remove(construction)
# Check all binary splits and no split
self._modify_construction_count(construction, count)
mincost = self.get_cost()
self._modify_construction_count(construction, -count)
splitloc = 0
for i in range(1, len(construction)):
if (self.nosplit_re and
self.nosplit_re.match(construction[(i - 1):(i + 1)])):
continue
prefix = construction[:i]
suffix = construction[i:]
self._modify_construction_count(prefix, count)
self._modify_construction_count(suffix, count)
cost = self.get_cost()
self._modify_construction_count(prefix, -count)
self._modify_construction_count(suffix, -count)
if cost <= mincost:
mincost = cost
splitloc = i
if splitloc:
# Virtual construction
self._analyses[construction] = ConstrNode(rcount, count,
splitloc)
prefix = construction[:splitloc]
suffix = construction[splitloc:]
self._modify_construction_count(prefix, count)
self._modify_construction_count(suffix, count)
lp = self._recursive_split(prefix)
if suffix != prefix:
return lp + self._recursive_split(suffix)
else:
return lp + lp
else:
# Real construction
self._analyses[construction] = ConstrNode(rcount, 0, tuple())
self._modify_construction_count(construction, count)
return [construction]
def _modify_construction_count(self, construction, dcount):
"""Modify the count of construction by dcount.
For virtual constructions, recurses to child nodes in the
tree. For real constructions, adds/removes construction
to/from the lexicon whenever necessary.
"""
if construction in self._analyses:
rcount, count, splitloc = self._analyses[construction]
else:
rcount, count, splitloc = 0, 0, 0
newcount = count + dcount
if newcount == 0:
del self._analyses[construction]
else:
self._analyses[construction] = ConstrNode(rcount, newcount,
splitloc)
if splitloc:
# Virtual construction
children = self._splitloc_to_segmentation(construction, splitloc)
for child in children:
self._modify_construction_count(child, dcount)
else:
# Real construction
self._corpus_coding.update_count(construction, count, newcount)
if self._supervised:
self._annot_coding.update_count(construction, count, newcount)
if count == 0 and newcount > 0:
self._lexicon_coding.add(construction)
elif count > 0 and newcount == 0:
self._lexicon_coding.remove(construction)
def _epoch_update(self, epoch_num):
"""Do model updates that are necessary between training epochs.
The argument is the number of training epochs finished.
In practice, this does two things:
- If random skipping is in use, reset construction counters.
- If semi-supervised learning is in use and there are alternative
analyses in the annotated data, select the annotations that are
most likely given the model parameters. If not hand-set, update
the weight of the annotated corpus.
This method should also be run prior to training (with the
epoch number argument as 0).
"""
forced_epochs = 0
if self._corpus_weight_updater.update(self, epoch_num):
forced_epochs += 2
if self._use_skips:
self._counter = collections.Counter()
if self._supervised:
self._update_annotation_choices()
self._annot_coding.update_weight()
return forced_epochs
@staticmethod
def segmentation_to_splitloc(constructions):
"""Return a list of split locations for a segmented compound."""
splitloc = []
i = 0
for c in constructions:
i += len(c)
splitloc.append(i)
return tuple(splitloc[:-1])
@staticmethod
def _splitloc_to_segmentation(compound, splitloc):
"""Return segmentation corresponding to the list of split locations."""
if isinstance(splitloc, numbers.Number):
return [compound[:splitloc], compound[splitloc:]]
parts = []
startpos = 0
endpos = 0
for i in range(len(splitloc)):
endpos = splitloc[i]
parts.append(compound[startpos:endpos])
startpos = endpos
parts.append(compound[endpos:])
return parts
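# The two conversions above are inverses of each other; for example:
#
#     >>> BaselineModel.segmentation_to_splitloc(['un', 'do', 'ing'])
#     (2, 4)
#     >>> BaselineModel._splitloc_to_segmentation('undoing', (2, 4))
#     ['un', 'do', 'ing']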
@staticmethod
def _join_constructions(constructions):
"""Append the constructions after each other by addition. Works for
both lists and strings """
result = type(constructions[0])()
for c in constructions:
result += c
return result
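# For example, joining works for atoms stored as characters of a string
# as well as for atoms stored as items of a tuple:
#
#     >>> BaselineModel._join_constructions(['un', 'do', 'ing'])
#     'undoing'
#     >>> BaselineModel._join_constructions([('New',), ('York', 'City')])
#     ('New', 'York', 'City')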
def get_compounds(self):
"""Return the compound types stored by the model."""
self._check_segment_only()
return [w for w, node in self._analyses.items()
if node.rcount > 0]
def get_constructions(self):
"""Return a list of the present constructions and their counts."""
return sorted((c, node.count) for c, node in self._analyses.items()
if not node.splitloc)
def get_cost(self):
"""Return current model encoding cost."""
cost = self._corpus_coding.get_cost() + self._lexicon_coding.get_cost()
if self._supervised:
return cost + self._annot_coding.get_cost()
else:
return cost
def get_segmentations(self):
"""Retrieve segmentations for all compounds encoded by the model."""
self._check_segment_only()
for w in sorted(self._analyses.keys()):
c = self._analyses[w].rcount
if c > 0:
yield c, w, self.segment(w)
def load_data(self, data, freqthreshold=1, count_modifier=None,
init_rand_split=None):
"""Load data to initialize the model for batch training.
Arguments:
data: iterator of (count, compound_atoms) tuples
freqthreshold: discard compounds that occur fewer than
the given number of times in the corpus (default 1)
count_modifier: function for adjusting the counts of each
compound
init_rand_split: if given, randomly split each compound with
init_rand_split as the probability for each
split position
Adds the compounds in the corpus to the model lexicon. Returns
the total cost.
"""
self._check_segment_only()
totalcount = collections.Counter()
for count, atoms in data:
if len(atoms) > 0:
totalcount[atoms] += count
for atoms, count in totalcount.items():
if count < freqthreshold:
continue
if count_modifier is not None:
self._add_compound(atoms, count_modifier(count))
else:
self._add_compound(atoms, count)
if init_rand_split is not None and init_rand_split > 0:
parts = self._random_split(atoms, init_rand_split)
self._set_compound_analysis(atoms, parts)
return self.get_cost()
def load_segmentations(self, segmentations):
"""Load model from existing segmentations.
The argument should be an iterator providing a count, a
compound, and its segmentation.
"""
self._check_segment_only()
for count, compound, segmentation in segmentations:
self._add_compound(compound, count)
self._set_compound_analysis(compound, segmentation)
def set_annotations(self, annotations, annotatedcorpusweight=None):
"""Prepare model for semi-supervised learning with given
annotations.
"""
self._check_segment_only()
self._supervised = True
self.annotations = annotations
self._annot_coding = AnnotatedCorpusEncoding(self._corpus_coding,
weight=
annotatedcorpusweight)
self._annot_coding.boundaries = len(self.annotations)
def segment(self, compound):
"""Segment the compound by looking it up in the model analyses.
Raises KeyError if compound is not present in the training
data. For segmenting new words, use viterbi_segment(compound).
"""
self._check_segment_only()
rcount, count, splitloc = self._analyses[compound]
constructions = []
if splitloc:
for child in self._splitloc_to_segmentation(compound,
splitloc):
constructions += self.segment(child)
else:
constructions.append(compound)
return constructions
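# A hypothetical example of the difference between the two segmentation
# methods (actual outputs depend on the trained model):
#
#     >>> model.segment('words')               # seen in training data
#     ['word', 's']
#     >>> model.viterbi_segment('swords')[0]   # works for unseen words
#     ['sword', 's']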
def train_batch(self, algorithm='recursive', algorithm_params=(),
finish_threshold=0.005, max_epochs=None):
"""Train the model in batch fashion.
The model is trained with the data already loaded into the model (by
using an existing model or calling one of the load_ methods).
In each iteration (epoch) all compounds in the training data are
optimized once, in a random order. If applicable, corpus weight,
annotation cost, and random split counters are recalculated after
each iteration.
Arguments:
algorithm: string in ('recursive', 'viterbi') that indicates
the splitting algorithm used.
algorithm_params: parameters passed to the splitting algorithm.
finish_threshold: the stopping threshold. Training stops when
the improvement of the last iteration is
smaller than finish_threshold * #boundaries
max_epochs: maximum number of epochs to train
"""
epochs = 0
forced_epochs = max(1, self._epoch_update(epochs))
newcost = self.get_cost()
compounds = list(self.get_compounds())
_logger.info("Compounds in training data: %s types / %s tokens",
len(compounds), self._corpus_coding.boundaries)
_logger.info("Starting batch training")
_logger.info("Epochs: %s\tCost: %s", epochs, newcost)
while True:
# One epoch
random.shuffle(compounds)
for w in _progress(compounds):
if algorithm == 'recursive':
segments = self._recursive_optimize(w, *algorithm_params)
elif algorithm == 'viterbi':
segments = self._viterbi_optimize(w, *algorithm_params)
else:
raise MorfessorException("unknown algorithm '%s'" %
algorithm)
_logger.debug("#%s -> %s", w, _constructions_to_str(segments))
epochs += 1
_logger.debug("Cost before epoch update: %s", self.get_cost())
forced_epochs = max(forced_epochs, self._epoch_update(epochs))
oldcost = newcost
newcost = self.get_cost()
_logger.info("Epochs: %s\tCost: %s", epochs, newcost)
if (forced_epochs == 0 and
newcost >= oldcost - finish_threshold *
self._corpus_coding.boundaries):
break
if forced_epochs > 0:
forced_epochs -= 1
if max_epochs is not None and epochs >= max_epochs:
_logger.info("Max number of epochs reached, stop training")
break
_logger.info("Done.")
return epochs, newcost
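# A minimal batch-training sketch (the corpus file name is hypothetical;
# MorfessorIO is defined in morfessor.io):
#
#     from morfessor.io import MorfessorIO
#     model = BaselineModel()
#     model.load_data(MorfessorIO().read_corpus_files(['corpus.txt']))
#     epochs, cost = model.train_batch(max_epochs=10)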
def train_online(self, data, count_modifier=None, epoch_interval=10000,
algorithm='recursive', algorithm_params=(),
init_rand_split=None, max_epochs=None):
"""Train the model in online fashion.
The model is trained with the data provided in the data argument.
For example, the data could come from a generator linked to standard
input, for live monitoring of the splitting.
All compounds from data are only optimized once. After online
training, batch training could be used for further optimization.
Epochs are defined as a fixed number of compounds. After each epoch
(like in batch training), the annotation cost and random split counters
are recalculated if applicable.
Arguments:
data: iterator of (_, compound_atoms) tuples. The first
argument is ignored, as every occurrence of the
compound is taken with count 1
count_modifier: function for adjusting the counts of each
compound
epoch_interval: number of compounds to process before starting
a new epoch
algorithm: string in ('recursive', 'viterbi') that indicates
the splitting algorithm used.
algorithm_params: parameters passed to the splitting algorithm.
init_rand_split: probability of randomly splitting a compound
at any position when initializing the model. None
or 0 means no random splitting.
max_epochs: maximum number of epochs to train
"""
self._check_segment_only()
if count_modifier is not None:
counts = {}
_logger.info("Starting online training")
epochs = 0
i = 0
more_tokens = True
while more_tokens:
self._epoch_update(epochs)
newcost = self.get_cost()
_logger.info("Tokens processed: %s\tCost: %s", i, newcost)
for _ in _progress(range(epoch_interval)):
try:
_, w = next(data)
except StopIteration:
more_tokens = False
break
if len(w) == 0:
# Newline in corpus
continue
if count_modifier is not None:
if w not in counts:
c = 0
counts[w] = 1
addc = 1
else:
c = counts[w]
counts[w] = c + 1
addc = count_modifier(c + 1) - count_modifier(c)
if addc > 0:
self._add_compound(w, addc)
else:
self._add_compound(w, 1)
if init_rand_split is not None and init_rand_split > 0:
parts = self._random_split(w, init_rand_split)
self._set_compound_analysis(w, parts)
if algorithm == 'recursive':
segments = self._recursive_optimize(w, *algorithm_params)
elif algorithm == 'viterbi':
segments = self._viterbi_optimize(w, *algorithm_params)
else:
raise MorfessorException("unknown algorithm '%s'" %
algorithm)
_logger.debug("#%s: %s -> %s", i, w, _constructions_to_str(segments))
i += 1
epochs += 1
if max_epochs is not None and epochs >= max_epochs:
_logger.info("Max number of epochs reached, stop training")
break
self._epoch_update(epochs)
newcost = self.get_cost()
_logger.info("Tokens processed: %s\tCost: %s", i, newcost)
return epochs, newcost
def viterbi_segment(self, compound, addcount=1.0, maxlen=30):
"""Find optimal segmentation using the Viterbi algorithm.
Arguments:
compound: compound to be segmented
addcount: constant for additive smoothing (0 = no smoothing)
maxlen: maximum length for the constructions
If additive smoothing is applied, new complex construction types can
be selected during the search. Without smoothing, only new
single-atom constructions can be selected.
Returns the most probable segmentation and its log-probability.
"""
clen = len(compound)
grid = [(0.0, None)]
if self._corpus_coding.tokens + self._corpus_coding.boundaries + \
addcount > 0:
logtokens = math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries + addcount)
else:
logtokens = 0
badlikelihood = clen * logtokens + 1.0
# Viterbi main loop
for t in range(1, clen + 1):
# Select the best path to current node.
# Note that we can come from any node in history.
bestpath = None
bestcost = None
if self.nosplit_re and t < clen and \
self.nosplit_re.match(compound[(t-1):(t+1)]):
grid.append((clen*badlikelihood, t-1))
continue
for pt in range(max(0, t - maxlen), t):
if grid[pt][0] is None:
continue
cost = grid[pt][0]
construction = compound[pt:t]
if (construction in self._analyses and
not self._analyses[construction].splitloc):
if self._analyses[construction].count <= 0:
raise MorfessorException(
"Construction count of '%s' is %s" %
(construction,
self._analyses[construction].count))
cost += (logtokens -
math.log(self._analyses[construction].count +
addcount))
elif addcount > 0:
if self._corpus_coding.tokens == 0:
cost += (addcount * math.log(addcount) +
self._lexicon_coding.get_codelength(
construction)
/ self._corpus_coding.weight)
else:
cost += (logtokens - math.log(addcount) +
(((self._lexicon_coding.boundaries +
addcount) *
math.log(self._lexicon_coding.boundaries
+ addcount))
- (self._lexicon_coding.boundaries
* math.log(self._lexicon_coding.boundaries))
+ self._lexicon_coding.get_codelength(
construction))
/ self._corpus_coding.weight)
elif len(construction) == 1:
cost += badlikelihood
elif self.nosplit_re:
# Some splits are forbidden, so longer unknown
# constructions have to be allowed
cost += len(construction) * badlikelihood
else:
continue
if bestcost is None or cost < bestcost:
bestcost = cost
bestpath = pt
grid.append((bestcost, bestpath))
constructions = []
cost, path = grid[-1]
lt = clen + 1
while path is not None:
t = path
constructions.append(compound[t:lt])
path = grid[t][1]
lt = t
constructions.reverse()
# Add boundary cost
cost += (math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries) -
math.log(self._corpus_coding.boundaries))
return constructions, cost
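# Example of segmenting a word that was not in the training data
# (the output shown is hypothetical):
#
#     >>> segments, cost = model.viterbi_segment('resegmented')
#     >>> segments
#     ['re', 'segment', 'ed']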
def forward_logprob(self, compound):
"""Find log-probability of a compound using the forward algorithm.
Arguments:
compound: compound to process
Returns the (negative) log-probability of the compound. If the
probability is zero, returns a number that is larger than the
value defined by the penalty attribute of the model object.
"""
clen = len(compound)
grid = [0.0]
if self._corpus_coding.tokens + self._corpus_coding.boundaries > 0:
logtokens = math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries)
else:
logtokens = 0
# Forward main loop
for t in range(1, clen + 1):
# Sum probabilities from all paths to the current node.
# Note that we can come from any node in history.
psum = 0.0
for pt in range(0, t):
cost = grid[pt]
construction = compound[pt:t]
if (construction in self._analyses and
not self._analyses[construction].splitloc):
if self._analyses[construction].count <= 0:
raise MorfessorException(
"Construction count of '%s' is %s" %
(construction,
self._analyses[construction].count))
cost += (logtokens -
math.log(self._analyses[construction].count))
else:
continue
psum += math.exp(-cost)
if psum > 0:
grid.append(-math.log(psum))
else:
grid.append(-self.penalty)
cost = grid[-1]
# Add boundary cost
cost += (math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries) -
math.log(self._corpus_coding.boundaries))
return cost
def viterbi_nbest(self, compound, n, addcount=1.0, maxlen=30):
"""Find top-n optimal segmentations using the Viterbi algorithm.
Arguments:
compound: compound to be segmented
n: how many segmentations to return
addcount: constant for additive smoothing (0 = no smoothing)
maxlen: maximum length for the constructions
If additive smoothing is applied, new complex construction types can
be selected during the search. Without smoothing, only new
single-atom constructions can be selected.
Returns the n most probable segmentations and their
log-probabilities.
"""
clen = len(compound)
grid = [[(0.0, None, None)]]
if self._corpus_coding.tokens + self._corpus_coding.boundaries + \
addcount > 0:
logtokens = math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries + addcount)
else:
logtokens = 0
badlikelihood = clen * logtokens + 1.0
# Viterbi main loop
for t in range(1, clen + 1):
# Select the best path to current node.
# Note that we can come from any node in history.
bestn = []
if self.nosplit_re and t < clen and \
self.nosplit_re.match(compound[(t-1):(t+1)]):
grid.append([(-clen*badlikelihood, t-1, -1)])
continue
for pt in range(max(0, t - maxlen), t):
for k in range(len(grid[pt])):
if grid[pt][k][0] is None:
continue
cost = grid[pt][k][0]
construction = compound[pt:t]
if (construction in self._analyses and
not self._analyses[construction].splitloc):
if self._analyses[construction].count <= 0:
raise MorfessorException(
"Construction count of '%s' is %s" %
(construction,
self._analyses[construction].count))
cost -= (logtokens -
math.log(self._analyses[construction].count +
addcount))
elif addcount > 0:
if self._corpus_coding.tokens == 0:
cost -= (addcount * math.log(addcount) +
self._lexicon_coding.get_codelength(
construction)
/ self._corpus_coding.weight)
else:
cost -= (logtokens - math.log(addcount) +
(((self._lexicon_coding.boundaries +
addcount) *
math.log(self._lexicon_coding.boundaries
+ addcount))
- (self._lexicon_coding.boundaries
* math.log(self._lexicon_coding.
boundaries))
+ self._lexicon_coding.get_codelength(
construction))
/ self._corpus_coding.weight)
elif len(construction) == 1:
cost -= badlikelihood
elif self.nosplit_re:
# Some splits are forbidden, so longer unknown
# constructions have to be allowed
cost -= len(construction) * badlikelihood
else:
continue
if len(bestn) < n:
heapq.heappush(bestn, (cost, pt, k))
else:
heapq.heappushpop(bestn, (cost, pt, k))
grid.append(bestn)
results = []
for k in range(len(grid[-1])):
constructions = []
cost, path, ki = grid[-1][k]
lt = clen + 1
while path is not None:
t = path
constructions.append(compound[t:lt])
path = grid[t][ki][1]
ki = grid[t][ki][2]
lt = t
constructions.reverse()
# Add boundary cost
cost -= (math.log(self._corpus_coding.tokens +
self._corpus_coding.boundaries) -
math.log(self._corpus_coding.boundaries))
results.append((-cost, constructions))
return [(constr, cost) for cost, constr in sorted(results)]
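# Example of retrieving the two best analyses with their costs
# (the output shown is hypothetical):
#
#     >>> model.viterbi_nbest('uncarved', 2)
#     [(['un', 'carve', 'd'], 12.3), (['un', 'carved'], 13.1)]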
def get_corpus_coding_weight(self):
return self._corpus_coding.weight
def set_corpus_coding_weight(self, weight):
self._check_segment_only()
self._corpus_coding.weight = weight
def make_segment_only(self):
"""Reduce the size of this model by removing all non-morphs from the
analyses. After calling this method it is not possible anymore to call
any other method that would change the state of the model. Anyway
doing so would throw an exception.
"""
self._segment_only = True
self._analyses = {k: v for (k, v) in self._analyses.items()
if not v.splitloc}
def clear_segmentation(self):
for compound in list(self.get_compounds()):
self._set_compound_analysis(compound, [compound])
class CorpusWeight(object):
@classmethod
def move_direction(cls, model, direction, epoch):
if direction != 0:
weight = model.get_corpus_coding_weight()
if direction > 0:
weight *= 1 + 2.0 / epoch
else:
weight *= 1.0 / (1 + 2.0 / epoch)
model.set_corpus_coding_weight(weight)
_logger.info("Corpus weight set to %s", weight)
return True
return False
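# A worked example of the step size: at epoch 2 a positive direction
# multiplies the weight by 1 + 2.0/2 = 2.0 and a negative direction
# divides by the same factor; the factor shrinks as training progresses
# (2.0 at epoch 2, 1.5 at epoch 4, 1.2 at epoch 10).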
class FixedCorpusWeight(CorpusWeight):
def __init__(self, weight):
self.weight = weight
def update(self, model, _):
model.set_corpus_coding_weight(self.weight)
return False
class AnnotationCorpusWeight(CorpusWeight):
"""Class for using development annotations to update the corpus weight
during batch training
"""
def __init__(self, devel_set, threshold=0.01):
self.data = devel_set
self.threshold = threshold
def update(self, model, epoch):
"""Tune model corpus weight based on the precision and
recall of the development data, trying to keep them equal"""
if epoch < 1:
return False
tmp = self.data.items()
wlist, annotations = zip(*tmp)
segments = [model.viterbi_segment(w)[0] for w in wlist]
d = self._estimate_segmentation_dir(segments, annotations)
return self.move_direction(model, d, epoch)
@classmethod
def _boundary_recall(cls, prediction, reference):
"""Calculate average boundary recall for given segmentations."""
rec_total = 0
rec_sum = 0.0
for pre_list, ref_list in zip(prediction, reference):
best = -1
for ref in ref_list:
# list of internal boundary positions
ref_b = set(BaselineModel.segmentation_to_splitloc(ref))
if len(ref_b) == 0:
best = 1.0
break
for pre in pre_list:
pre_b = set(BaselineModel.segmentation_to_splitloc(pre))
r = len(ref_b.intersection(pre_b)) / float(len(ref_b))
if r > best:
best = r
if best >= 0:
rec_sum += best
rec_total += 1
return rec_sum, rec_total
@classmethod
def _bpr_evaluation(cls, prediction, reference):
"""Return boundary precision, recall, and F-score for segmentations."""
rec_s, rec_t = cls._boundary_recall(prediction, reference)
pre_s, pre_t = cls._boundary_recall(reference, prediction)
rec = rec_s / rec_t
pre = pre_s / pre_t
f = 2.0 * pre * rec / (pre + rec)
return pre, rec, f
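# A worked example: if the prediction is [['seg', 'ment']] and the
# reference alternatives are [['seg', 'ment'], ['segm', 'ent']], the
# predicted boundary set {3} matches the first alternative exactly,
# giving recall 1.0 (and, via the swapped call above, precision 1.0).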
def _estimate_segmentation_dir(self, segments, annotations):
"""Estimate if the given compounds are under- or oversegmented.
The decision is based on the difference between boundary precision
and recall values for the given sample of segmented data.
Arguments:
segments: list of predicted segmentations
annotations: list of reference segmentations
Return 1 in the case of oversegmentation, -1 in the case of
undersegmentation, and 0 if no changes are required.
"""
pre, rec, f = self._bpr_evaluation([[x] for x in segments], annotations)
_logger.info("Boundary evaluation: precision %.4f; recall %.4f", pre, rec)
if abs(pre - rec) < self.threshold:
return 0
elif rec > pre:
return 1
else:
return -1
class MorphLengthCorpusWeight(CorpusWeight):
def __init__(self, morph_length, threshold=0.01):
self.morph_length = morph_length
self.threshold = threshold
def update(self, model, epoch):
if epoch < 1:
return False
cur_length = self.calc_morph_length(model)
_logger.info("Current morph-length: %s", cur_length)
if (abs(self.morph_length - cur_length) / self.morph_length >
self.threshold):
d = abs(self.morph_length - cur_length) / (self.morph_length
- cur_length)
return self.move_direction(model, d, epoch)
return False
@classmethod
def calc_morph_length(cls, model):
total_constructions = 0
total_atoms = 0
for compound in model.get_compounds():
constructions = model.segment(compound)
for construction in constructions:
total_constructions += 1
total_atoms += len(construction)
if total_constructions > 0:
return float(total_atoms) / total_constructions
else:
return 0.0
class NumMorphCorpusWeight(CorpusWeight):
def __init__(self, num_morph_types, threshold=0.01):
self.num_morph_types = num_morph_types
self.threshold = threshold
def update(self, model, epoch):
if epoch < 1:
return False
cur_morph_types = model._lexicon_coding.boundaries
_logger.info("Number of morph types: %s", cur_morph_types)
if (abs(self.num_morph_types - cur_morph_types) / self.num_morph_types
> self.threshold):
d = (abs(self.num_morph_types - cur_morph_types) /
(self.num_morph_types - cur_morph_types))
return self.move_direction(model, d, epoch)
return False
class Encoding(object):
"""Base class for calculating the entropy (encoding length) of a corpus
or lexicon.
Commonly subclassed to redefine specific methods.
"""
def __init__(self, weight=1.0):
"""Initizalize class
Arguments:
weight: weight used for this encoding
"""
self.logtokensum = 0.0
self.tokens = 0
self.boundaries = 0
self.weight = weight
# constant used for speeding up logfactorial calculations with Stirling's
# approximation
_log2pi = math.log(2 * math.pi)
@property
def types(self):
"""Define number of types as 0. types is made a property method to
ensure easy redefinition in subclasses
"""
return 0
@classmethod
def _logfactorial(cls, n):
"""Calculate logarithm of n!.
For large n (n > 20), use Stirling's approximation.
"""
if n < 2:
return 0.0
if n < 20:
return math.log(math.factorial(n))
logn = math.log(n)
return n * logn - n + 0.5 * (logn + cls._log2pi)
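# A quick accuracy check of the approximation (digits truncated):
#
#     >>> import math
#     >>> math.log(math.factorial(25))   # exact
#     58.0036...
#     >>> Encoding._logfactorial(25)     # Stirling's approximation
#     58.0002...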
def frequency_distribution_cost(self):
"""Calculate -log[(u - 1)! (v - u)! / (v - 1)!]
v is the number of tokens+boundaries and u the number of types
"""
if self.types < 2:
return 0.0
tokens = self.tokens + self.boundaries
return (self._logfactorial(tokens - 1) -
self._logfactorial(self.types - 1) -
self._logfactorial(tokens - self.types))
def permutations_cost(self):
"""The permutations cost for the encoding."""
return -self._logfactorial(self.boundaries)
def update_count(self, construction, old_count, new_count):
"""Update the counts in the encoding."""
self.tokens += new_count - old_count
if old_count > 1:
self.logtokensum -= old_count * math.log(old_count)
if new_count > 1:
self.logtokensum += new_count * math.log(new_count)
def get_cost(self):
"""Calculate the cost for encoding the corpus/lexicon"""
if self.boundaries == 0:
return 0.0
n = self.tokens + self.boundaries
return ((n * math.log(n)
- self.boundaries * math.log(self.boundaries)
- self.logtokensum
+ self.permutations_cost()) * self.weight
+ self.frequency_distribution_cost())
class CorpusEncoding(Encoding):
"""Encoding the corpus class
The basic difference to a normal encoding is that the number of types is
not stored directly but fetched from the lexicon encoding. Also does the
cost function not contain any permutation cost.
"""
def __init__(self, lexicon_encoding, weight=1.0):
super(CorpusEncoding, self).__init__(weight)
self.lexicon_encoding = lexicon_encoding
@property
def types(self):
"""Return the number of types of the corpus, which is the same as the
number of boundaries in the lexicon + 1
"""
return self.lexicon_encoding.boundaries + 1
def frequency_distribution_cost(self):
"""Calculate -log[(M - 1)! (N - M)! / (N - 1)!] for M types and N
tokens.
"""
if self.types < 2:
return 0.0
tokens = self.tokens
return (self._logfactorial(tokens - 1) -
self._logfactorial(self.types - 2) -
self._logfactorial(tokens - self.types + 1))
def get_cost(self):
"""Override for the Encoding get_cost function. A corpus does not
have a permutation cost
"""
if self.boundaries == 0:
return 0.0
n = self.tokens + self.boundaries
return ((n * math.log(n)
- self.boundaries * math.log(self.boundaries)
- self.logtokensum) * self.weight
+ self.frequency_distribution_cost())
class AnnotatedCorpusEncoding(Encoding):
"""Encoding the cost of an Annotated Corpus.
In this encoding constructions that are missing are penalized.
"""
def __init__(self, corpus_coding, weight=None, penalty=-9999.9):
"""
Initialize encoding with appropriate meta data
Arguments:
corpus_coding: CorpusEncoding instance used for retrieving the
number of tokens and boundaries in the corpus
weight: The weight of this encoding. If the weight is None,
it is updated automatically to be in balance with the
corpus
penalty: log penalty used for missing constructions
"""
super(AnnotatedCorpusEncoding, self).__init__()
self.do_update_weight = True
self.weight = 1.0
if weight is not None:
self.do_update_weight = False
self.weight = weight
self.corpus_coding = corpus_coding
self.penalty = penalty
self.constructions = collections.Counter()
def set_constructions(self, constructions):
"""Method for re-initializing the constructions. The count of the
constructions must still be set with a call to set_count
"""
self.constructions = constructions
self.tokens = sum(constructions.values())
self.logtokensum = 0.0
def set_count(self, construction, count):
"""Set an initial count for each construction. Missing constructions
are penalized
"""
annot_count = self.constructions[construction]
if count > 0:
self.logtokensum += annot_count * math.log(count)
else:
self.logtokensum += annot_count * self.penalty
def update_count(self, construction, old_count, new_count):
"""Update the counts in the Encoding, setting (or removing) a penalty
for missing constructions
"""
if construction in self.constructions:
annot_count = self.constructions[construction]
if old_count > 0:
self.logtokensum -= annot_count * math.log(old_count)
else:
self.logtokensum -= annot_count * self.penalty
if new_count > 0:
self.logtokensum += annot_count * math.log(new_count)
else:
self.logtokensum += annot_count * self.penalty
def update_weight(self):
"""Update the weight of the Encoding by taking the ratio of the
corpus boundaries and annotated boundaries
"""
if not self.do_update_weight:
return
old = self.weight
self.weight = (self.corpus_coding.weight *
float(self.corpus_coding.boundaries) / self.boundaries)
if self.weight != old:
_logger.info("Corpus weight of annotated data set to %s", self.weight)
def get_cost(self):
"""Return the cost of the Annotation Corpus."""
if self.boundaries == 0:
return 0.0
n = self.tokens + self.boundaries
return ((n * math.log(self.corpus_coding.tokens +
self.corpus_coding.boundaries)
- self.boundaries * math.log(self.corpus_coding.boundaries)
- self.logtokensum) * self.weight)
class LexiconEncoding(Encoding):
"""Class for calculating the encoding cost for the Lexicon"""
def __init__(self):
"""Initialize Lexcion Encoding"""
super(LexiconEncoding, self).__init__()
self.atoms = collections.Counter()
@property
def types(self):
"""Return the number of different atoms in the lexicon + 1 for the
compound-end-token
"""
return len(self.atoms) + 1
def add(self, construction):
"""Add a construction to the lexicon, updating automatically the
count for its atoms
"""
self.boundaries += 1
for atom in construction:
c = self.atoms[atom]
self.atoms[atom] = c + 1
self.update_count(atom, c, c + 1)
def remove(self, construction):
"""Remove construction from the lexicon, updating automatically the
count for its atoms
"""
self.boundaries -= 1
for atom in construction:
c = self.atoms[atom]
self.atoms[atom] = c - 1
self.update_count(atom, c, c - 1)
def get_codelength(self, construction):
"""Return an approximate codelength for new construction."""
l = len(construction) + 1
cost = l * math.log(self.tokens + l)
cost -= math.log(self.boundaries + 1)
for atom in construction:
if atom in self.atoms:
c = self.atoms[atom]
else:
c = 1
cost -= math.log(c)
return cost
================================================
FILE: morfessor/cmd.py
================================================
# -*- coding: utf-8 -*-
import locale
import logging
import math
import random
import os.path
import sys
import time
import string
from . import get_version
from . import utils
from .baseline import BaselineModel, AnnotationCorpusWeight, \
MorphLengthCorpusWeight, NumMorphCorpusWeight, FixedCorpusWeight
from .exception import ArgumentException
from .io import MorfessorIO
from .evaluation import MorfessorEvaluation, EvaluationConfig, \
WilcoxonSignedRank, FORMAT_STRINGS
PY3 = sys.version_info[0] == 3
# _str is used to convert command line arguments to the right type (str for PY3, unicode for PY2)
if PY3:
_str = str
else:
_str = lambda x: unicode(x, encoding=locale.getpreferredencoding())
_logger = logging.getLogger(__name__)
LRU_MAX_SIZE = 1000000
def get_default_argparser():
import argparse
parser = argparse.ArgumentParser(
prog='morfessor.py',
description="""
Morfessor %s
Copyright (c) 2012-2019, Sami Virpioja, Peter Smit, and Stig-Arne Grönroos.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Command-line arguments:
""" % get_version(),
epilog="""
Simple usage examples (training and testing):
%(prog)s -t training_corpus.txt -s model.pickled
%(prog)s -l model.pickled -T test_corpus.txt -o test_corpus.segmented
Interactive use (read corpus from user):
%(prog)s -m online -v 2 -t -
""",
formatter_class=argparse.RawDescriptionHelpFormatter,
add_help=False)
# Options for input data files
add_arg = parser.add_argument_group('input data files').add_argument
add_arg('-l', '--load', dest="loadfile", default=None, metavar='<file>',
help="load existing model from file (pickled model object)")
add_arg('-L', '--load-segmentation', dest="loadsegfile", default=None,
metavar='<file>',
help="load existing model from segmentation "
"file (Morfessor 1.0 format)")
add_arg('-t', '--traindata', dest='trainfiles', action='append',
default=[], metavar='<file>',
help="input corpus file(s) for training (text or bz2/gzipped text;"
" use '-' for standard input; add several times in order to "
"append multiple files)")
add_arg('-T', '--testdata', dest='testfiles', action='append',
default=[], metavar='<file>',
help="input corpus file(s) to analyze (text or bz2/gzipped text; "
"use '-' for standard input; add several times in order to "
"append multiple files)")
# Options for output data files
add_arg = parser.add_argument_group('output data files').add_argument
add_arg('-o', '--output', dest="outfile", default='-', metavar='<file>',
help="output file for test data results (for standard output, "
"use '-'; default '%(default)s')")
add_arg('-s', '--save', dest="savefile", default=None, metavar='<file>',
help="save final model to file (pickled model object)")
add_arg('-S', '--save-segmentation', dest="savesegfile", default=None,
metavar='<file>',
help="save model segmentations to file (Morfessor 1.0 format)")
add_arg('--save-reduced', dest="savereduced", default=None,
metavar='<file>',
help="save final model to file in reduced form (pickled model "
"object). A model in reduced form can only be used for "
"segmentation of new words.")
add_arg('-x', '--lexicon', dest="lexfile", default=None, metavar='<file>',
help="output final lexicon to given file")
add_arg('--nbest', dest="nbest", default=1, type=int, metavar='<int>',
help="output n-best viterbi results")
# Options for data formats
add_arg = parser.add_argument_group(
'data format options').add_argument
add_arg('-e', '--encoding', dest='encoding', metavar='<encoding>',
help="encoding of input and output files (if none is given, "
"both the local encoding and UTF-8 are tried)")
add_arg('--lowercase', dest="lowercase", default=False,
action='store_true',
help="lowercase input data")
add_arg('--traindata-list', dest="list", default=False,
action='store_true',
help="input file(s) for batch training are lists "
"(one compound per line, optionally count as a prefix)")
add_arg('--atom-separator', dest="separator", type=_str, default=None,
metavar='<regexp>',
help="atom separator regexp (default %(default)s)")
add_arg('--compound-separator', dest="cseparator", type=_str, default=r'\s+',
metavar='<regexp>',
help="compound separator regexp (default '%(default)s')")
add_arg('--analysis-separator', dest='analysisseparator', type=_str,
default=',', metavar='<str>',
help="separator for different analyses in an annotation file. Use"
" NONE for only allowing one analysis per line")
add_arg('--output-format', dest='outputformat', type=_str,
default=r'{analysis}\n', metavar='<format>',
help="format string for --output file (default: '%(default)s'). "
"Valid keywords are: "
"{analysis} = constructions of the compound, "
"{compound} = compound string, "
"{count} = count of the compound (currently always 1), "
"{logprob} = log-probability of the analysis, and "
"{clogprob} = log-probability of the compound. Valid escape "
"sequences are '\\n' (newline) and '\\t' (tabular)")
add_arg('--output-format-separator', dest='outputformatseparator',
type=_str, default=' ', metavar='<str>',
help="construction separator for analysis in --output file "
"(default: '%(default)s')")
add_arg('--output-newlines', dest='outputnewlines', default=False,
action='store_true',
help="for each newline in input, print newline in --output file "
"(default: '%(default)s')")
# Options for model training
add_arg = parser.add_argument_group(
'training and segmentation options').add_argument
add_arg('-m', '--mode', dest="trainmode", default='init+batch',
metavar='<mode>',
choices=['none', 'batch', 'init', 'init+batch', 'online',
'online+batch'],
help="training mode ('none', 'init', 'batch', 'init+batch', "
"'online', or 'online+batch'; default '%(default)s')")
add_arg('-a', '--algorithm', dest="algorithm", default='recursive',
metavar='<algorithm>', choices=['recursive', 'viterbi'],
help="algorithm type ('recursive', 'viterbi'; default "
"'%(default)s')")
add_arg('-d', '--dampening', dest="dampening", type=_str, default='ones',
metavar='<type>', choices=['none', 'log', 'ones'],
help="frequency dampening for training data ('none', 'log', or "
"'ones'; default '%(default)s')")
add_arg('-f', '--forcesplit', dest="forcesplit", type=list, default=['-'],
metavar='<list>',
help="force split on given atoms (default '-'). The list argument "
"is a string of characthers, use '' for no forced splits.")
add_arg('-F', '--finish-threshold', dest='finish_threshold', type=float,
default=0.005, metavar='<float>',
help="Stopping threshold. Training stops when "
"the improvement of the last iteration is"
"smaller then finish_threshold * #boundaries; "
"(default '%(default)s')")
add_arg('-r', '--randseed', dest="randseed", default=None,
metavar='<seed>',
help="seed for random number generator")
add_arg('-R', '--randsplit', dest="splitprob", default=None, type=float,
metavar='<float>',
help="initialize new words by random splitting using the given "
"split probability (default no splitting)")
add_arg('--skips', dest="skips", default=False, action='store_true',
help="use random skips for frequently seen compounds to speed up "
"training")
add_arg('--batch-minfreq', dest="freqthreshold", type=int, default=1,
metavar='<int>',
help="compound frequency threshold for batch training (default "
"%(default)s)")
add_arg('--max-epochs', dest='maxepochs', type=int, default=None,
metavar='<int>',
help='hard maximum of epochs in training')
add_arg('--nosplit-re', dest="nosplit", type=_str, default=None,
metavar='<regexp>',
help="if the expression matches the two surrounding characters, "
"do not allow splitting (default %(default)s)")
add_arg('--online-epochint', dest="epochinterval", type=int,
default=10000, metavar='<int>',
help="epoch interval for online training (default %(default)s)")
add_arg('--viterbi-smoothing', dest="viterbismooth", default=0,
type=float, metavar='<float>',
help="additive smoothing parameter for Viterbi training "
"and segmentation (default %(default)s)")
add_arg('--viterbi-maxlen', dest="viterbimaxlen", default=30,
type=int, metavar='<int>',
help="maximum construction length in Viterbi training "
"and segmentation (default %(default)s)")
# Options for corpusweight tuning
add_arg = parser.add_mutually_exclusive_group().add_argument
add_arg('-D', '--develset', dest="develfile", default=None,
metavar='<file>',
help="load annotated data for tuning the corpus weight parameter")
add_arg('--morph-length', dest='morphlength', default=None, type=float,
metavar='<float>',
help="tune the corpusweight to obtain the desired average morph "
"length")
add_arg('--num-morph-types', dest='morphtypes', default=None, type=float,
metavar='<float>',
help="tune the corpusweight to obtain the desired number of morph "
"types")
# Options for semi-supervised model training
add_arg = parser.add_argument_group(
'semi-supervised training options').add_argument
add_arg('-w', '--corpusweight', dest="corpusweight", type=float,
default=1.0, metavar='<float>',
help="corpus weight parameter (default %(default)s); "
"sets the initial value if other tuning options are used")
add_arg('--weight-threshold', dest='threshold', default=0.01,
metavar='<float>', type=float,
help='percentual stopping threshold for corpusweight updaters')
add_arg('--full-retrain', dest='fullretrain', action='store_true',
default=False,
help='do a full retrain after any weights have converged')
add_arg('-A', '--annotations', dest="annofile", default=None,
metavar='<file>',
help="load annotated data for semi-supervised learning")
add_arg('-W', '--annotationweight', dest="annotationweight",
type=float, default=None, metavar='<float>',
help="corpus weight parameter for annotated data (if unset, the "
"weight is set to balance the number of tokens in annotated "
"and unannotated data sets)")
# Options for evaluation
add_arg = parser.add_argument_group('Evaluation options').add_argument
add_arg('-G', '--goldstandard', dest='goldstandard', default=None,
metavar='<file>',
help='If provided, evaluate the model against the gold standard')
# Options for logging
add_arg = parser.add_argument_group('logging options').add_argument
add_arg('-v', '--verbose', dest="verbose", type=int, default=1,
metavar='<int>',
help="verbose level; controls what is written to the standard "
"error stream or log file (default %(default)s)")
add_arg('--logfile', dest='log_file', metavar='<file>',
help="write log messages to file in addition to standard "
"error stream")
add_arg('--progressbar', dest='progress', default=False,
action='store_true',
help="Force the progressbar to be displayed (possibly lowers the "
"log level for the standard error stream)")
add_arg = parser.add_argument_group('other options').add_argument
add_arg('-h', '--help', action='help',
help="show this help message and exit")
add_arg('--version', action='version',
version='%(prog)s ' + get_version(),
help="show version number and exit")
return parser
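# A minimal sketch of driving training through the same code path as the
# command line scripts (file names are hypothetical):
#
#     parser = get_default_argparser()
#     args = parser.parse_args(['-t', 'corpus.txt', '-S', 'model.segm'])
#     main(args)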
def initialize_logging(args):
"""Initialize loggers based on command line args"""
if args.verbose >= 2:
loglevel = logging.DEBUG
elif args.verbose >= 1:
loglevel = logging.INFO
else:
loglevel = logging.WARNING
rootlogger = logging.getLogger()
rootlogger.setLevel(logging.DEBUG)
logfile_format = '%(asctime)s %(levelname)s:%(message)s'
date_format = '%Y-%m-%d %H:%M:%S'
console_format = '%(message)s'
console_level = loglevel
if args.log_file is not None or (hasattr(args, 'progress') and args.progress):
# If logging to a file or progress bar is forced, make INFO
# the highest level for the error stream
console_level = max(loglevel, logging.INFO)
# Console handler
ch = logging.StreamHandler()
ch.setLevel(console_level)
ch.setFormatter(logging.Formatter(console_format))
rootlogger.addHandler(ch)
# FileHandler for log_file
if args.log_file is not None:
fh = logging.FileHandler(args.log_file, 'w')
fh.setLevel(loglevel)
fh.setFormatter(logging.Formatter(logfile_format, date_format))
rootlogger.addHandler(fh)
return console_level
@utils.lru_cache(maxsize=LRU_MAX_SIZE)
def _viterbi_segment(model, atoms, smooth, maxlen):
return model.viterbi_segment(atoms, smooth, maxlen)
@utils.lru_cache(maxsize=LRU_MAX_SIZE)
def _viterbi_nbest(model, atoms, nbest, smooth, maxlen):
return model.viterbi_nbest(atoms, nbest, smooth,maxlen)
def main(args):
console_level = initialize_logging(args)
# Don't show the progress bar if the console level is not INFO (e.g.
# debug messages are printed to screen) or if stderr is not a tty
# (but a pipe or a file)
if (console_level != logging.INFO or
(hasattr(sys.stderr, 'isatty') and not sys.stderr.isatty())):
utils.show_progress_bar = False
# Force progress bar
if args.progress:
utils.show_progress_bar = True
if (args.loadfile is None and
args.loadsegfile is None and
len(args.trainfiles) == 0):
raise ArgumentException("either model file or training data should "
"be defined")
if args.randseed is not None:
random.seed(args.randseed)
io = MorfessorIO(encoding=args.encoding,
compound_separator=args.cseparator,
atom_separator=args.separator,
lowercase=args.lowercase)
# Load existing model or create a new one
if args.loadfile is not None:
model = io.read_binary_model_file(args.loadfile)
else:
model = BaselineModel(forcesplit_list=args.forcesplit,
corpusweight=args.corpusweight,
use_skips=args.skips,
nosplit_re=args.nosplit)
if args.loadsegfile is not None:
model.load_segmentations(io.read_segmentation_file(args.loadsegfile))
analysis_sep = (args.analysisseparator
if args.analysisseparator != 'NONE' else None)
if args.annofile is not None:
annotations = io.read_annotations_file(args.annofile,
analysis_sep=analysis_sep)
model.set_annotations(annotations, args.annotationweight)
if args.develfile is not None:
develannots = io.read_annotations_file(args.develfile,
analysis_sep=analysis_sep)
updater = AnnotationCorpusWeight(develannots, args.threshold)
model.set_corpus_weight_updater(updater)
if args.morphlength is not None:
updater = MorphLengthCorpusWeight(args.morphlength, args.threshold)
model.set_corpus_weight_updater(updater)
if args.morphtypes is not None:
updater = NumMorphCorpusWeight(args.morphtypes, args.threshold)
model.set_corpus_weight_updater(updater)
start_corpus_weight = model.get_corpus_coding_weight()
# Set frequency dampening function
if args.dampening == 'none':
dampfunc = None
elif args.dampening == 'log':
dampfunc = lambda x: int(round(math.log(x + 1, 2)))
elif args.dampening == 'ones':
dampfunc = lambda x: 1
else:
raise ArgumentException("unknown dampening type '%s'" % args.dampening)
# Set algorithm parameters
if args.algorithm == 'viterbi':
algparams = (args.viterbismooth, args.viterbimaxlen)
else:
algparams = ()
# Train model
if args.trainmode == 'none':
pass
elif args.trainmode == 'batch':
if len(model.get_compounds()) == 0:
_logger.warning("Model contains no compounds for batch training."
" Use 'init+batch' mode to add new data.")
else:
if len(args.trainfiles) > 0:
_logger.warning("Training mode 'batch' ignores new data "
"files. Use 'init+batch' or 'online' to "
"add new compounds.")
ts = time.time()
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold, args.maxepochs)
te = time.time()
_logger.info("Epochs: %s", e)
_logger.info("Final cost: %s", c)
_logger.info("Training time: %.3fs", (te - ts))
elif len(args.trainfiles) > 0:
ts = time.time()
if args.trainmode == 'init':
if args.list:
data = io.read_corpus_list_files(args.trainfiles)
else:
data = io.read_corpus_files(args.trainfiles)
c = model.load_data(data, args.freqthreshold, dampfunc,
args.splitprob)
elif args.trainmode == 'init+batch':
if args.list:
data = io.read_corpus_list_files(args.trainfiles)
else:
data = io.read_corpus_files(args.trainfiles)
c = model.load_data(data, args.freqthreshold, dampfunc,
args.splitprob)
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold, args.maxepochs)
_logger.info("Epochs: %s", e)
if args.fullretrain:
if abs(model.get_corpus_coding_weight() - start_corpus_weight) > 0.1:
model.set_corpus_weight_updater(
FixedCorpusWeight(model.get_corpus_coding_weight()))
model.clear_segmentation()
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold,
args.maxepochs)
_logger.info("Retrain Epochs: %s", e)
elif args.trainmode == 'online':
data = io.read_corpus_files(args.trainfiles)
e, c = model.train_online(data, dampfunc, args.epochinterval,
args.algorithm, algparams,
args.splitprob, args.maxepochs)
_logger.info("Epochs: %s", e)
elif args.trainmode == 'online+batch':
data = io.read_corpus_files(args.trainfiles)
e, c = model.train_online(data, dampfunc, args.epochinterval,
args.algorithm, algparams,
args.splitprob, args.maxepochs)
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold, args.maxepochs - e)
_logger.info("Epochs: %s", e)
if args.fullretrain:
if abs(model.get_corpus_coding_weight() - start_corpus_weight) > 0.1:
model.clear_segmentation()
e, c = model.train_batch(args.algorithm, algparams,
args.finish_threshold,
args.maxepochs)
_logger.info("Retrain Epochs: %s", e)
else:
raise ArgumentException("unknown training mode '%s'", args.trainmode)
te = time.time()
_logger.info("Final cost: %s", c)
_logger.info("Training time: %.3fs", (te - ts))
else:
_logger.warning("No training data files specified.")
# Save model
if args.savefile is not None:
io.write_binary_model_file(args.savefile, model)
if args.savesegfile is not None:
io.write_segmentation_file(args.savesegfile, model.get_segmentations())
# Output lexicon
if args.lexfile is not None:
io.write_lexicon_file(args.lexfile, model.get_constructions())
if args.savereduced is not None:
model.make_segment_only()
io.write_binary_model_file(args.savereduced, model)
# Segment test data
if len(args.testfiles) > 0:
_logger.info("Segmenting test data...")
outformat = args.outputformat
csep = args.outputformatseparator
outformat = outformat.replace(r"\n", "\n")
outformat = outformat.replace(r"\t", "\t")
keywords = [x[1] for x in string.Formatter().parse(outformat)]
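        # string.Formatter().parse yields (literal_text, field_name,
        # format_spec, conversion) tuples; collecting the field names
        # tells us which optional values (e.g. 'clogprob') the output
        # format actually uses.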
with io._open_text_file_write(args.outfile) as fobj:
testdata = io.read_corpus_files(args.testfiles)
i = 0
for count, atoms in testdata:
if io.atom_separator is None:
compound = "".join(atoms)
else:
compound = io.atom_separator.join(atoms)
if len(atoms) == 0:
# Newline in corpus
if args.outputnewlines:
fobj.write("\n")
continue
if "clogprob" in keywords:
clogprob = model.forward_logprob(atoms)
else:
clogprob = 0
if args.nbest > 1:
nbestlist = _viterbi_nbest(
model, atoms, args.nbest, args.viterbismooth,
args.viterbimaxlen)
for constructions, logp in nbestlist:
analysis = io.format_constructions(constructions,
csep=csep)
fobj.write(outformat.format(analysis=analysis,
compound=compound,
count=count, logprob=logp,
clogprob=clogprob))
else:
constructions, logp = _viterbi_segment(
model, atoms, args.viterbismooth, args.viterbimaxlen)
analysis = io.format_constructions(constructions, csep=csep)
fobj.write(outformat.format(analysis=analysis,
compound=compound,
count=count, logprob=logp,
clogprob=clogprob))
i += 1
if i % 10000 == 0:
sys.stderr.write(".")
sys.stderr.write("\n")
_logger.info("Done.")
if args.goldstandard is not None:
_logger.info("Evaluating Model")
e = MorfessorEvaluation(io.read_annotations_file(args.goldstandard))
result = e.evaluate_model(model, meta_data={'name': 'MODEL'})
print(result.format(FORMAT_STRINGS['default']))
_logger.info("Done")
def get_evaluation_argparser():
import argparse
#TODO factor out redundancies with get_default_argparser()
standard_parser = get_default_argparser()
parser = argparse.ArgumentParser(
prog="morfessor-evaluate",
epilog="""Simple usage example:
%(prog)s gold_standard model1 model2
""",
description=standard_parser.description,
formatter_class=argparse.RawDescriptionHelpFormatter,
add_help=False
)
add_arg = parser.add_argument_group('evaluation options').add_argument
add_arg('--num-samples', dest='numsamples', type=int, metavar='<int>',
default=10, help='number of samples to take for testing')
add_arg('--sample-size', dest='samplesize', type=int, metavar='<int>',
            default=1000, help='size of each test sample')
add_arg = parser.add_argument_group('formatting options').add_argument
add_arg('--format-string', dest='formatstring', metavar='<format>',
            help='Python new-style format string used to report evaluation '
                 'results. The available variables are a value and an '
                 'action separated by an underscore, e.g. fscore_avg for '
                 'the average f-score. The available values are '
                 '"precision", "recall", "fscore" and "samplesize", and '
                 'the available actions: "avg", "max", "min", "values" '
                 'and "count". A final meta-data variable (without an '
                 'action) is "name", the filename of the model. See also '
                 'the format-template option for predefined strings')
add_arg('--format-template', dest='template', metavar='<template>',
default='default',
help='Uses a template string for the format-string options. '
'Available templates are: default, table and latex. '
'If format-string is defined this option is ignored')
add_arg = parser.add_argument_group('file options').add_argument
add_arg('--construction-separator', dest="cseparator", type=_str,
default=' ', metavar='<regexp>',
help="construction separator for test segmentation files"
" (default '%(default)s')")
add_arg('-e', '--encoding', dest='encoding', metavar='<encoding>',
help="encoding of input and output files (if none is given, "
"both the local encoding and UTF-8 are tried)")
add_arg = parser.add_argument_group('logging options').add_argument
add_arg('-v', '--verbose', dest="verbose", type=int, default=1,
metavar='<int>',
help="verbose level; controls what is written to the standard "
"error stream or log file (default %(default)s)")
add_arg('--logfile', dest='log_file', metavar='<file>',
help="write log messages to file in addition to standard "
"error stream")
add_arg = parser.add_argument_group('other options').add_argument
add_arg('-h', '--help', action='help',
help="show this help message and exit")
add_arg('--version', action='version',
version='%(prog)s ' + get_version(),
help="show version number and exit")
add_arg = parser.add_argument
add_arg('goldstandard', metavar='<goldstandard>', nargs=1,
help='gold standard file in standard annotation format')
add_arg('models', metavar='<model>', nargs='+',
help='model files to segment (either binary or Morfessor 1.0 style'
' segmentation models).')
add_arg('-t', '--testsegmentation', dest='test_segmentations', default=[],
action='append',
help='Segmentation of the test set. Note that all words in the '
'gold-standard must be segmented')
return parser
def main_evaluation(args):
""" Separate main for running evaluation and statistical significance
testing. Takes as argument the results of an get_evaluation_argparser()
"""
initialize_logging(args)
io = MorfessorIO(encoding=args.encoding)
ev = MorfessorEvaluation(io.read_annotations_file(args.goldstandard[0]))
results = []
sample_size = args.samplesize
num_samples = args.numsamples
f_string = args.formatstring
if f_string is None:
f_string = FORMAT_STRINGS[args.template]
for f in args.models:
result = ev.evaluate_model(io.read_any_model(f),
configuration=EvaluationConfig(num_samples,
sample_size),
meta_data={'name': os.path.basename(f)})
results.append(result)
print(result.format(f_string))
io.construction_separator = args.cseparator
for f in args.test_segmentations:
segmentation = io.read_segmentation_file(f, False)
result = ev.evaluate_segmentation(segmentation,
configuration=
EvaluationConfig(num_samples,
sample_size),
meta_data={'name':
os.path.basename(f)})
results.append(result)
print(result.format(f_string))
if len(results) > 1 and num_samples > 1:
wsr = WilcoxonSignedRank()
r = wsr.significance_test(results)
WilcoxonSignedRank.print_table(r)
================================================
FILE: morfessor/evaluation.py
================================================
from __future__ import print_function
import collections
import logging
from itertools import chain, product
import math
import random
_logger = logging.getLogger(__name__)
EvaluationConfig = collections.namedtuple('EvaluationConfig',
['num_samples', 'sample_size'])
FORMAT_STRINGS = {
'default': """Filename : {name}
Num samples: {samplesize_count}
Sample size: {samplesize_avg}
F-score : {fscore_avg:.3}
Precision : {precision_avg:.3}
Recall : {recall_avg:.3}""",
'table': "{name:10} {precision_avg:6.3} {recall_avg:6.3} {fscore_avg:6.3}",
'latex': "{name} & {precision_avg:.3} &"
" {recall_avg:.3} & {fscore_avg:.3} \\\\"}
def _sample(compound_list, size, seed):
"""Create a specific size sample from the compound list using a specific
seed"""
return random.Random(seed).sample(compound_list, size)
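# The fixed seed makes sampling reproducible: _create_samples below uses
# the sample index as the seed, so sample i is identical for every
# evaluated model, as required for pairwise significance testing.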
class MorfessorEvaluationResult(object):
"""A MorfessorEvaluationResult is returned by a MorfessorEvaluation
    object. Its purpose is to store the evaluation data and provide nice
formatting options.
Each MorfessorEvaluationResult contains the data of 1 evaluation
(which can have multiple samples).
"""
print_functions = {'avg': lambda x: sum(x) / len(x),
'min': min,
'max': max,
'values': list,
'count': len}
    #TODO maybe add std as a print function?
def __init__(self, meta_data=None):
self.meta_data = meta_data
self.precision = []
self.recall = []
self.fscore = []
self.samplesize = []
self._cache = None
def __getitem__(self, item):
"""Provide dict style interface for all values (standard values and
metadata)"""
if self._cache is None:
self._fill_cache()
return self._cache[item]
def add_data_point(self, precision, recall, f_score, sample_size):
"""Method used by MorfessorEvaluation to add the results of a single
sample to the object"""
self.precision.append(precision)
self.recall.append(recall)
self.fscore.append(f_score)
self.samplesize.append(sample_size)
#clear cache
self._cache = None
def __str__(self):
"""Method for default visualization"""
return self.format(FORMAT_STRINGS['default'])
def _fill_cache(self):
""" Pre calculate all variable / function combinations and put them in
cache"""
self._cache = {'{}_{}'.format(val, func_name): func(getattr(self, val))
for val in ('precision', 'recall', 'fscore',
'samplesize')
for func_name, func in self.print_functions.items()}
self._cache.update(self.meta_data)
def _get_cache(self):
""" Fill the cache (if necessary) and return it"""
if self._cache is None:
self._fill_cache()
return self._cache
def format(self, format_string):
""" Format this object. The format string can contain all variables,
e.g. fscore_avg, precision_values or any item from metadata"""
return format_string.format(**self._get_cache())
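# Illustrative sketch of the <value>_<action> keys described above; the
# numbers and the format string are made up for demonstration.
def _example_result_formatting():
    mer = MorfessorEvaluationResult({'name': 'demo'})
    mer.add_data_point(0.80, 0.70, 0.747, 1000)
    mer.add_data_point(0.82, 0.72, 0.767, 1000)
    # Returns "demo: P=0.81 R=0.71 F=0.757 (2 samples)"
    return mer.format('{name}: P={precision_avg:.3} R={recall_avg:.3} '
                      'F={fscore_avg:.3} ({samplesize_count} samples)')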
class MorfessorEvaluation(object):
""" Do the evaluation of one model, on one testset. The basic procedure is
to create, in a stable manner, a number of samples and evaluate them
independently. The stable selection of samples makes it possible to use
the resulting values for Pair-wise statistical significance testing.
reference_annotations is a standard annotation dictionary:
{compound => ([annoation1],.. ) }
"""
def __init__(self, reference_annotations):
self.reference = {}
for compound, analyses in reference_annotations.items():
self.reference[compound] = list(
tuple(self._segmentation_indices(a)) for a in analyses)
self._samples = {}
def _create_samples(self, configuration=EvaluationConfig(10, 1000)):
"""Create, in a stable manner, n testsets of size x as defined in
test_configuration
"""
#TODO: What is a reasonable limit to warn about a too small testset?
if len(self.reference) < (configuration.num_samples *
configuration.sample_size):
_logger.warning("The test set is too small for this sample size")
compound_list = sorted(self.reference.keys())
self._samples[configuration] = [
_sample(compound_list, configuration.sample_size, i) for i in
range(configuration.num_samples)]
def get_samples(self, configuration=EvaluationConfig(10, 1000)):
"""Get a list of samples. A sample is a list of compounds.
This method is stable, so each time it is called with a specific
test_set and configuration it will return the same samples. Also this
method caches the samples in the _samples variable.
"""
if configuration not in self._samples:
self._create_samples(configuration)
return self._samples[configuration]
def _evaluate(self, prediction):
"""Helper method to get the precision and recall of 1 sample"""
def calc_prop_distance(ref, pred):
if len(ref) == 0:
return 1.0
diff = len(set(ref) - set(pred))
return (len(ref) - diff) / float(len(ref))
wordlist = sorted(set(prediction.keys()) & set(self.reference.keys()))
recall_sum = 0.0
precis_sum = 0.0
for word in wordlist:
if len(word) < 2:
continue
recall_sum += max(calc_prop_distance(r, p)
for p, r in product(prediction[word],
self.reference[word]))
precis_sum += max(calc_prop_distance(p, r)
for p, r in product(prediction[word],
self.reference[word]))
precision = precis_sum / len(wordlist)
recall = recall_sum / len(wordlist)
f_score = 2.0 / (1.0 / precision + 1.0 / recall)
return precision, recall, f_score, len(wordlist)
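    # Worked example: with reference boundaries (3, 7) for 'segmentation'
    # and a prediction with the single boundary (3,), the recall
    # contribution is (2 - 1) / 2 = 0.5 and the precision contribution
    # (1 - 0) / 1 = 1.0; for a one-word sample this gives
    # F = 2 / (1/1.0 + 1/0.5) = 2/3.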
@staticmethod
def _segmentation_indices(annotation):
"""Method to transform a annotation into a tuple of split indices"""
cur_len = 0
for a in annotation[:-1]:
cur_len += len(a)
yield cur_len
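    # For example, the annotation ['seg', 'ment', 'ation'] yields the
    # split indices 3 and 7, the morph boundaries within 'segmentation'.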
def evaluate_model(self, model, configuration=EvaluationConfig(10, 1000),
meta_data=None):
"""Get the prediction of the test samples from the model and do the
        evaluation.
        The meta_data object should preferably contain at least the key
        'name'.
"""
if meta_data is None:
meta_data = {'name': 'UNKNOWN'}
mer = MorfessorEvaluationResult(meta_data)
for i, sample in enumerate(self.get_samples(configuration)):
_logger.debug("Evaluating sample %s", i)
prediction = {}
for compound in sample:
prediction[compound] = [tuple(self._segmentation_indices(
model.viterbi_segment(compound)[0]))]
mer.add_data_point(*self._evaluate(prediction))
return mer
def evaluate_segmentation(self, segmentation,
configuration=EvaluationConfig(10, 1000),
meta_data=None):
"""Method for evaluating an existing segmentation"""
def merge_constructions(constructions):
compound = constructions[0]
for i in range(1, len(constructions)):
compound = compound + constructions[i]
return compound
segmentation = {merge_constructions(x[1]):
[tuple(self._segmentation_indices(x[1]))]
for x in segmentation}
if meta_data is None:
meta_data = {'name': 'UNKNOWN'}
mer = MorfessorEvaluationResult(meta_data)
for i, sample in enumerate(self.get_samples(configuration)):
_logger.debug("Evaluating sample %s", i)
prediction = {k: v for k, v in segmentation.items() if k in sample}
mer.add_data_point(*self._evaluate(prediction))
return mer
class WilcoxonSignedRank(object):
"""Class for doing statistical signficance testing with the Wilcoxon
Signed-Rank test
It implements the Pratt method for handling zero-differences and
applies a 0.5 continuity correction for the z-statistic.
"""
@staticmethod
def _wilcoxon(d, method='pratt', correction=True):
if method not in ('wilcox', 'pratt'):
raise ValueError
if method == 'wilcox':
d = list(filter(lambda a: a != 0, d))
count = len(d)
ranks = WilcoxonSignedRank._rankdata([abs(v) for v in d])
rank_sum_pos = sum(r for r, v in zip(ranks, d) if v > 0)
rank_sum_neg = sum(r for r, v in zip(ranks, d) if v < 0)
test = min(rank_sum_neg, rank_sum_pos)
mean = count * (count + 1) * 0.25
stdev = (count*(count + 1) * (2 * count + 1))
# compensate for duplicate ranks
no_zero_ranks = [r for i, r in enumerate(ranks) if d[i] != 0]
stdev -= 0.5 * sum(x * (x*x-1) for x in
collections.Counter(no_zero_ranks).values())
stdev = math.sqrt(stdev / 24.0)
if correction:
correction = +0.5 if test > mean else -0.5
else:
correction = 0
z = (test - mean - correction) / stdev
return 2 * WilcoxonSignedRank._norm_cum_pdf(abs(z))
@staticmethod
def _rankdata(d):
od = collections.Counter()
for v in d:
od[v] += 1
rank_dict = {}
cur_rank = 1
for val, count in sorted(od.items(), key=lambda x: x[0]):
            # float division: average ranks must not be floored on Python 2
            rank_dict[val] = (cur_rank + (cur_rank + count - 1)) / 2.0
cur_rank += count
return [rank_dict[v] for v in d]
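    # Tied values get their average rank, e.g.
    # _rankdata([1, 2, 2, 3]) == [1.0, 2.5, 2.5, 4.0].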
@staticmethod
def _norm_cum_pdf(z):
"""Pure python implementation of the normal cumulative pdf function"""
return 0.5 - 0.5 * math.erf(z / math.sqrt(2))
def significance_test(self, evaluations, val_property='fscore_values',
name_property='name'):
"""Takes a set of evaluations (which should have the same
test-configuration) and calculates the p-value for the Wilcoxon signed
rank test
Returns a dictionary with (name1,name2) keys and p-values as values.
"""
results = {r[name_property]: r[val_property] for r in evaluations}
if any(len(x) < 10 for x in results.values()):
_logger.error("Too small number of samples for the Wilcoxon test")
return {}
p = {}
for r1, r2 in product(results.keys(), results.keys()):
p[(r1, r2)] = self._wilcoxon([v1-v2
for v1, v2 in zip(results[r1],
results[r2])])
return p
@staticmethod
def print_table(results):
"""Nicely format a results table as returned by significance_test"""
names = sorted(set(r[0] for r in results.keys()))
col_width = max(max(len(n) for n in names), 5)
for h in chain([""], names):
print('{:{width}}'.format(h, width=col_width), end='|')
print()
for name in names:
print('{:{width}}'.format(name, width=col_width), end='|')
for name2 in names:
print('{:{width}.5}'.format(results[(name, name2)],
width=col_width), end='|')
print()
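# Illustrative sketch of running the significance test on two
# hypothetical result sets. Plain dicts stand in for
# MorfessorEvaluationResult objects here, since significance_test only
# needs item access; the scores are made up.
def _example_significance_test():
    a = {'name': 'model_a',
         'fscore_values': [0.61, 0.63, 0.62, 0.60, 0.64,
                           0.62, 0.61, 0.63, 0.62, 0.63]}
    b = {'name': 'model_b',
         'fscore_values': [0.58, 0.60, 0.59, 0.57, 0.61,
                           0.59, 0.58, 0.60, 0.59, 0.60]}
    p_values = WilcoxonSignedRank().significance_test([a, b])
    WilcoxonSignedRank.print_table(p_values)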
================================================
FILE: morfessor/exception.py
================================================
from __future__ import unicode_literals
class MorfessorException(Exception):
"""Base class for exceptions in this module."""
pass
class ArgumentException(Exception):
"""Exception in command line argument parsing."""
pass
class InvalidCategoryError(MorfessorException):
    """Attempt to load data using a different categorization scheme."""
    def __init__(self, category):
        super(InvalidCategoryError, self).__init__(
            'This model does not recognize the category {}'.format(
                category))
class InvalidOperationError(MorfessorException):
    def __init__(self, operation, function_name):
        super(InvalidOperationError, self).__init__(
            ('This model does not have a method {}, and therefore cannot'
             ' perform operation "{}"').format(function_name, operation))
class UnsupportedConfigurationError(MorfessorException):
    def __init__(self, reason):
        super(UnsupportedConfigurationError, self).__init__(
            ('This operation is not supported in this program ' +
             'configuration. Reason: {}.').format(reason))
class SegmentOnlyModelException(MorfessorException):
    def __init__(self):
        super(SegmentOnlyModelException, self).__init__(
            'This model has been reduced to a segment-only model')
================================================
FILE: morfessor/io.py
================================================
import bz2
import codecs
import datetime
import gzip
import locale
import logging
import re
import sys
from . import get_version
from . import utils
try:
# In Python2 import cPickle for better performance
import cPickle as pickle
except ImportError:
import pickle
PY3 = sys.version_info[0] == 3
_logger = logging.getLogger(__name__)
class MorfessorIO(object):
"""Definition for all input and output files. Also handles all
encoding issues.
The only state this class has is the separators used in the data.
Therefore, the same class instance can be used for initializing multiple
files.
"""
def __init__(self, encoding=None, construction_separator=' + ',
comment_start='#', compound_separator=r'\s+',
atom_separator=None, lowercase=False):
self.encoding = encoding
self.construction_separator = construction_separator
self.comment_start = comment_start
self.compound_sep_re = re.compile(compound_separator, re.UNICODE)
self.atom_separator = atom_separator
if atom_separator is not None:
self._atom_sep_re = re.compile(atom_separator, re.UNICODE)
self.lowercase = lowercase
self._version = get_version()
def read_segmentation_file(self, file_name, has_counts=True, **kwargs):
"""Read segmentation file.
File format:
<count> <construction1><sep><construction2><sep>...<constructionN>
"""
_logger.info("Reading segmentations from '%s'...", file_name)
for line in self._read_text_file(file_name):
if has_counts:
count, compound_str = line.split(' ', 1)
else:
count, compound_str = 1, line
constructions = tuple(
self._split_atoms(constr)
for constr in compound_str.split(self.construction_separator))
if self.atom_separator is None:
compound = "".join(constructions)
else:
compound = tuple(atom for constr in constructions
for atom in constr)
yield int(count), compound, constructions
_logger.info("Done.")
def write_segmentation_file(self, file_name, segmentations, **kwargs):
"""Write segmentation file.
File format:
<count> <construction1><sep><construction2><sep>...<constructionN>
"""
_logger.info("Saving segmentations to '%s'...", file_name)
with self._open_text_file_write(file_name) as file_obj:
d = datetime.datetime.now().replace(microsecond=0)
file_obj.write("# Output from Morfessor Baseline %s, %s\n" %
(self._version, d))
for count, _, segmentation in segmentations:
if self.atom_separator is None:
s = self.construction_separator.join(segmentation)
else:
s = self.construction_separator.join(
(self.atom_separator.join(constr)
for constr in segmentation))
file_obj.write("%d %s\n" % (count, s))
_logger.info("Done.")
def read_corpus_files(self, file_names):
"""Read one or more corpus files.
Yield for each compound found (1, compound_atoms).
"""
for file_name in file_names:
for item in self.read_corpus_file(file_name):
yield item
def read_corpus_list_files(self, file_names):
"""Read one or more corpus list files.
Yield for each compound found (count, compound_atoms).
"""
for file_name in file_names:
for item in self.read_corpus_list_file(file_name):
yield item
def read_corpus_file(self, file_name):
"""Read one corpus file.
For each compound, yield (1, compound_atoms).
After each line, yield (0, ()).
"""
_logger.info("Reading corpus from '%s'...", file_name)
for line in self._read_text_file(file_name, raw=True):
for compound in self.compound_sep_re.split(line):
if len(compound) > 0:
yield 1, self._split_atoms(compound)
yield 0, ()
_logger.info("Done.")
def read_corpus_list_file(self, file_name):
"""Read a corpus list file.
Each line has the format:
<count> <compound>
Yield tuples (count, compound_atoms) for each compound.
"""
_logger.info("Reading corpus from list '%s'...", file_name)
for line in self._read_text_file(file_name):
try:
count, compound = line.split(None, 1)
yield int(count), self._split_atoms(compound)
except ValueError:
yield 1, self._split_atoms(line)
_logger.info("Done.")
def read_annotations_file(self, file_name, construction_separator=' ',
analysis_sep=','):
"""Read a annotations file.
Each line has the format:
<compound> <constr1> <constr2>... <constrN>, <constr1>...<constrN>, ...
Yield tuples (compound, list(analyses)).
"""
annotations = {}
_logger.info("Reading annotations from '%s'...", file_name)
for line in self._read_text_file(file_name):
compound, analyses_line = line.split(None, 1)
if compound not in annotations:
annotations[compound] = []
if analysis_sep is not None:
for analysis in analyses_line.split(analysis_sep):
analysis = analysis.strip()
annotations[compound].append(
analysis.strip().split(construction_separator))
else:
annotations[compound].append(
analyses_line.split(construction_separator))
_logger.info("Done.")
return annotations
def write_lexicon_file(self, file_name, lexicon):
"""Write to a Lexicon file all constructions and their counts."""
_logger.info("Saving model lexicon to '%s'...", file_name)
with self._open_text_file_write(file_name) as file_obj:
for construction, count in lexicon:
file_obj.write("%d %s\n" % (count, construction))
_logger.info("Done.")
def read_binary_model_file(self, file_name):
"""Read a pickled model from file."""
_logger.info("Loading model from '%s'...", file_name)
model = self.read_binary_file(file_name)
_logger.info("Done.")
return model
@staticmethod
def read_binary_file(file_name):
"""Read a pickled object from a file."""
with open(file_name, 'rb') as fobj:
obj = pickle.load(fobj)
return obj
def write_binary_model_file(self, file_name, model):
"""Pickle a model to a file."""
_logger.info("Saving model to '%s'...", file_name)
self.write_binary_file(file_name, model)
_logger.info("Done.")
@staticmethod
def write_binary_file(file_name, obj):
"""Pickle an object into a file."""
with open(file_name, 'wb') as fobj:
pickle.dump(obj, fobj, pickle.HIGHEST_PROTOCOL)
def write_parameter_file(self, file_name, params):
"""Write learned or estimated parameters to a file"""
with self._open_text_file_write(file_name) as file_obj:
d = datetime.datetime.now().replace(microsecond=0)
file_obj.write(
'# Parameters for Morfessor {}, {}\n'.format(
self._version, d))
for (key, val) in params.items():
file_obj.write('{}:\t{}\n'.format(key, val))
def read_parameter_file(self, file_name):
"""Read learned or estimated parameters from a file"""
params = {}
line_re = re.compile(r'^(.*)\s*:\s*(.*)$')
for line in self._read_text_file(file_name):
m = line_re.match(line.rstrip())
if m:
key = m.group(1)
val = m.group(2)
try:
val = float(val)
except ValueError:
pass
params[key] = val
return params
def read_any_model(self, file_name):
"""Read a file that is either a binary model or a Morfessor 1.0 style
        model segmentation. This method cannot be used on standard input,
        as the data might need to be read multiple times"""
try:
model = self.read_binary_model_file(file_name)
_logger.info("%s was read as a binary model", file_name)
return model
except BaseException:
pass
from morfessor import BaselineModel
model = BaselineModel()
model.load_segmentations(self.read_segmentation_file(file_name))
_logger.info("%s was read as a segmentation", file_name)
return model
def format_constructions(self, constructions, csep=None, atom_sep=None):
"""Return a formatted string for a list of constructions."""
if csep is None:
csep = self.construction_separator
if atom_sep is None:
atom_sep = self.atom_separator
if utils._is_string(constructions[0]):
# Constructions are strings
return csep.join(constructions)
else:
# Constructions are not strings (should be tuples of strings)
return csep.join(map(lambda x: atom_sep.join(x), constructions))
def _split_atoms(self, construction):
"""Split construction to its atoms."""
if self.atom_separator is None:
return construction
else:
return tuple(self._atom_sep_re.split(construction))
def _open_text_file_write(self, file_name):
"""Open a file for writing with the appropriate compression/encoding"""
if file_name == '-':
file_obj = sys.stdout
if PY3:
return file_obj
elif file_name.endswith('.gz'):
file_obj = gzip.open(file_name, 'wb')
elif file_name.endswith('.bz2'):
file_obj = bz2.BZ2File(file_name, 'wb')
else:
file_obj = open(file_name, 'wb')
if self.encoding is None:
# Take encoding from locale if not set so far
self.encoding = locale.getpreferredencoding()
return codecs.getwriter(self.encoding)(file_obj)
def _open_text_file_read(self, file_name):
"""Open a file for reading with the appropriate compression/encoding"""
if file_name == '-':
if PY3:
inp = sys.stdin
else:
class StdinUnicodeReader:
def __init__(self, encoding):
self.encoding = encoding
if self.encoding is None:
self.encoding = locale.getpreferredencoding()
def __iter__(self):
return self
def next(self):
l = sys.stdin.readline()
if not l:
raise StopIteration()
return l.decode(self.encoding)
inp = StdinUnicodeReader(self.encoding)
else:
if file_name.endswith('.gz'):
file_obj = gzip.open(file_name, 'rb')
elif file_name.endswith('.bz2'):
file_obj = bz2.BZ2File(file_name, 'rb')
else:
file_obj = open(file_name, 'rb')
if self.encoding is None:
# Try to determine encoding if not set so far
self.encoding = self._find_encoding(file_name)
inp = codecs.getreader(self.encoding)(file_obj)
return inp
def _read_text_file(self, file_name, raw=False):
"""Read a text file with the appropriate compression and encoding.
Comments and empty lines are skipped unless raw is True.
"""
inp = self._open_text_file_read(file_name)
try:
for line in inp:
line = line.rstrip()
if not raw and \
(len(line) == 0 or line.startswith(self.comment_start)):
continue
if self.lowercase:
yield line.lower()
else:
yield line
except KeyboardInterrupt:
if file_name == '-':
_logger.info("Finished reading from stdin")
return
else:
raise
@staticmethod
def _find_encoding(*files):
"""Test default encodings on reading files.
If no encoding is given, this method can be used to test which
of the default encodings would work.
"""
test_encodings = ['utf-8', locale.getpreferredencoding()]
for encoding in test_encodings:
ok = True
for f in files:
if f == '-':
continue
try:
if f.endswith('.gz'):
file_obj = gzip.open(f, 'rb')
elif f.endswith('.bz2'):
file_obj = bz2.BZ2File(f, 'rb')
else:
file_obj = open(f, 'rb')
for _ in codecs.getreader(encoding)(file_obj):
pass
except UnicodeDecodeError:
ok = False
break
if ok:
_logger.info("Detected %s encoding", encoding)
return encoding
raise UnicodeError("Can not determine encoding of input files")
================================================
FILE: morfessor/test/__init__.py
================================================
__author__ = 'psmit'
================================================
FILE: morfessor/test/evaluation.py
================================================
import unittest
import itertools
from morfessor.evaluation import WilcoxonSignedRank
class TestWilcoxon(unittest.TestCase):
def setUp(self):
self.obj = WilcoxonSignedRank()
def test_norm_cum_pdf(self):
self.assertAlmostEqual(self.obj._norm_cum_pdf(1.9599639845400), 0.025)
def test_accuracy_wilcoxon(self):
#Same tests as used for scipy.stats.morestats
freq = [1, 4, 16, 15, 8, 4, 5, 1, 2]
nums = range(-4, 5)
x = list(itertools.chain(*[[u] * v for u, v in zip(nums, freq)]))
self.assertEqual(len(x), 56)
p = self.obj._wilcoxon(x, correction=False)
self.assertAlmostEqual(p, 0.00197547303533107)
p = self.obj._wilcoxon(x, "wilcox", correction=False)
self.assertAlmostEqual(p, 0.00641346115861)
x = [120, 114, 181, 188, 180, 146, 121, 191, 132, 113, 127, 112]
y = [133, 143, 119, 189, 112, 199, 198, 113, 115, 121, 142, 187]
p = self.obj._wilcoxon([(a - b) for a, b in zip(x, y)])
self.assertAlmostEqual(p, 0.7240817)
p = self.obj._wilcoxon([(a - b) for a, b in zip(x, y)], correction=False)
self.assertAlmostEqual(p, 0.6948866)
def test_wilcoxon_tie(self):
#Same tests as used for scipy.stats.morestats
p = self.obj._wilcoxon([0.1] * 10, correction=False)
self.assertAlmostEqual(p, 0.001565402)
p = self.obj._wilcoxon([0.1] * 10)
self.assertAlmostEqual(p, 0.001904195)
if __name__ == '__main__':
unittest.main()
================================================
FILE: morfessor/utils.py
================================================
"""Data structures and functions of general utility,
shared between different modules and variants of the software.
"""
import logging
import math
import random
import sys
import types
PY3 = sys.version_info[0] == 3
def _dummy_lru_cache(*args, **kwargs):
return lambda func: func
if PY3:
from functools import lru_cache
else:
try:
# Backport for lru_cache
from backports.functools_lru_cache import lru_cache
except ImportError:
logging.warning(
"LRU cache disabled, install backports.functools_lru_cache to enable.")
lru_cache = _dummy_lru_cache
LOGPROB_ZERO = 1000000
# Progress bar for generators (length unknown):
# Print a dot for every GENERATOR_DOT_FREQ:th dot.
# Set to <= 0 to disable progress bar.
GENERATOR_DOT_FREQ = 500
show_progress_bar = True
def _progress(iter_func):
"""Decorator/function for displaying a progress bar when iterating
through a list.
    iter_func can be either a function providing an iterator (for
    decorator-style use) or an iterator itself.
No progressbar is displayed when the show_progress_bar variable is set to
false.
If the progressbar module is available a fancy percentage style
progressbar is displayed. Otherwise 60 dots are printed as indicator.
"""
if not show_progress_bar:
return iter_func
    # Check whether the progressbar module is available; else use our own
try:
from progressbar import ProgressBar
except ImportError:
class SimpleProgressBar:
"""Create a simple progress bar that prints 60 dots on a single
line, proportional to the progress """
NUM_DOTS = 60
def __init__(self):
self.it = None
self.dotfreq = 100
self.i = 0
def __call__(self, it):
self.it = iter(it)
self.i = 0
# Dot frequency is determined as ceil(len(it) / NUM_DOTS)
self.dotfreq = (len(it) + self.NUM_DOTS - 1) // self.NUM_DOTS
if self.dotfreq < 1:
self.dotfreq = 1
return self
def __iter__(self):
return self
def __next__(self):
self.i += 1
if self.i % self.dotfreq == 0:
sys.stderr.write('.')
sys.stderr.flush()
try:
return next(self.it)
except StopIteration:
sys.stderr.write('\n')
raise
#Needed to be compatible with both Python2 and 3
next = __next__
ProgressBar = SimpleProgressBar
# In case of a decorator (argument is a function),
# wrap the functions result in a ProgressBar and return the new function
if isinstance(iter_func, types.FunctionType):
def i(*args, **kwargs):
if logging.getLogger(__name__).isEnabledFor(logging.INFO):
return ProgressBar()(iter_func(*args, **kwargs))
else:
return iter_func(*args, **kwargs)
return i
#In case of an iterator, wrap it in a ProgressBar and return it.
elif hasattr(iter_func, '__iter__'):
return ProgressBar()(iter_func)
#If all else fails, just return the original.
return iter_func
class Sparse(dict):
"""A defaultdict-like data structure, which tries to remain as sparse
    as possible. If a value becomes equal to the default value, it and
    its associated key are transparently removed.
Only supports immutable values, e.g. namedtuples.
"""
def __init__(self, *pargs, **kwargs):
"""Create a new Sparse datastructure.
Keyword arguments:
default: Default value. Unlike defaultdict this should be a
prototype immutable, not a factory.
"""
self._default = kwargs.pop('default')
dict.__init__(self, *pargs, **kwargs)
def __getitem__(self, key):
try:
return dict.__getitem__(self, key)
except KeyError:
return self._default
def __setitem__(self, key, value):
# attribute check is necessary for unpickling
        if '_default' in self.__dict__ and value == self._default:
if key in self:
del self[key]
else:
dict.__setitem__(self, key, value)
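# Illustrative sketch of the Sparse semantics described above: writing
# the default value removes the key again.
def _example_sparse_usage():
    counts = Sparse(default=0)
    counts['a'] = 2          # stored normally
    counts['a'] = 0          # equal to the default: the key is removed
    assert 'a' not in counts
    assert counts['a'] == 0  # missing keys still read as the default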
def ngrams(sequence, n=2):
"""Returns all ngram tokens in an input sequence, for a specified n.
E.g. ngrams(['A', 'B', 'A', 'B', 'D'], n=2) yields
('A', 'B'), ('B', 'A'), ('A', 'B'), ('B', 'D')
"""
window = []
for item in sequence:
window.append(item)
if len(window) > n:
# trim back to size
window = window[-n:]
if len(window) == n:
yield tuple(window)
def minargmin(sequence):
"""Returns the minimum value and the first index at which it can be
found in the input sequence."""
best = (None, None)
for (i, value) in enumerate(sequence):
if best[0] is None or value < best[0]:
best = (value, i)
return best
def zlog(x):
"""Logarithm which uses constant value for log(0) instead of -inf"""
assert x >= 0.0
if x == 0:
return LOGPROB_ZERO
return -math.log(x)
def _nt_zeros(constructor, zero=0):
"""Convenience function to return a namedtuple initialized to zeros,
without needing to know the number of fields."""
zeros = [zero] * len(constructor._fields)
return constructor(*zeros)
def weighted_sample(data, num_samples):
"""Samples with replacement from the data set so that the probability
of each data point being selected is proportional to the occurrence count.
Arguments:
data: A list of tuples (weight, ...)
num_samples: The number of samples to return
Returns:
a sorted list of indices to data
"""
tokens = sum(x[0] for x in data)
token_indices = sorted([random.randint(0, tokens - 1)
for _ in range(num_samples)])
data_indices = []
d = enumerate(x[0] for x in data)
di = 0
ti = -1
for sample_token_index in token_indices:
while ti < sample_token_index:
(di, weight) = next(d)
ti += weight
data_indices.append(di)
return data_indices
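# For example, weighted_sample([(3, 'a'), (1, 'b')], 2) returns two
# indices into data, each equal to 0 with probability 3/4 and to 1 with
# probability 1/4; sampling is with replacement.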
def _generator_progress(generator):
"""Prints a progress bar for visualizing flow through a generator.
The length of a generator is not known in advance, so the bar has
no fixed length. GENERATOR_DOT_FREQ controls the frequency of dots.
This function wraps the argument generator, returning a new generator.
"""
if GENERATOR_DOT_FREQ <= 0:
return generator
def _progress_wrapper(generator):
for (i, x) in enumerate(generator):
if i % GENERATOR_DOT_FREQ == 0:
sys.stderr.write('.')
sys.stderr.flush()
yield x
sys.stderr.write('\n')
return _progress_wrapper(generator)
def _is_string(obj):
try:
# Python 2
return isinstance(obj, basestring)
except NameError:
# Python 3
return isinstance(obj, str)
================================================
FILE: scripts/morfessor
================================================
#!/usr/bin/env python
import sys
import morfessor
from morfessor import _logger
def main(argv):
parser = morfessor.get_default_argparser()
try:
args = parser.parse_args(argv)
morfessor.main(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error %s %s" % (type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/morfessor-evaluate
================================================
#!/usr/bin/env python
import sys
import morfessor
import morfessor.evaluation
from morfessor import _logger
def main(argv):
parser = morfessor.get_evaluation_argparser()
try:
args = parser.parse_args(argv)
morfessor.main_evaluation(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error %s %s" % (type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/morfessor-segment
================================================
#!/usr/bin/env python
import argparse
import sys
import morfessor
from morfessor import _logger
def main(argv):
parser = morfessor.get_default_argparser()
parser.prog = "morfessor-segment"
parser.epilog = """
Simple usage example (load model.pickled and use it to segment test corpus):
%(prog)s -l model.pickled -o test_corpus.segmented test_corpus.txt
Interactive use (read corpus from user):
%(prog)s -l model.pickled -
"""
keep_options = ['encoding', 'loadfile', 'loadsegfile', 'outfile']
# FIXME Disabled to work around an argparse bug
#'help', 'version']
for action_group in parser._action_groups:
for arg in action_group._group_actions:
if arg.dest not in keep_options:
arg.help = argparse.SUPPRESS
parser.add_argument('testfiles', metavar='<file>', nargs='+',
help='corpus files to segment')
try:
args = parser.parse_args(argv)
morfessor.main(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error %s %s" % (type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/morfessor-train
================================================
#!/usr/bin/env python
import argparse
import sys
import morfessor
from morfessor import _logger
def main(argv):
parser = morfessor.get_default_argparser()
parser.prog = "morfessor-train"
parser.epilog = """
Simple usage example (train a model and save it to model.pickled):
%(prog)s -s model.pickled training_corpus.txt
Interactive use (read corpus from user):
%(prog)s -m online -v 2 -
"""
keep_options = ['savesegfile', 'savefile', 'trainmode', 'dampening',
'encoding', 'list', 'skips', 'annofile', 'develfile',
'fullretrain', 'threshold', 'morphtypes', 'morphlength',
'corpusweight', 'annotationweight', 'help', 'version']
for action_group in parser._action_groups:
for arg in action_group._group_actions:
if arg.dest not in keep_options:
arg.help = argparse.SUPPRESS
parser.add_argument('trainfiles', metavar='<file>', nargs='+',
help='training data files')
try:
args = parser.parse_args(argv)
morfessor.main(args)
except morfessor.ArgumentException as e:
parser.error(e)
except Exception as e:
_logger.error("Fatal Error {} {}".format(type(e), e))
raise
if __name__ == "__main__":
main(sys.argv[1:])
================================================
FILE: scripts/tools/morphlength_from_annotations.py
================================================
from __future__ import division
import fileinput
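# Each input line is expected to be a gold-standard annotation such as
# "segmentation seg ment ation, segment ation". The morph count of each
# word is averaged over its alternative analyses, and the script prints
# total word length divided by total morph count, i.e. the average morph
# length in characters.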
def main():
tot_morph_count = 0
tot_length = 0
for line in fileinput.input():
word, segm = line.strip().split(None, 1)
segmentations = segm.split(',')
        num_morphs = [len([x for x in s.split(None)
                           if x.strip().strip("~") != ""])
                      for s in segmentations]
tot_morph_count += sum(num_morphs) / len(num_morphs)
tot_length += len(word)
print(tot_length / tot_morph_count)
if __name__ == "__main__":
main()
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
from codecs import open
from ez_setup import use_setuptools
use_setuptools()
from setuptools import setup
import re
main_py = open('morfessor/__init__.py', encoding='utf-8').read()
metadata = dict(re.findall("__([a-z]+)__ = '([^']+)'", main_py))
requires = [
# 'progressbar',
]
setup(name='Morfessor',
version=metadata['version'],
author=metadata['author'],
author_email='morpho@aalto.fi',
url='http://morpho.aalto.fi/projects/morpho/morfessor2.html',
description='A tool for unsupervised and semi-supervised morphological segmentation',
packages=['morfessor', 'morfessor.test'],
classifiers=[
'Development Status :: 4 - Beta',
'Intended Audience :: Science/Research',
'License :: OSI Approved :: BSD License',
'Operating System :: OS Independent',
'Programming Language :: Python',
'Topic :: Scientific/Engineering',
],
license="BSD",
scripts=['scripts/morfessor',
'scripts/morfessor-train',
'scripts/morfessor-segment',
'scripts/morfessor-evaluate',
],
install_requires=requires,
extras_require={
'docs': [l.strip() for l in open('docs/build_requirements.txt')]
}
)
SYMBOL INDEX (177 symbols across 10 files)
FILE: ez_setup.py
function _python_cmd (line 34) | def _python_cmd(*args):
function _install (line 38) | def _install(tarball, install_args=()):
function _build_egg (line 66) | def _build_egg(egg, tarball, to_dir):
function _do_download (line 95) | def _do_download(version, download_base, to_dir, download_delay):
function use_setuptools (line 107) | def use_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
function download_setuptools (line 139) | def download_setuptools(version=DEFAULT_VERSION, download_base=DEFAULT_URL,
function _extractall (line 176) | def _extractall(self, path=".", members=None):
function _build_install_args (line 223) | def _build_install_args(options):
function _parse_args (line 235) | def _parse_args():
function main (line 251) | def main(version=DEFAULT_VERSION):
FILE: morfessor/__init__.py
function get_version (line 20) | def get_version():
FILE: morfessor/baseline.py
function _constructions_to_str (line 15) | def _constructions_to_str(constructions):
class BaselineModel (line 33) | class BaselineModel(object):
method __init__ (line 44) | def __init__(self, forcesplit_list=None, corpusweight=None,
method set_corpus_weight_updater (line 93) | def set_corpus_weight_updater(self, corpus_weight):
method _check_segment_only (line 103) | def _check_segment_only(self):
method tokens (line 108) | def tokens(self):
method types (line 113) | def types(self):
method _add_compound (line 117) | def _add_compound(self, compound, c):
method _remove (line 125) | def _remove(self, construction):
method _random_split (line 131) | def _random_split(self, compound, threshold):
method _set_compound_analysis (line 143) | def _set_compound_analysis(self, compound, parts, ptype='rbranch'):
method _update_annotation_choices (line 185) | def _update_annotation_choices(self):
method _best_analysis (line 214) | def _best_analysis(self, choices):
method _force_split (line 231) | def _force_split(self, compound):
method _test_skip (line 248) | def _test_skip(self, construction):
method _viterbi_optimize (line 257) | def _viterbi_optimize(self, compound, addcount=0, maxlen=30):
method _recursive_optimize (line 283) | def _recursive_optimize(self, compound):
method _recursive_split (line 305) | def _recursive_split(self, construction):
method _modify_construction_count (line 356) | def _modify_construction_count(self, construction, dcount):
method _epoch_update (line 390) | def _epoch_update(self, epoch_num):
method segmentation_to_splitloc (line 419) | def segmentation_to_splitloc(constructions):
method _splitloc_to_segmentation (line 429) | def _splitloc_to_segmentation(compound, splitloc):
method _join_constructions (line 444) | def _join_constructions(constructions):
method get_compounds (line 452) | def get_compounds(self):
method get_constructions (line 458) | def get_constructions(self):
method get_cost (line 463) | def get_cost(self):
method get_segmentations (line 471) | def get_segmentations(self):
method load_data (line 479) | def load_data(self, data, freqthreshold=1, count_modifier=None,
method load_segmentations (line 517) | def load_segmentations(self, segmentations):
method set_annotations (line 529) | def set_annotations(self, annotations, annotatedcorpusweight=None):
method segment (line 542) | def segment(self, compound):
method train_batch (line 560) | def train_batch(self, algorithm='recursive', algorithm_params=(),
method train_online (line 625) | def train_online(self, data, count_modifier=None, epoch_interval=10000,
method viterbi_segment (line 719) | def viterbi_segment(self, compound, addcount=1.0, maxlen=30):
method forward_logprob (line 812) | def forward_logprob(self, compound):
method viterbi_nbest (line 861) | def viterbi_nbest(self, compound, n, addcount=1.0, maxlen=30):
method get_corpus_coding_weight (line 962) | def get_corpus_coding_weight(self):
method set_corpus_coding_weight (line 965) | def set_corpus_coding_weight(self, weight):
method make_segment_only (line 969) | def make_segment_only(self):
method clear_segmentation (line 980) | def clear_segmentation(self):
class CorpusWeight (line 985) | class CorpusWeight(object):
method move_direction (line 987) | def move_direction(cls, model, direction, epoch):
class FixedCorpusWeight (line 1000) | class FixedCorpusWeight(CorpusWeight):
method __init__ (line 1001) | def __init__(self, weight):
method update (line 1004) | def update(self, model, _):
class AnnotationCorpusWeight (line 1009) | class AnnotationCorpusWeight(CorpusWeight):
method __init__ (line 1015) | def __init__(self, devel_set, threshold=0.01):
method update (line 1019) | def update(self, model, epoch):
method _boundary_recall (line 1032) | def _boundary_recall(cls, prediction, reference):
method _bpr_evaluation (line 1055) | def _bpr_evaluation(cls, prediction, reference):
method _estimate_segmentation_dir (line 1064) | def _estimate_segmentation_dir(self, segments, annotations):
class MorphLengthCorpusWeight (line 1088) | class MorphLengthCorpusWeight(CorpusWeight):
method __init__ (line 1089) | def __init__(self, morph_lenght, threshold=0.01):
method update (line 1093) | def update(self, model, epoch):
method calc_morph_length (line 1108) | def calc_morph_length(cls, model):
class NumMorphCorpusWeight (line 1122) | class NumMorphCorpusWeight(CorpusWeight):
method __init__ (line 1123) | def __init__(self, num_morph_types, threshold=0.01):
method update (line 1127) | def update(self, model, epoch):
class Encoding (line 1142) | class Encoding(object):
method __init__ (line 1149) | def __init__(self, weight=1.0):
method types (line 1165) | def types(self):
method _logfactorial (line 1173) | def _logfactorial(cls, n):
method frequency_distribution_cost (line 1186) | def frequency_distribution_cost(self):
method permutations_cost (line 1199) | def permutations_cost(self):
method update_count (line 1203) | def update_count(self, construction, old_count, new_count):
method get_cost (line 1211) | def get_cost(self):
class CorpusEncoding (line 1224) | class CorpusEncoding(Encoding):
method __init__ (line 1231) | def __init__(self, lexicon_encoding, weight=1.0):
method types (line 1236) | def types(self):
method frequency_distribution_cost (line 1243) | def frequency_distribution_cost(self):
method get_cost (line 1255) | def get_cost(self):
class AnnotatedCorpusEncoding (line 1270) | class AnnotatedCorpusEncoding(Encoding):
method __init__ (line 1276) | def __init__(self, corpus_coding, weight=None, penalty=-9999.9):
method set_constructions (line 1299) | def set_constructions(self, constructions):
method set_count (line 1308) | def set_count(self, construction, count):
method update_count (line 1318) | def update_count(self, construction, old_count, new_count):
method update_weight (line 1334) | def update_weight(self):
method get_cost (line 1346) | def get_cost(self):
class LexiconEncoding (line 1357) | class LexiconEncoding(Encoding):
method __init__ (line 1360) | def __init__(self):
method types (line 1366) | def types(self):
method add (line 1373) | def add(self, construction):
method remove (line 1384) | def remove(self, construction):
method get_codelength (line 1395) | def get_codelength(self, construction):
FILE: morfessor/cmd.py
function get_default_argparser (line 33) | def get_default_argparser():
function initialize_logging (line 292) | def initialize_logging(args):
function _viterbi_segment (line 331) | def _viterbi_segment(model, atoms, smooth, maxlen):
function _viterbi_nbest (line 336) | def _viterbi_nbest(model, atoms, nbest, smooth, maxlen):
function main (line 340) | def main(args):
function get_evaluation_argparser (line 571) | def get_evaluation_argparser():
function main_evaluation (line 648) | def main_evaluation(args):
FILE: morfessor/evaluation.py
function _sample (line 26) | def _sample(compound_list, size, seed):
class MorfessorEvaluationResult (line 32) | class MorfessorEvaluationResult(object):
method __init__ (line 49) | def __init__(self, meta_data=None):
method __getitem__ (line 59) | def __getitem__(self, item):
method add_data_point (line 67) | def add_data_point(self, precision, recall, f_score, sample_size):
method __str__ (line 78) | def __str__(self):
method _fill_cache (line 82) | def _fill_cache(self):
method _get_cache (line 91) | def _get_cache(self):
method format (line 97) | def format(self, format_string):
class MorfessorEvaluation (line 103) | class MorfessorEvaluation(object):
method __init__ (line 112) | def __init__(self, reference_annotations):
method _create_samples (line 121) | def _create_samples(self, configuration=EvaluationConfig(10, 1000)):
method get_samples (line 136) | def get_samples(self, configuration=EvaluationConfig(10, 1000)):
method _evaluate (line 148) | def _evaluate(self, prediction):
method _segmentation_indices (line 180) | def _segmentation_indices(annotation):
method evaluate_model (line 187) | def evaluate_model(self, model, configuration=EvaluationConfig(10, 1000),
method evaluate_segmentation (line 211) | def evaluate_segmentation(self, segmentation,
class WilcoxonSignedRank (line 240) | class WilcoxonSignedRank(object):
method _wilcoxon (line 250) | def _wilcoxon(d, method='pratt', correction=True):
method _rankdata (line 282) | def _rankdata(d):
method _norm_cum_pdf (line 296) | def _norm_cum_pdf(z):
method significance_test (line 300) | def significance_test(self, evaluations, val_property='fscore_values',
method print_table (line 321) | def print_table(results):
FILE: morfessor/exception.py
class MorfessorException (line 4) | class MorfessorException(Exception):
class ArgumentException (line 9) | class ArgumentException(Exception):
class InvalidCategoryError (line 14) | class InvalidCategoryError(MorfessorException):
method __init__ (line 16) | def __init__(self, category):
class InvalidOperationError (line 22) | class InvalidOperationError(MorfessorException):
method __init__ (line 23) | def __init__(self, operation, function_name):
class UnsupportedConfigurationError (line 29) | class UnsupportedConfigurationError(MorfessorException):
method __init__ (line 30) | def __init__(self, reason):
class SegmentOnlyModelException (line 36) | class SegmentOnlyModelException(MorfessorException):
method __init__ (line 37) | def __init__(self):
FILE: morfessor/io.py
class MorfessorIO (line 24) | class MorfessorIO(object):
method __init__ (line 34) | def __init__(self, encoding=None, construction_separator=' + ',
method read_segmentation_file (line 47) | def read_segmentation_file(self, file_name, has_counts=True, **kwargs):
method write_segmentation_file (line 71) | def write_segmentation_file(self, file_name, segmentations, **kwargs):
method read_corpus_files (line 93) | def read_corpus_files(self, file_names):
method read_corpus_list_files (line 103) | def read_corpus_list_files(self, file_names):
method read_corpus_file (line 113) | def read_corpus_file(self, file_name):
method read_corpus_list_file (line 128) | def read_corpus_list_file(self, file_name):
method read_annotations_file (line 146) | def read_annotations_file(self, file_name, construction_separator=' ',
method write_lexicon_file (line 176) | def write_lexicon_file(self, file_name, lexicon):
method read_binary_model_file (line 184) | def read_binary_model_file(self, file_name):
method read_binary_file (line 192) | def read_binary_file(file_name):
method write_binary_model_file (line 198) | def write_binary_model_file(self, file_name, model):
method write_binary_file (line 205) | def write_binary_file(file_name, obj):
method write_parameter_file (line 210) | def write_parameter_file(self, file_name, params):
method read_parameter_file (line 220) | def read_parameter_file(self, file_name):
method read_any_model (line 236) | def read_any_model(self, file_name):
method format_constructions (line 253) | def format_constructions(self, constructions, csep=None, atom_sep=None):
method _split_atoms (line 266) | def _split_atoms(self, construction):
method _open_text_file_write (line 273) | def _open_text_file_write(self, file_name):
method _open_text_file_read (line 290) | def _open_text_file_read(self, file_name):
method _read_text_file (line 325) | def _read_text_file(self, file_name, raw=False):
method _find_encoding (line 350) | def _find_encoding(*files):
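
MorfessorIO bundles all reading and writing: text corpora, segmentation and annotation files, parameter files, and pickled binary models. A minimal end-to-end sketch using only the public methods listed above; the file names are placeholders:

import morfessor

io = morfessor.MorfessorIO(encoding='utf-8')

# Train a baseline model on a plain-text corpus;
# read_corpus_file yields (count, compound_atoms) pairs.
model = morfessor.BaselineModel()
model.load_data(io.read_corpus_file('corpus.txt'))
model.train_batch()

# Persist the model both as a pickle and as a text segmentation file.
io.write_binary_model_file('model.bin', model)
io.write_segmentation_file('segments.txt', model.get_segmentations())

# Later: reload the model and segment a new word.
model = io.read_binary_model_file('model.bin')
print(model.viterbi_segment('uncommonly')[0])
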
FILE: morfessor/test/evaluation.py
class TestWilcoxon (line 6) | class TestWilcoxon(unittest.TestCase):
method setUp (line 7) | def setUp(self):
method test_norm_cum_pdf (line 10) | def test_norm_cum_pdf(self):
method test_accuracy_wilcoxon (line 13) | def test_accuracy_wilcoxon(self):
method test_wilcoxon_tie (line 36) | def test_wilcoxon_tie(self):
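
The test module exercises only the WilcoxonSignedRank implementation. With the package importable, it can be run directly with the standard unittest runner:

    python -m unittest morfessor.test.evaluation
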
FILE: morfessor/utils.py
function _dummy_lru_cache (line 15) | def _dummy_lru_cache(*args, **kwargs):
function _progress (line 43) | def _progress(iter_func):
class Sparse (line 123) | class Sparse(dict):
method __init__ (line 131) | def __init__(self, *pargs, **kwargs):
method __getitem__ (line 141) | def __getitem__(self, key):
method __setitem__ (line 147) | def __setitem__(self, key, value):
function ngrams (line 156) | def ngrams(sequence, n=2):
function minargmin (line 172) | def minargmin(sequence):
function zlog (line 182) | def zlog(x):
function _nt_zeros (line 190) | def _nt_zeros(constructor, zero=0):
function weighted_sample (line 197) | def weighted_sample(data, num_samples):
function _generator_progress (line 222) | def _generator_progress(generator):
function _is_string (line 244) | def _is_string(obj):
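
Most of these helpers are internal, but several signatures are self-describing. Plausible reference implementations for three of them, written from the names and signatures alone; the actual bodies in morfessor/utils.py may differ in detail:

import math

def ngrams(sequence, n=2):
    # Yield the consecutive n-grams of a sequence as tuples.
    for i in range(len(sequence) - n + 1):
        yield tuple(sequence[i:i + n])

def minargmin(sequence):
    # Return (minimum value, index of its first occurrence).
    index, value = min(enumerate(sequence), key=lambda pair: pair[1])
    return value, index

def zlog(x):
    # Negative logarithm that maps zero to infinity instead of raising,
    # convenient for probability-to-cost conversions.
    return float('inf') if x == 0 else -math.log(x)

assert list(ngrams('abcd', 2)) == [('a', 'b'), ('b', 'c'), ('c', 'd')]
assert minargmin([3, 1, 2]) == (1, 1)
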
FILE: scripts/tools/morphlength_from_annotations.py
function main (line 5) | def main():
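
This helper computes average morph statistics from annotation data read via fileinput (standard input or files given as arguments). A minimal sketch of such a computation, assuming whitespace-separated morphs on each input line; the actual script may differ in detail:

from __future__ import division
import fileinput

def main():
    tot_morph_count = 0
    tot_length = 0
    for line in fileinput.input():
        # Assumed format: one analysis per line, morphs separated
        # by whitespace.
        morphs = line.split()
        tot_morph_count += len(morphs)
        tot_length += sum(len(m) for m in morphs)
    if tot_morph_count > 0:
        print('morphs: %d, average length: %.3f'
              % (tot_morph_count, tot_length / tot_morph_count))

if __name__ == '__main__':
    main()
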