Repository: opensemanticsearch/open-semantic-etl
Branch: master
Commit: f51efea6c18f
Files: 123
Total size: 363.3 KB

Directory structure:
gitextract_2awl829e/

├── .github/
│   └── FUNDING.yml
├── .gitignore
├── .gitmodules
├── DEBIAN/
│   ├── conffiles
│   ├── control
│   ├── postinst
│   └── prerm
├── Dockerfile
├── LICENSE
├── build-deb
├── docker-compose.test.yml
├── docker-compose.ubuntu.test.yml
├── docker-entrypoint.sh
├── etc/
│   ├── opensemanticsearch/
│   │   ├── blacklist/
│   │   │   ├── blacklist-url
│   │   │   ├── blacklist-url-prefix
│   │   │   ├── blacklist-url-regex
│   │   │   ├── blacklist-url-suffix
│   │   │   ├── enhance_extract_law/
│   │   │   │   └── blacklist-lawcode-if-no-clause
│   │   │   ├── enhance_zip/
│   │   │   │   ├── blacklist-contenttype
│   │   │   │   ├── blacklist-contenttype-prefix
│   │   │   │   ├── blacklist-contenttype-regex
│   │   │   │   ├── blacklist-contenttype-suffix
│   │   │   │   ├── whitelist-contenttype
│   │   │   │   ├── whitelist-contenttype-prefix
│   │   │   │   ├── whitelist-contenttype-regex
│   │   │   │   └── whitelist-contenttype-suffix
│   │   │   ├── textanalysis/
│   │   │   │   ├── blacklist-fieldname
│   │   │   │   ├── blacklist-fieldname-prefix
│   │   │   │   └── blacklist-fieldname-suffix
│   │   │   ├── whitelist-url
│   │   │   ├── whitelist-url-prefix
│   │   │   ├── whitelist-url-regex
│   │   │   └── whitelist-url-suffix
│   │   ├── connector-files
│   │   ├── connector-web
│   │   ├── enhancer-rdf
│   │   ├── etl
│   │   ├── facets
│   │   ├── filemonitoring/
│   │   │   └── files
│   │   ├── ocr/
│   │   │   └── dictionary.txt
│   │   ├── regex/
│   │   │   ├── email.tsv
│   │   │   ├── iban.tsv
│   │   │   └── phone.tsv
│   │   └── task_priorities
│   └── systemd/
│       └── system/
│           ├── opensemanticetl-filemonitoring.service
│           └── opensemanticetl.service
└── src/
    └── opensemanticetl/
        ├── __init__.py
        ├── clean_title.py
        ├── enhance_annotations.py
        ├── enhance_contenttype_group.py
        ├── enhance_csv.py
        ├── enhance_detect_language_tika_server.py
        ├── enhance_entity_linking.py
        ├── enhance_extract_email.py
        ├── enhance_extract_hashtags.py
        ├── enhance_extract_law.py
        ├── enhance_extract_money.py
        ├── enhance_extract_phone.py
        ├── enhance_extract_text_tika_server.py
        ├── enhance_file_mtime.py
        ├── enhance_file_size.py
        ├── enhance_html.py
        ├── enhance_mapping_id.py
        ├── enhance_mimetype.py
        ├── enhance_multilingual.py
        ├── enhance_ner_spacy.py
        ├── enhance_ner_stanford.py
        ├── enhance_ocr.py
        ├── enhance_path.py
        ├── enhance_pdf_ocr.py
        ├── enhance_pdf_page.py
        ├── enhance_pdf_page_preview.py
        ├── enhance_pst.py
        ├── enhance_rdf.py
        ├── enhance_rdf_annotations_by_http_request.py
        ├── enhance_regex.py
        ├── enhance_sentence_segmentation.py
        ├── enhance_warc.py
        ├── enhance_xml.py
        ├── enhance_xmp.py
        ├── enhance_zip.py
        ├── etl.py
        ├── etl_delete.py
        ├── etl_enrich.py
        ├── etl_file.py
        ├── etl_filedirectory.py
        ├── etl_filemonitoring.py
        ├── etl_hypothesis.py
        ├── etl_plugin_core.py
        ├── etl_rss.py
        ├── etl_sitemap.py
        ├── etl_sparql.py
        ├── etl_twitter_scraper.py
        ├── etl_web.py
        ├── etl_web_crawl.py
        ├── export_elasticsearch.py
        ├── export_json.py
        ├── export_neo4j.py
        ├── export_print.py
        ├── export_queue_files.py
        ├── export_solr.py
        ├── filter_blacklist.py
        ├── filter_file_not_modified.py
        ├── move_indexed_file.py
        ├── requirements.txt
        ├── tasks.py
        ├── test_enhance_detect_language_tika_server.py
        ├── test_enhance_extract_email.py
        ├── test_enhance_extract_law.py
        ├── test_enhance_extract_money.py
        ├── test_enhance_extract_text_tika_server.py
        ├── test_enhance_mapping_id.py
        ├── test_enhance_ner_spacy.py
        ├── test_enhance_path.py
        ├── test_enhance_pdf_ocr.py
        ├── test_enhance_regex.py
        ├── test_enhance_warc.py
        ├── test_etl_file.py
        ├── test_move_indexed_files.py
        └── testdata/
            ├── README.md
            ├── example.warc
            ├── run_integrationtests.sh
            └── run_tests.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/FUNDING.yml
================================================
custom: ['https://www.paypal.me/MMandalka']


================================================
FILE: .gitignore
================================================
__pycache__
.project
.pydevproject
.settings


================================================
FILE: .gitmodules
================================================
[submodule "src/open-semantic-entity-search-api"]
	path = src/open-semantic-entity-search-api
	url = https://github.com/opensemanticsearch/open-semantic-entity-search-api.git
	branch = master
[submodule "src/tesseract-ocr-cache"]
	path = src/tesseract-ocr-cache
	url = https://github.com/opensemanticsearch/tesseract-ocr-cache.git


================================================
FILE: DEBIAN/conffiles
================================================
/etc/opensemanticsearch/etl
/etc/opensemanticsearch/filemonitoring/files
/etc/opensemanticsearch/connector-files
/etc/opensemanticsearch/connector-web
/etc/opensemanticsearch/enhancer-rdf
/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname
/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-prefix
/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-suffix
/etc/opensemanticsearch/blacklist/blacklist-url
/etc/opensemanticsearch/blacklist/blacklist-url-prefix
/etc/opensemanticsearch/blacklist/blacklist-url-suffix
/etc/opensemanticsearch/blacklist/blacklist-url-regex
/etc/opensemanticsearch/blacklist/whitelist-url
/etc/opensemanticsearch/blacklist/whitelist-url-prefix
/etc/opensemanticsearch/blacklist/whitelist-url-suffix
/etc/opensemanticsearch/blacklist/whitelist-url-regex
/etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype
/etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-prefix
/etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-suffix
/etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-regex
/etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype
/etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype-prefix
/etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype-suffix
/etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype-regex


================================================
FILE: DEBIAN/control
================================================
Package: open-semantic-etl
Version: 21.10.18
Section: misc
Priority: optional
Architecture: all
Depends: tika-server(>=0), python3-tika(>=0), curl(>=0), python3-pycurl(>=0), python3-rdflib(>=0), python3-sparqlwrapper(>=0), file(>=0), python3-requests(>=0), python3-pysolr(>=0), python3-dateutil(>=0), python3-lxml(>=0), python3-feedparser(>=0), poppler-utils(>=0), pst-utils(>=0),rabbitmq-server(>=0),python3-pyinotify(>=0),python3-pip(>=0), python3-dev(>=0), build-essential(>=0), libssl-dev(>=0), libffi-dev(>=0), tesseract-ocr(>=0), tesseract-ocr-deu(>=0)
Installed-Size: 100
Maintainer: Markus Mandalka <debian@mandalka.name>
Homepage: https://opensemanticsearch.org/
Description: Crawler to index files and directories to Solr
 Index your files to Solr.
 If tesseract-ocr installed there will be character recognition on images.
 Hint: install ocr language files like tesseract-ocr-deu for german texts.


================================================
FILE: DEBIAN/postinst
================================================
#!/bin/sh

adduser --system --disabled-password opensemanticetl
groupadd -r tesseract_cache
usermod -a -G tesseract_cache opensemanticetl

# rights for OCR cache
chown opensemanticetl:tesseract_cache /var/cache/tesseract
chmod 770 /var/cache/tesseract

# rights for thumbnail dir
chown opensemanticetl /var/opensemanticsearch/media/thumbnails
chmod o+w /var/opensemanticsearch/media/thumbnails


# install dependencies
pip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt


# load our additional systemd service config
systemctl daemon-reload

# start while booting
systemctl enable opensemanticetl
systemctl enable opensemanticetl-filemonitoring

# (re)start after installation (or upgrade)
systemctl restart opensemanticetl


================================================
FILE: DEBIAN/prerm
================================================
#!/bin/sh

systemctl disable opensemanticetl-filemonitoring

systemctl stop opensemanticetl-filemonitoring

systemctl disable opensemanticetl

systemctl stop opensemanticetl

exit 0


================================================
FILE: Dockerfile
================================================
ARG FROM=debian:bullseye
FROM ${FROM}

ENV DEBIAN_FRONTEND=noninteractive
ENV CRYPTOGRAPHY_DONT_BUILD_RUST=1

RUN apt-get update && apt-get install --no-install-recommends --yes \
    build-essential \
    curl \
    file \
    libffi-dev \
    librabbitmq4 \
    libssl-dev \
    poppler-utils \
    pst-utils \
    python3-dateutil \
    python3-dev \
    python3-feedparser \
    python3-lxml \
    python3-pip \
    python3-pycurl \
    python3-pyinotify \
    python3-pysolr \
    python3-rdflib \
    python3-requests \
    python3-scrapy \
    python3-setuptools \
    python3-sparqlwrapper \
    python3-wheel \
    tesseract-ocr \
#    tesseract-ocr-all \
    && apt-get clean -y && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

COPY ./src/opensemanticetl/requirements.txt /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt
# install Python PIP dependecies
RUN pip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt

COPY ./src/opensemanticetl /usr/lib/python3/dist-packages/opensemanticetl
COPY ./src/tesseract-ocr-cache/tesseract_cache /usr/lib/python3/dist-packages/tesseract_cache
COPY ./src/tesseract-ocr-cache/tesseract_fake /usr/lib/python3/dist-packages/tesseract_fake
COPY ./src/open-semantic-entity-search-api/src/entity_linking /usr/lib/python3/dist-packages/entity_linking
COPY ./src/open-semantic-entity-search-api/src/entity_manager /usr/lib/python3/dist-packages/entity_manager

COPY docker-entrypoint.sh /
RUN chmod 755 /docker-entrypoint.sh

# add user
RUN adduser --system --disabled-password opensemanticetl

RUN mkdir /var/cache/tesseract
RUN chown opensemanticetl /var/cache/tesseract

USER opensemanticetl

# start Open Semantic ETL celery workers (reading and executing ETL tasks from message queue)
CMD ["/docker-entrypoint.sh"]


================================================
FILE: LICENSE
================================================
                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU General Public License is a free, copyleft license for
software and other kinds of works.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.  We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors.  You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights.  Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received.  You must make sure that they, too, receive
or can get the source code.  And you must show them these terms so they
know their rights.

  Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.

  For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software.  For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.

  Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so.  This is fundamentally incompatible with the aim of
protecting users' freedom to change the software.  The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable.  Therefore, we
have designed this version of the GPL to prohibit the practice for those
products.  If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.

  Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary.  To prevent this, the GPL assures that
patents cannot be used to render the program non-free.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Use with the GNU Affero General Public License.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    {one line to give the program's name and a brief idea of what it does.}
    Copyright (C) {year}  {name of author}

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

    {project}  Copyright (C) {year}  {fullname}
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.

  The GNU General Public License does not permit incorporating your program
into proprietary programs.  If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.  But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.


================================================
FILE: build-deb
================================================
#!/bin/sh

VERSION=`date +%y.%m.%d`
PACKAGE=open-semantic-etl_${VERSION}.deb
BUILDDIR=/tmp/open-semantic-etl-$$.deb


#
# Build standard package (preconfigured for Solr)
#

echo "Building ${PACKAGE} in temp directory ${BUILDDIR}"

mkdir ${BUILDDIR}
cp -a DEBIAN ${BUILDDIR}/
cp -a etc ${BUILDDIR}/
cp -a usr ${BUILDDIR}/
mkdir -p ${BUILDDIR}/usr/lib/python3/dist-packages
cp -a src/* ${BUILDDIR}/usr/lib/python3/dist-packages/

mkdir -p ${BUILDDIR}/var/cache/tesseract

mkdir -p ${BUILDDIR}/var/opensemanticsearch/media/thumbnails

# Build standard package (preconfigured for Solr)
dpkg -b ${BUILDDIR} ${PACKAGE}


#
# Build alternate package (preconfigured for Elasticsearch)
#

# change config file and set export plugin to Elasticsearch
PACKAGE=open-semantic-etl-elasticsearch_${VERSION}.deb

echo "Building ${PACKAGE} in temp directory ${BUILDDIR}"

# change option "config['export']" in ${BUILDDIR}/etc/opensemanticsearch/etl from "solr" to "elasticsearch" by commenting / uncommenting

sed -r -e "s/(config\['export'\] = 'export_solr')/#\1/g" -e "s/(config\['index'\] = 'core1')/#\1/g" -e "s/(#)(config\['export'\] = 'export_elasticsearch')/\2/"  -e "s/(#)(config\['index'\] = 'opensemanticsearch')/\2/" -i ${BUILDDIR}/etc/opensemanticsearch/etl

# todo: delete dependency on pysolr

# Build the alternate package
dpkg -b ${BUILDDIR} ${PACKAGE}


================================================
FILE: docker-compose.test.yml
================================================
sut:
  build: .
  command: /usr/lib/python3/dist-packages/opensemanticetl/test/run_tests.sh


================================================
FILE: docker-compose.ubuntu.test.yml
================================================
version: '3'
services:
  sut:
    build:
      context: .
      args:
        FROM: ubuntu:focal
    command: /usr/lib/python3/dist-packages/opensemanticetl/test/run_tests.sh


================================================
FILE: docker-entrypoint.sh
================================================
#! /bin/sh

# docker-entrypoint for opensemanticsearch/open-semantic-etl

# wait for the apps container to finish initializing:
while ! curl -m 1 -sf http://apps >/dev/null 2>&1
do
	sleep 1
done

exec /usr/bin/python3 /usr/lib/python3/dist-packages/opensemanticetl/tasks.py


================================================
FILE: etc/opensemanticsearch/blacklist/blacklist-url
================================================
# Blacklist of URLs


================================================
FILE: etc/opensemanticsearch/blacklist/blacklist-url-prefix
================================================
# Blacklist of URL Prefixes like domains or paths


================================================
FILE: etc/opensemanticsearch/blacklist/blacklist-url-regex
================================================
# Blacklist URLs with text patterns by regular expressions (regex)


================================================
FILE: etc/opensemanticsearch/blacklist/blacklist-url-suffix
================================================
# Blacklist of URL Suffixes like file endings
.css
.CSS
.Css

================================================
FILE: etc/opensemanticsearch/blacklist/enhance_extract_law/blacklist-lawcode-if-no-clause
================================================
# Preferred labels of Law codes will be only added to facet "Law code",
# if the following configured (alternate) labels are directly before or after a
# law clause (f.e. in text "abc § 123 CC xyz"), but not if such blacklisted
# (alternate) label stands alone
# (f.e. in text "abc CC xyz" or in "CC: mail@domain) because too ambiguous


# too ambiguous alternate label from Wikidata entity Q206834 "Swiss Civil Code"
CC

# too ambiguous alternate label from Wikidata entity Q56045 "Basic Law for the Federal Republic of Germany"
GG

# too ambiguous alternate label from Wikidata entity Q187719 "Corpus Juris Civilis"
Institutes

# too ambiguous alternate label from Wikidata entity Q7101313 "Oregon Revised Statutes"
ORS


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype
================================================
# Blacklist of contenttypes


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-prefix
================================================
# Blacklist of contenttype prefixes

# Open Office / Libreoffice / MS Office
# Open document format and MS office open xml format is a zip archive with the document as XML, the embedded images and meta data as XML
# Tika will extract the main content, which - if you do not forensics - is enough in most cases.
# So we dont want additional handle each single (metadata) file in this archive, so we deactivate the ZIP plugin for that content type
# Since this is a prefix blacklist, it will stop unzip application/vnd.oasis.opendocument.text, application/vnd.oasis.opendocument.spreadsheet and so on ...

application/vnd.oasis.opendocument.
application/vnd.openxmlformats-officedocument.
application/msword
application/vnd.ms-word.
application/msexcel
application/vnd.ms-excel.
application/mspowerpoint
application/vnd.ms-powerpoint.


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-regex
================================================
# Blacklist contenttypes with text patterns by regular expressions (regex)


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/blacklist-contenttype-suffix
================================================
# Blacklist of contenttype suffixes


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype
================================================
# Whitelist of contenttypes


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype-prefix
================================================
# Whitelist of contenttype prefixes


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype-regex
================================================
# Whitelist contenttypes with text patterns by regular expressions (regex)


================================================
FILE: etc/opensemanticsearch/blacklist/enhance_zip/whitelist-contenttype-suffix
================================================
# Whitelist of contenttype suffixes


================================================
FILE: etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname
================================================
language_s
content_type_ss
content_type_group_ss
AEB Bracket Value_ss
AE Setting_ss
AF Area Height_ss
AF Area Width_ss
AF Area X Positions_ss
AF Area Y Positions_ss
AF Image Height_ss
AF Image Width_ss
AF Point Count_ss
AF Point Selected_ss
AF Points in Focus_ss
Aperture Value_ss
Auto Exposure Bracketing_ss
Auto ISO_ss
Auto Rotate_ss
Base ISO_ss
Bulb Duration_ss
Camera Info Array_ss
Camera Serial Number_ss
Camera Temperature_ss
Camera Type_ss
Canon Model ID_ss
Contrast_ss
Components Configuration_ss
Compressed Bits Per Pixel_ss
Compression_ss
Color Balance Array_ss
Color Space_ss
Color Temperature_ss
Color Tone_ss
Content-Encoding_s
Continuous Drive Mode_ss
Control Mode_ss
Custom Functions_ss
Custom Rendered_ss
created_ss
Creation-Date_ss
Data BitsPerSample_ss
Data PlanarConfiguration_ss
Data Precision_ss
Data SampleFormat_ss
Data SignificantBitsPerSample_ss
date_ss
dc:format_ss
dcterms:created_ss
dcterms:modified_ss
Dimension ImageOrientation_ss
Dimension PixelAspectRatio_ss
Digital Zoom_ss
Display Aperture_ss
Easy Shooting Mode_ss
embeddedResourceType_ss
Exif Version_ss
exif:DateTimeOriginal_ss
exif:ExposureTime_ss
exif:Flash_ss
exif:FocalLength_ss
exif:FNumber_ss
Exif Image Height_ss
Exif Image Width_ss
exif:IsoSpeedRatings_ss
Exposure Bias Value_ss
Exposure Compensation_ss
Exposure Mode_ss
Exposure Time_ss
F-Number_ss
F Number_ss
File Name_ss
File Length_ss
File Modified Date_ss
File Info Array_ss
File Size_ss
Firmware Version_ss
Flash_ss
FlashPix Version_ss
Flash Activity_ss
Flash Details_ss
Flash Exposure Compensation_ss
Flash Guide Number_ss
Focal Length_ss
Flash Mode_ss
Focal Plane Resolution Unit_ss
Focal Plane X Resolution_ss
Focal Plane Y Resolution_ss
Focal Units per mm_ss
Focus Continuous_ss
Focus Distance Lower_ss
Focus Distance Upper_ss
Focus Mode_ss
Focus Type_ss
height_ss
ISO Speed Ratings_ss
IHDR_ss
Image Height_ss
Image Number_ss
Image Size_ss
Image Width_ss
Image Type_ss
Interoperability Index_ss
Interoperability Version_ss
Iso_ss
Last-Modified_ss
Last-Save-Date_ss
Lens Type_ss
Long Focal Length_ss
Macro Mode_ss
Manual Flash Output_ss
Max Aperture_ss
Max Aperture Value_ss
Measured Color Array_ss
Measured EV_ss
meta:creation-date_ss
meta:save-date_ss
Metering Mode_ss
Min Aperture_ss
modified_ss
ND Filter_ss
Number of Components_ss
Number of Tables_ss
Orientation_ss
Optical Zoom Code_ss
pdf:PDFVersion_ss
pdf:docinfo:created_ss
pdf:docinfo:creator_tool_ss
pdf:docinfo:modified_ss
pdf:docinfo:producer_ss
pdf:encrypted_ss
pdf:charsPerPage_ss
pdf:unmappedUnicodeCharsPerPage_ss
Photo Effect_ss
producer_ss
Record Mode_ss
Related Image Height_ss
Related Image Width_ss
Resolution Unit_ss
Resolution Units_ss
Saturation_ss
sBIT sBIT_RGBAlpha_ss
Scene Capture Type_ss
Sensing Method_ss
Sequence Number_ss
Serial Number Format_ss
Slow Shutter_ss
Sharpness_ss
Short Focal Length_ss
Shutter Speed Value_ss
Spot Metering Mode_ss
SRAW Quality_ss
Target Aperture_ss
Target Exposure Time_ss
tiff:BitsPerSample_ss
tiff:ImageLength_ss
tiff:ImageWidth_ss
tiff:Make_ss
tiff:Model_ss
tiff:Orientation_ss
tiff:ResolutionUnit_ss
tiff:XResolution_ss
tiff:YResolution_ss
Thumbnail Height Pixels_ss
Thumbnail Width Pixels_ss
Thumbnail Image Valid Area_ss
Thumbnail Length_ss
Thumbnail Offset_ss
Transparency Alpha_ss
Valid AF Point Count_ss
width_ss
X-Parsed-By_ss
X-TIKA:parse_time_millis_ss
X Resolution_ss
xmpTPg:NPages_ss
xmp:CreatorTool_ss
YCbCr Positioning_ss
Y Resolution_ss
Zoom Source Width_ss
Zoom Target Width_ss


================================================
FILE: etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-prefix
================================================
etl_
X-TIKA
AF Point
Chroma 
Compression 
Component 
Date/Time
Measured EV 
Primary AF Point 
Self Timer 
Unknown Camera Setting 
Unknown tag 
White Balance
access_permission:


================================================
FILE: etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-suffix
================================================
_i
_is
_l
_ls
_b
_bs
_f
_fs
_d
_ds
_f
_fs
_dt
_dts
_uri_ss
_matchtext_ss


================================================
FILE: etc/opensemanticsearch/blacklist/whitelist-url
================================================
# Whitelist of URLs


================================================
FILE: etc/opensemanticsearch/blacklist/whitelist-url-prefix
================================================
# Whitelist of URL Prefixes like domains or paths


================================================
FILE: etc/opensemanticsearch/blacklist/whitelist-url-regex
================================================
# Whitelist URLs with text patterns by regular expressions (regex)


================================================
FILE: etc/opensemanticsearch/blacklist/whitelist-url-suffix
================================================
# Whitelist of URL Suffixes like file endings


================================================
FILE: etc/opensemanticsearch/connector-files
================================================
# -*- coding: utf-8 -*-

# Config for opensemanticsearch-index-file


# print Debug output
#config['verbose'] = True


# Index files again even if indexed before and modification time of file unchanged
#config['force'] = True


#
# Mapping filename to URI
#

# if the users have other path (mountpoint with other path then the servers full path)
# or protocol (http:// instead of file://)
# you can map the servers path to the users path

# default: user can access the file system, so /fullpath/filename will be mapped to file:///fullpath/filename
config['mappings'] = { "/": "file:///" }


# If documents access not via filesystem but via website (http)
# your files in /var/www/documents/ should be mapped to http://www.opensemanticsearch.org/documents/
#config['mappings'] = { "/var/www/documents/": "http://www.opensemanticsearch.org/documents/" }


#
# UI Path navigator: Strip parts of path facet
#

# The path facet is the sidebar component to navigate (sub)paths.
# If all your different directories are in one path like /documents
# or even worse the main content dirs are subdirs like /mnt/fileserver/onesubdir and /mnt/fileserver/othersubdirectory
# you might want that the user can select or navigate the subdirectories directly (which from the content perspective are main dirs)
# instead of forcing the user first navigate to ./mnt, then to ./fileserver and so on...

# this option wont change the uri (which is the base of this option and can be mapped and stripped above),
# it will only change/strip/shorten the path facet in the interactive navigation of the user interface
#config['facet_path_strip_prefix'] = [ "file:///home/", "file://" ]


================================================
FILE: etc/opensemanticsearch/connector-web
================================================
# -*- coding: utf-8 -*-

#
# Config for opensemanticsearch-index-web-crawl
#

#
# common file extensions that are not followed if they occur in links
#

config['webcrawler_deny_extensions'] = [
    # archives
    '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',

    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a', 'm4v', 'flv', 'webm',

    # office suites (commented, since we want to index office documents)
    #'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg',
    #'odp', 'pdf',

    # other
    'css', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk'
]


# Uncomment, if you do not want to exclude file extensions
# Warning: You might not want to download Gigabytes or Terabytes of archives, videos, CD-ROM/DVD ISOs and so on...

#config['webcrawler_deny_extensions'] = []


================================================
FILE: etc/opensemanticsearch/enhancer-rdf
================================================
# -*- coding: utf-8 -*-

# Config for RDF metadata server

# URL of the meta data server (RDF)
# if set to False don't use additional metadata from server (like tags or annotations)
#
# Templates:
# [uri] for URL of annotated page
# [uri_md5] for MD5 Sum of the URL

config['metaserver'] = False

# Use Drupal as meta server
#config['metaserver'] = [ 'http://localhost/drupal/rdf?uri=[uri]' ]

# Use Semantic Mediawiki as meta server
#config['metaserver'] = [ 'http://localhost/mediawiki/index.php/Special:ExportRDF?xmlmime=rdf&page=[uri_md5]' ]

# Use tagger app as meta server
config['metaserver'] = [ 'http://localhost/search-apps/annotate/rdf?uri=[uri]' ]

# mapping of RDF properties or RDF classes to facets / columns
config['property2facet'] = {
 'http://www.wikidata.org/entity/Q5': 'person_ss',
 'http://www.wikidata.org/entity/Q43229': 'organization_ss',
 'http://www.wikidata.org/entity/Q178706': 'organization_ss',
 'http://www.wikidata.org/entity/Q18810687': 'organization_ss',
 'http://www.wikidata.org/entity/Q2221906': 'location_ss',
 'http://schema.org/Person': 'person_ss',
 'http://schema.org/Organization': 'organization_ss',
 'http://schema.org/Place': 'location_ss',
 'http://schema.org/location': 'location_ss',
 'http://schema.org/address': 'location_ss',
 'http://schema.org/keywords': 'tag_ss',
 'http://schema.org/Comment': 'comment_txt',
 'http://semantic-mediawiki.org/swivt/1.0#specialProperty_dat': 'meta_date_dts'
}


================================================
FILE: etc/opensemanticsearch/etl
================================================
# -*- coding: utf-8 -*-

#
# ETL config for connector(s)
#

# print debug messages
#config['verbose'] = True


#
# Languages for language specific index
#
# Each document is analyzed without grammar rules in the index fields like content, additionally it can be added/copied to language specific index fields/analyzers
# Document language is autodetected by default plugin enhance_detect_language_tika_server

# If index support enhanced analytics for specific languages, we can add/copy data to language specific fields/analyzers
# Set which languages are configured and shall be used in index for language specific analysis/stemming/synonyms
# Default / if not set all languages that are supported will be analyzed additionally language specific
#config['languages'] = ['en','de','fr','hu','it','pt','nl','cz','ro','ru','ar','fa']

# force to language specific analysis additional in this language(s) grammar & synonyms, even if language autodetection detects other language
#config['languages_force'] = ['en','de']


# only use language for language specific analysis which are added / uncommented later
#config['languages'] = []

# add English
#config['languages'].append('en')

# add German / Deutsch
#config['languages'].append('de')

# add French / Francais
#config['languages'].append('fr')

# add Hungarian
#config['languages'].append('hu')

# add Spanish
#config['languages'].append('es')

# add Portuguese
#config['languages'].append('pt')

# add Italian
#config['languages'].append('it')

# add Czech
#config['languages'].append('cz')

# add Dutch
#config['languages'].append('nl')

# add Romanian
#config['languages'].append('ro')

# add Russian
#config['languages'].append('ru')


#
# Index/storage
#

#
# Solr URL and port
#

config['export'] = 'export_solr'

# Solr server
config['solr'] = 'http://localhost:8983/solr/'

# Solr core
config['index'] = 'opensemanticsearch'


#
# Elastic Search
#

#config['export'] = 'export_elasticsearch'

# Index
#config['index'] = 'opensemanticsearch'


#
# Tika for text and metadata extraction
#

# Tika server (with tesseract-ocr-cache)
# default: http://localhost:9998

#config['tika_server'] = 'http://localhost:9998'

# Tika server with fake OCR cache of tesseract-ocr-cache used if OCR in later ETL tasks
# default: http://localhost:9999

#config['tika_server_fake_ocr'] = 'http://localhost:9999'


#
# Annotations
#

# add plugin for annotation/tagging/enrichment of documents
config['plugins'].append('enhance_annotations')

# set alternate URL of annotation server
#config['metadata_server'] = 'http://localhost/search-apps/annotate/json'


#
# RDF Knowledge Graph
#

# add RDF Metadata Plugin for granular import of RDF file statements to entities of knowledge graphs
config['plugins'].append('enhance_rdf')


#
# Config for OCR (automatic text recognition of text in images)
#

# Disable OCR for image files (i.e for more performance and/or because you don't need the text within images or have only photos without photographed text)
#config['ocr'] = False

# Option to disable OCR of embedded images in PDF by Tika
# so (if alternate plugin is enabled) OCR will be done only by alternate
# plugin enhance_pdf_ocr (which else works only as fallback, if Tika exceptions)
#config['ocr_pdf_tika'] = False

# Use OCR cache
config['ocr_cache'] = '/var/cache/tesseract'

# Option to disable OCR cache
#config['ocr_cache'] = None

# Do OCR for images embedded in PDF documents (i.e. designed images or scanned or photographed documents)
config['plugins'].append('enhance_pdf_ocr')

#OCR language

#If other than english you have to install package tesseract-XXX (tesseract language support) for your language
#and set ocr_lang to this value (be careful, the tesseract package for english is "eng" (not "en") german is named "deu", not "de"!)

# set OCR language to English/default
#config['ocr_lang'] = 'eng'

# set OCR language to German/Deutsch
#config['ocr_lang'] = 'deu'

# set multiple OCR languages
config['ocr_lang'] = 'eng+deu'


#
# Regex pattern for extraction
#

# Enable Regex plugin
config['plugins'].append('enhance_regex')

# Regex config for IBAN extraction
config['regex_lists'].append('/etc/opensemanticsearch/regex/iban.tsv')


#
# Email address and email domain extraction
#
config['plugins'].append('enhance_extract_email')


#
# Phone number extraction
#
config['plugins'].append('enhance_extract_phone')


#
# Config for Named Entities Recognition (NER) and Named Entity Linking (NEL)
#

# Enable Entity Linking / Normalization and dictionary based Named Entities Extraction from thesaurus and ontologies
config['plugins'].append('enhance_entity_linking')

# Enable SpaCy NER plugin
config['plugins'].append('enhance_ner_spacy')

# Spacy NER Machine learning classifier (for which language and with which/how many classes)

# Default classifier if no classifier for specific language

# disable NER for languages where no classifier defined in config['spacy_ner_classifiers']
config['spacy_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure, that all documents you index are english)
# config['spacy_ner_classifier_default'] = 'en_core_web_sm'

# Set default classifier to German (only if you are sure, that all documents you index are german)
# config['spacy_ner_classifier_default'] = 'de_core_news_sm'

# Language specific classifiers (mapping to autodetected document language to Spacy classifier / language)
#
# You have to download additional language classifiers for example english (en) or german (de) by
# python3 -m spacy download en
# python3 -m spacy download de
# ...

config['spacy_ner_classifiers'] = {
    'da': 'da_core_news_sm',
    'de': 'de_core_news_sm',
    'en': 'en_core_web_sm',
    'es': 'es_core_news_sm',
    'fr': 'fr_core_news_sm',
    'it': 'it_core_news_sm',
    'lt': 'lt_core_news_sm',
    'nb': 'nb_core_news_sm',
    'nl': 'nl_core_news_sm',
    'pl': 'pl_core_news_sm',
    'pt': 'pt_core_news_sm',
    'ro': 'ro_core_news_sm',
}


# Enable Stanford NER plugin
#config['plugins'].append('enhance_ner_stanford')

# Stanford NER Machine learning classifier (for which language and with how many classes, which need more computing time)

# Default classifier if no classifier for specific language

# disable NER for languages where no classifier defined in config['stanford_ner_classifiers']
config['stanford_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure, that all documents you index are english)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz'

# Set default classifier to German (only if you are sure, that all documents you index are german)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz'

# Language specific classifiers (mapping to autodetected document language)
# Before you have to download additional language classifiers to the configured path
config['stanford_ner_classifiers'] = {
    'en': '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    'es': '/usr/share/java/stanford-ner/classifiers/spanish.ancora.distsim.s512.crf.ser.gz',
    'de': '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz',
}

# If Stanford NER JAR not in standard path
config['stanford_ner_path_to_jar'] = "/usr/share/java/stanford-ner/stanford-ner.jar"

# Stanford NER Java options like RAM settings
config['stanford_ner_java_options'] = '-mx1000m'


#
# Law clauses extraction
#

config['plugins'].append('enhance_extract_law')


#
# Money extraction
#

config['plugins'].append('enhance_extract_money')


#
# Neo4j graph database
#

# exports named entities and relations to Neo4j graph database

# Enable plugin to export entities and connections to Neo4j graph database
#config['plugins'].append('export_neo4j')

# Neo4j server
#config['neo4j_host'] = 'localhost'

# Username & password
#config['neo4j_user'] = 'neo4j'
#config['neo4j_password'] = 'neo4j'


================================================
FILE: etc/opensemanticsearch/facets
================================================
# Warning: Do not edit here!

# This config file will be overwritten
# by web admin user interface after config changes
# and on initialization by /var/lib/opensemanticsearch/manage.py entities

#
# Default facet config if no facets are configured
#
config['facets'] = {
    'author_ss': {'label': 'Author(s)', 'uri': 'http://schema.org/Author', 'facet_limit': '10', 'snippets_limit': '10'},
    'tag_ss': {'label': 'Tags', 'uri': 'http://schema.org/keywords', 'facet_limit': '10', 'snippets_limit': '10'},
    'annotation_tag_ss': {'label': 'Tags (Hypothesis)', 'uri': 'http://schema.org/keywords', 'facet_limit': '10', 'snippets_limit': '10'},
    'person_ss': {'label': 'Persons', 'uri': 'http://schema.org/Person', 'facet_limit': '10', 'snippets_limit': '10'},
    'organization_ss': {'label': 'Organizations', 'uri': 'http://schema.org/Organization', 'facet_limit': '10', 'snippets_limit': '10'},
    'location_ss': {'label': 'Locations', 'uri': 'http://schema.org/Place', 'facet_limit': '10', 'snippets_limit': '10'},
    'language_s': {'label': 'Language', 'uri': 'http://schema.org/inLanguage', 'facet_limit': '10', 'snippets_limit': '10'},
    'email_ss': {'label': 'Email', 'uri': 'http://schema.org/email', 'facet_limit': '10', 'snippets_limit': '10'},
    'Message-From_ss': {'label': 'Message from', 'uri': 'http://schema.org/sender', 'facet_limit': '10', 'snippets_limit': '10'},
    'Message-To_ss': {'label': 'Message to', 'uri': 'http://schema.org/toRecipient', 'facet_limit': '10', 'snippets_limit': '10'},
    'Message-CC_ss': {'label': 'Message CC', 'uri': 'http://schema.org/ccRecipient', 'facet_limit': '10', 'snippets_limit': '10'},
    'Message-BCC_ss': {'label': 'Message BCC', 'uri': 'http://schema.org/bccRecipient', 'facet_limit': '10', 'snippets_limit': '10'},
    'hashtag_ss': {'label': 'Hashtags', 'uri': 'http://schema.org/keywords', 'facet_limit': '10', 'snippets_limit': '10'},
    'email_domain_ss': {'label': 'Email domain', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'phone_normalized_ss': {'label': 'Phone numbers', 'uri': 'https://schema.org/telephone', 'facet_limit': '10', 'snippets_limit': '10'},
    'phone_ss': {'label': 'Phone numbers', 'uri': 'https://schema.org/telephone', 'facet_limit': '10', 'snippets_limit': '10'},
    'money_ss': {'label': 'Money', 'uri': 'http://schema.org/MonetaryAmount', 'facet_limit': '10', 'snippets_limit': '10'},
    'iban_ss': {'label': 'IBAN', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'law_clause_ss': {'label': 'Law clause', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'law_code_ss': {'label': 'Law code', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'law_code_clause_ss': {'label': 'Law code clause', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'filename_extension_s': {'label': 'Filename extension', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'content_type_group_ss': {'label': 'Content type group', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'content_type_ss': {'label': 'Content type', 'uri': '', 'facet_limit': '10', 'snippets_limit': '10'},
    'law_codes_rdf_ss': {'label': '', 'uri': '', 'facet_limit': '0', 'snippets_limit': '0'},
}


================================================
FILE: etc/opensemanticsearch/filemonitoring/files
================================================


================================================
FILE: etc/opensemanticsearch/ocr/dictionary.txt
================================================


================================================
FILE: etc/opensemanticsearch/regex/email.tsv
================================================
[\w\.-]+@[\w\.-]+	email_ss


================================================
FILE: etc/opensemanticsearch/regex/iban.tsv
================================================
\b[a-zA-Z]{2}(?: ?)[0-9]{2}(?: ?)[a-zA-Z0-9]{4}(?: ?)[0-9]{7}(?: ?)([a-zA-Z0-9]?){0,16}\b	iban_ss


================================================
FILE: etc/opensemanticsearch/regex/phone.tsv
================================================
[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]	phone_ss


================================================
FILE: etc/opensemanticsearch/task_priorities
================================================
# Priorities of document processing in task queue

# The higher the additional priority is, the earlier the document will be processed by task queue.


#
# Priorities in task queue by filename extension
#

# the higher the additional priority, the earlier file with this file name extension will be processed
# the lower the additional priority, the later files with this file name extension will be processed

config['priorities_filename_extension] = {

  '.pdf': 5,
  '.doc': 5,
  '.docx': 5,
  '.xls': 5,
  '.xlsx': 5,
  '.odp': 5,
  '.ppt': 5,
  '.pptx': 5,
  '.eml': 5,
  '.pst': 4,
  '.csv': 4,
  '.tsv': 4,
  '.txt': 4,
  '.htm': 3,
  '.html': 3,
  '.md': 3,
  '.jpg': 1,
  '.jpeg': 1,
  '.gif': 1,
  '.png': 1,
  '.tif': 1,
  '.mp3': 1,
  '.mp4': 1,
  '.wav': 1,
  '.ini': -3,
  '.bat': -4,
  '.apk': -5,
  '.bin': -5,
  '.com': -5,
  '.deb': -5,
  '.exe': -5,
  '.msi': -5,
  '.php': -5,
  '.cache': -5,
  '.h': -5,
  '.pl': -5,
  '.py': -5,
  '.pyc': -5,
  '.js': -5,
  '.css': -5,
  '.ova': -5,
  '.iso': -5,

}


#
# Priorities on parts of filenames
#

# If a configures string is part of the filename, additional priority is set

config['priorities_filename] = {

  'corrupt': 5,
  'illegal': 5,
  'important': 5,
  'relevant': 5,
  'problem': 5,
  'urgent': 5,
  'passwor': 5,
  'account': 4,
  'agreement': 4,
  'bank': 4,
  'complian': 4,
  'cost': 4,
  'contract': 4,
  'legal': 4,
  'treaty': 4,

}


================================================
FILE: etc/systemd/system/opensemanticetl-filemonitoring.service
================================================
[Unit]
Description=Open Semantic ETL filemonitoring
After=network.target

[Service]
Type=simple
User=opensemanticetl
ExecStart=/usr/bin/opensemanticsearch-filemonitoring --fromfile /etc/opensemanticsearch/filemonitoring/files
Restart=always

[Install]
WantedBy=multi-user.target


================================================
FILE: etc/systemd/system/opensemanticetl.service
================================================
[Unit]
Description=Open Semantic ETL
After=network.target

[Service]
Type=simple
User=opensemanticetl
Environment=OMP_THREAD_LIMIT=1
ExecStart=/usr/bin/etl_tasks
Restart=always

[Install]
WantedBy=multi-user.target


================================================
FILE: src/opensemanticetl/__init__.py
================================================


================================================
FILE: src/opensemanticetl/clean_title.py
================================================
import sys

# Replace empty title with useful info from other fields for better usability


class clean_title(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        #
        # if no title but subject (i.e. emails), use subject as document / result title
        #

        try:
            # if no field title exists, but field subject, use it
            if not 'title_txt' in data:
                if 'subject_ss' in data:
                    data['title_txt'] = data['subject_ss']

            else:
                # if title empty and field subject exists, use subjects value
                if not data['title_txt']:
                    if 'subject_ss' in data:
                        if data['subject_ss']:
                            data['title_txt'] = data['subject_ss']

        except:
            sys.stderr.write(
                "Error while trying to clean empty title with subject")

        # if no title yet, use the filename part of URI
        try:
            # if no field title exists, but field subject, use it
            if not 'title_txt' in data:

                # get filename from URI
                filename = parameters['id'].split('/')[-1]

                data['title_txt'] = filename

        except:
            sys.stderr.write(
                "Error while trying to clean empty title with filename")

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_annotations.py
================================================
import os
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

import etl_plugin_core


# Get tags and annotations from annotation server
class enhance_annotations(etl_plugin_core.Plugin):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        # get parameters
        docid = parameters['id']

        if os.getenv('OPEN_SEMANTIC_ETL_METADATA_SERVER'):
            server = os.getenv('OPEN_SEMANTIC_ETL_METADATA_SERVER')
        elif 'metadata_server' in parameters:
            server = parameters['metadata_server']
        else:
            server = 'http://localhost/search-apps/annotate/json'

        adapter = HTTPAdapter(max_retries=Retry(total=10, backoff_factor=1))
        http = requests.Session()
        http.mount("https://", adapter)
        http.mount("http://", adapter)

        response = http.get(server, params={'uri': docid})
        response.raise_for_status()

        annotations = response.json()

        for facet in annotations:
            etl_plugin_core.append(data, facet, annotations[facet])

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_contenttype_group.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

#
# Map/aggregate content type to content type group
#


class enhance_contenttype_group(object):

    fieldname = 'content_type_group_ss'

    contenttype_groups = {
        'application/vnd.ms-excel': 'Spreadsheet',
        'application/vnd.oasis.opendocument.spreadsheet': 'Spreadsheet',
        'application/vnd.oasis.opendocument.spreadsheet-template': 'Spreadseheet template',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': 'Spreadsheet',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.template': 'Spreadsheet template',
        'text': 'Text document',
        'application/gzip text': 'Text document',
        'application/pdf': 'Text document',
        'application/msword': 'Text document',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'Text document',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.template': 'Text document template',
        'application/vnd.oasis.opendocument.text': 'Text document',
        'application/vnd.oasis.opendocument.text-template': 'Text document template',
        'application/rtf': 'Text document',
        'application/vnd.ms-powerpoint': 'Presentation',
        'application/vnd.oasis.opendocument.presentation': 'Presentation',
        'application/vnd.oasis.opendocument.presentation-template': 'Presentation template',
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': 'Presentation',
        'application/vnd.openxmlformats-officedocument.presentationml.template': 'Presentation template',
        'image': 'Image',
        'audio': 'Audio',
        'video': 'Video',
        'application/mp4': 'Video',
        'application/x-matroska': 'Video',
        'application/vnd.etsi.asic-e+zip': 'Electronic Signature Container',
        'Knowledge graph': 'Knowledge graph',
    }

    suffix_groups = {
        '.csv': "Spreadsheet",
    }

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        content_types = []
        if 'content_type_ss' in data:
            content_types = data['content_type_ss']

        if not isinstance(content_types, list):
            content_types = [content_types]

        groups = []

        for content_type in content_types:

            # Contenttype to group
            for mapped_content_type, group in self.contenttype_groups.items():
                if content_type.startswith(mapped_content_type):
                    if not group in groups:
                        groups.append(group)

            # Suffix to group
            for suffix, group in self.suffix_groups.items():
                if parameters['id'].upper().endswith(suffix.upper()):
                    if not group in groups:
                        groups.append(group)

            if len(groups) > 0:
                data[self.fieldname] = groups

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_csv.py
================================================
import sys
import os
import csv
import urllib.request
from etl import ETL


# import each row of CSV file to index
# write CSV cols to database columns or facets


class enhance_csv(object):

    def __init__(self, verbose=False):

        self.verbose = verbose

        self.config = {}
        self.titles = False
        self.cache = False

        self.encoding = 'utf-8'

        self.delimiter = None

        self.start_row = 1

        self.title_row = 0

        self.cols = []
        self.rows = []
        self.cols_include = False
        self.rows_include = False

        self.sniff_dialect = True

        self.quotechar = None
        self.doublequote = None
        self.escapechar = None

    def read_parameters(self, parameters, data):

        if 'verbose' in parameters:
            if parameters['verbose']:
                self.verbose = True

        if 'encoding' in parameters:
            self.encoding = parameters['encoding']
        elif 'encoding_s' in data:
            self.encoding = data['encoding_s']

        if 'delimiter' in parameters:
            self.delimiter = parameters['delimiter']

        if 'cache' in parameters:
            self.cache = parameters['cache']

        if 'title_row' in parameters:
            if parameters['title_row']:
                self.title_row = parameters['title_row']

        if 'start_row' in parameters:
            if parameters['start_row']:
                self.start_row = parameters['start_row']

        if 'sniff_dialect' in parameters:
            self.sniff_dialect = parameters['sniff_dialect']

        if 'quotechar' in parameters:
            self.quotechar = parameters['quotechar']

        if 'doublequote' in parameters:
            self.doublequote = parameters['doublequote']

        if 'escapechar' in parameters:
            self.escapechar = parameters['escapechar']

        if 'rows' in parameters:
            self.rows = parameters['rows']

        if 'cols' in parameters:
            self.cols = parameters['cols']

        if 'rows_include' in parameters:
            self.rows_include = parameters['rows_include']

        if 'cols_include' in parameters:
            self.cols_include = parameters['cols_include']

    # Todo:

    #
    # If existing CSV parameter settings in CSV manager, use them
    # even if not importing within CSV manager
    #
    def add_csv_parameters_from_meta_settings(self, metaserver):
        pass
        # get csv settings for this file from csvmnager
        # json = get csvserver

        # if delimiter in json:
        #	parameters['delimiter'] = json['delimiters']

    #
    # Build CSV dialect
    #

    # Autodetect and/or construct from parameters
    def get_csv_dialect(self):

        kwargs = {}

        # automatically detect dialect
        sniffed_dialect = False

        if self.sniff_dialect:
            try:
                if self.verbose:
                    print("Opening {} for guessing CSV dialect".format(self.filename))

                csvfile = open(self.filename, newline='',
                               encoding=self.encoding)

                if self.verbose:
                    print("Starting dialect guessing")

                # sniff dialect in first 32 MB
                sniffsize = 33554432
                sniffed_dialect = csv.Sniffer().sniff(csvfile.read(sniffsize))

                if self.verbose:
                    print("Sniffed dialect: {}".format(sniffed_dialect))

            except KeyboardInterrupt:
                raise KeyboardInterrupt

            except BaseException as e:
                sys.stderr.write(
                    "Exception while CSV format autodetection for {}: {}".format(self.filename, e))

            finally:
                csvfile.close()

        if sniffed_dialect:
            kwargs['dialect'] = sniffed_dialect
        else:
            kwargs['dialect'] = 'excel'

        # Overwrite options, if set
        if self.delimiter:
            kwargs['delimiter'] = str(self.delimiter)

        if self.quotechar:
            kwargs['quotechar'] = str(self.quotechar)

        if self.escapechar:
            kwargs['escapechar'] = str(self.escapechar)

        if self.doublequote:
            kwargs['doublequote'] = self.doublequote

        return kwargs

    def set_titles(self, row):

        self.titles = []
        colnumber = 0

        for col in row:

            colnumber += 1

            self.titles.append(col)

        return self.titles

    def export_row_data_to_index(self, data, rownumber):

        parameters = self.config.copy()

        # todo: all content plugins configurated, not only this one
        parameters['plugins'] = [
            'enhance_path',
            'enhance_entity_linking',
            'enhance_multilingual',
        ]

        etl = ETL()

        try:

            etl.process(parameters=parameters, data=data)

        # if exception because user interrupted by keyboard, respect this and abbort
        except KeyboardInterrupt:
            raise KeyboardInterrupt
        except BaseException as e:
            sys.stderr.write(
                "Exception adding CSV row {} : {}".format(rownumber, e))

            if 'raise_pluginexception' in self.config:
                if self.config['raise_pluginexception']:
                    raise e

    def import_row(self, row, rownumber, docid):

        colnumber = 0

        data = {}

        data['content_type_ss'] = "CSV row"

        data['container_s'] = docid

        data['page_i'] = str(rownumber)

        data['id'] = docid + '#' + str(rownumber)

        for col in row:

            colnumber += 1

            exclude_column = False

            if self.cols_include:
                if not colnumber in self.cols:
                    exclude_column = True
            else:
                if colnumber in self.cols:
                    exclude_column = True

            if not exclude_column:

                if self.titles and len(self.titles) >= colnumber:
                    fieldname = self.titles[colnumber - 1] + "_t"
                else:
                    fieldname = 'column_' + str(colnumber).zfill(2) + "_t"

                data[fieldname] = col

                # if number, save as float value, too
                try:
                    if self.titles and len(self.titles) >= colnumber:
                        fieldname = self.titles[colnumber - 1] + "_f"
                    else:
                        fieldname = 'column_' + str(colnumber).zfill(2) + "_f"
                    data[fieldname] = float(col)
                except ValueError:
                    pass

                self.export_row_data_to_index(data=data, rownumber=rownumber)

        return colnumber

    #
    # read parameters, analyze csv dialect and import row by row
    #

    def enhance_csv(self, parameters, data):

        self.config = parameters.copy()

        docid = parameters['id']

        #
        # Read parameters
        #

        self.read_parameters(parameters, data)

        if 'csvmanager' in parameters:
            self.read_csv_parameters_from_meta_settings(
                metaserver=parameters['csvmanager'], docid=docid)

        # Download, if not a file(name) yet but URI reference

        # todo: move to csv manager or downloader plugin that in that case should use etl_web
        if 'filename' in parameters:

            is_tempfile = False

            self.filename = parameters['filename']

            # if exist delete protocoll prefix file://
            if self.filename.startswith("file://"):
                self.filename = self.filename.replace("file://", '', 1)

        else:

            # Download url to an tempfile
            is_tempfile = True
            self.filename, headers = urllib.request.urlretrieve(self.filename)

        #
        # Get CSV dialect parameters
        #

        dialect_kwargs = self.get_csv_dialect()

        if self.verbose:
            print("Opening CSV file with Encoding {} and dialect {}".format(
                self.encoding, dialect_kwargs))

        #
        # Open and read CSV
        #

        csvfile = open(self.filename, newline='', encoding=self.encoding)

        reader = csv.reader(csvfile, **dialect_kwargs)

        # increase limits to maximum, since there are often text fields with longer texts
        csv.field_size_limit(sys.maxsize)

        rownumber = 0

        #
        # Read CSV row by row
        #

        for row in reader:

            rownumber += 1

            #
            # If title row, read column titles
            #
            if rownumber == self.title_row:

                if self.verbose:
                    print("Importing Titles from row {}".format(self.title_row))

                self.set_titles(row)

            #
            # Import data row
            #
            if rownumber >= self.start_row:

                exclude_row = False

                if self.rows_include:
                    if not rownumber in self.rows:
                        exclude_row = True
                else:
                    if rownumber in self.rows:
                        exclude_row = True

                if exclude_row:
                    if self.verbose:
                        print("Excluding row {}".format(rownumber))
                else:

                    if self.verbose:
                        print("Importing row {}".format(rownumber))

                    count_columns = self.import_row(
                        row, rownumber=rownumber, docid=docid)

        #
        # delete if downloaded tempfile
        #
        if not self.cache:
            if is_tempfile:
                os.remove(self.filename)

        #
        # Print stats
        #

        if self.verbose:
            print("Rows: " + str(rownumber))
            print("Cols: " + str(count_columns))

        return rownumber

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        docid = parameters['id']

        # if CSV (file suffix is .csv), enhance it (import row by row)
        if docid.lower().endswith('.csv') or docid.lower().endswith('.tsv') or docid.lower().endswith('.tab'):
            self.enhance_csv(parameters, data)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_detect_language_tika_server.py
================================================
import os
import sys
import time
import requests

# Extract text from filename


class enhance_detect_language_tika_server(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        if os.getenv('OPEN_SEMANTIC_ETL_TIKA_SERVER'):
            tika_server = os.getenv('OPEN_SEMANTIC_ETL_TIKA_SERVER')
        elif 'tika_server' in parameters:
            tika_server = parameters['tika_server']
        else:
            tika_server = 'http://localhost:9998'


        uri = tika_server + '/language/string'

        analyse_fields = ['title_txt', 'content_txt',
                          'description_txt', 'ocr_t', 'ocr_descew_t']

        text = ''
        for field in analyse_fields:
            if field in data:
                text = "{}{}\n".format(text, data[field])

        if verbose:
            print("Calling Tika server for language detection from {}".format(uri))

        retries = 0
        retrytime = 1
        # wait time until next retry will be doubled until reaching maximum of 120 seconds (2 minutes) until next retry
        retrytime_max = 120
        no_connection = True

        while no_connection:
            try:
                if retries > 0:
                    print(
                        'Retrying to connect to Tika server in {} second(s).'.format(retrytime))
                    time.sleep(retrytime)
                    retrytime = retrytime * 2
                    if retrytime > retrytime_max:
                        retrytime = retrytime_max

                r = requests.put(uri, data=text.encode('utf-8'))

                no_connection = False

            except requests.exceptions.ConnectionError as e:
                retries += 1
                sys.stderr.write(
                    "Connection to Tika server (will retry in {} seconds) failed. Exception: {}\n".format(retrytime, e))

        language = r.content.decode('utf-8')

        if verbose:
            print("Detected language: {}".format(language))

        data['language_s'] = language

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_entity_linking.py
================================================
#
# Named Entity Extraction by Open Semantic Entity Search API dictionary
#

import requests
import sys
import time

from entity_linking.entity_linker import Entity_Linker
import etl
import etl_plugin_core


#
# split a taxonomy entry to separated index fields
#
def taxonomy2fields(taxonomy, field, separator="\t", subfields_suffix="_ss"):

    result = {}

    # if not multivalued field, convert to used list/array strucutre
    if not isinstance(taxonomy, list):
        taxonomy = [taxonomy]

    for taxonomy_entry in taxonomy:

        i = 0
        path = ''
        for taxonomy_entry_part in taxonomy_entry.split(separator):

            taxonomy_fieldname = field + '_taxonomy' + str(i) + subfields_suffix

            if not taxonomy_fieldname in result:
                result[taxonomy_fieldname] = []

            if len(path) > 0:
                path += separator

            path += taxonomy_entry_part

            result[taxonomy_fieldname].append(path)

            i += 1

    return result


class enhance_entity_linking(etl_plugin_core.Plugin):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        entity_linking_taggers = ['all_labels_ss_tag']
        if 'entity_linking_taggers' in parameters:
            entity_linking_taggers = parameters['entity_linking_taggers']

        # add taggers for stemming
        entity_linking_taggers_document_language_dependent = {}
        if 'entity_linking_taggers_document_language_dependent' in parameters:
            entity_linking_taggers_document_language_dependent = parameters[
                'entity_linking_taggers_document_language_dependent']

        if 'language_s' in data:
            # is a language specific tagger there for the detected language?
            if data['language_s'] in entity_linking_taggers_document_language_dependent:
                for entity_linking_tagger in entity_linking_taggers_document_language_dependent[data['language_s']]:
                    if not entity_linking_tagger in entity_linking_taggers:
                        entity_linking_taggers.append(entity_linking_tagger)

        openrefine_server = False
        if 'openrefine_server' in parameters:
            openrefine_server = parameters['openrefine_server']

        taxonomy_fields = ['skos_broader_taxonomy_prefLabel_ss']

        # collect/copy to be analyzed text from all fields
        text = etl_plugin_core.get_text(data=data)

        # tag all entities (by different taggers for different analyzers/stemmers)
        for entity_linking_tagger in entity_linking_taggers:

            results = {}

            retries = 0
            retrytime = 1
            # wait time until next retry will be doubled until reaching maximum of 120 seconds (2 minutes) until next retry
            retrytime_max = 120
            no_connection = True

            while no_connection:
                try:
                    if retries > 0:
                        print(
                            'Retrying to connect to Solr tagger in {} second(s).'.format(retrytime))
                        time.sleep(retrytime)
                        retrytime = retrytime * 2
                        if retrytime > retrytime_max:
                            retrytime = retrytime_max

                    # call REST API
                    if openrefine_server:
                        # use REST-API on (remote) HTTP server
                        params = {'text': text}
                        r = requests.post(openrefine_server, params=params)
                        # if bad status code, raise exception
                        r.raise_for_status()

                        results = r.json()

                    else:
                        # use local Python library
                        linker = Entity_Linker()
                        linker.verbose = verbose

                        results = linker.entities(text=text, taggers=[
                                                  entity_linking_tagger], additional_result_fields=taxonomy_fields)

                    no_connection = False

                except KeyboardInterrupt:
                    raise KeyboardInterrupt

                except requests.exceptions.ConnectionError as e:

                    retries += 1

                    if openrefine_server:
                        sys.stderr.write(
                            "Connection to Openrefine server failed (will retry in {} seconds). Exception: {}\n".format(retrytime, e))
                    else:
                        sys.stderr.write(
                            "Connection to Solr text tagger failed (will retry in {} seconds). Exception: {}\n".format(retrytime, e))

                except requests.exceptions.HTTPError as e:
                    if e.response.status_code == 503:

                        retries += 1

                        if openrefine_server:
                            sys.stderr.write(
                                "Openrefine server temporary unavailable (HTTP status code 503). Will retry in {} seconds). Exception: {}\n".format(retrytime, e))
                        else:
                            sys.stderr.write(
                                "Solr temporary unavailable (HTTP status code 503). Will retry in {} seconds). Exception: {}\n".format(retrytime, e))

                    elif e.response.status_code == 400:
                        no_connection = False

                        # if error because of empty entity index for that tagger because no entities imported yet, no error message / index as fail
                        empty_entity_index = False
                        try:
                            errorstatus = e.response.json()
                            if errorstatus['error']['msg'] == 'field ' + entity_linking_tagger + ' has no indexed data':
                                empty_entity_index = True
                        except:
                            pass

                        if not empty_entity_index:
                            etl.error_message(
                                docid=parameters['id'], data=data, plugin='enhance_entity_linking', e=e)

                    else:
                        no_connection = False
                        etl.error_message(
                            docid=parameters['id'], data=data, plugin='enhance_entity_linking', e=e)

                except BaseException as e:
                    no_connection = False
                    etl.error_message(
                        docid=parameters['id'], data=data, plugin='enhance_entity_linking', e=e)

            if verbose:
                print("Named Entity Linking by Tagger {}: {}".format(
                    entity_linking_tagger, results))

            # write entities from result to document facets
            for match in results:
                for candidate in results[match]['result']:
                    if candidate['match']:
                        for facet in candidate['type']:

                            # use different facet for fuzzy/stemmed matches
                            if not entity_linking_tagger == 'all_labels_ss_tag':
                                # do not use another different facet if same stemmer but forced / not document language dependent
                                entity_linking_tagger_withoutforceoption = entity_linking_tagger.replace(
                                    '_stemming_force_', '_stemming_')
                                facet = facet + entity_linking_tagger_withoutforceoption + '_ss'

                            etl_plugin_core.append(data, facet, candidate['name'])
                            etl_plugin_core.append(data, facet + '_uri_ss',
                                       candidate['id'])
                            etl_plugin_core.append(data, facet + '_preflabel_and_uri_ss',
                                       candidate['name'] + ' <' + candidate['id'] + '>')

                            if 'matchtext' in candidate:
                                for matchtext in candidate['matchtext']:
                                    etl_plugin_core.append(
                                        data, facet + '_matchtext_ss', candidate['id'] + "\t" + matchtext)

                            for taxonomy_field in taxonomy_fields:
                                if taxonomy_field in candidate:
                                    separated_taxonomy_fields = taxonomy2fields(
                                        taxonomy=candidate[taxonomy_field], field=facet)
                                    for separated_taxonomy_field in separated_taxonomy_fields:
                                        etl_plugin_core.append(
                                            data, separated_taxonomy_field, separated_taxonomy_fields[separated_taxonomy_field])

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_extract_email.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import etl_plugin_core

#
# extract email addresses
#

class enhance_extract_email(object):
    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        # collect/copy to be analyzed text from all fields
        text = etl_plugin_core.get_text(data=data)
            

        for match in re.finditer('[\w\.-]+@[\w\.-]+', text, re.IGNORECASE):
            value = match.group(0)
            etl_plugin_core.append(data, 'email_ss', value)


        # if extracted email addresses from data, do further analysis for separated specialized facets
        if 'email_ss' in data:

            # extract email adresses of sender (from)
            for match in re.finditer('From: (.* )?([\w\.-]+@[\w\.-]+)', text, re.IGNORECASE):
                value = match.group(2)
                etl_plugin_core.append(data, 'Message-From_ss', value)

            # extract email adresses (to)
            for match in re.finditer('To: (.* )?([\w\.-]+@[\w\.-]+)', text, re.IGNORECASE):
                value = match.group(2)
                etl_plugin_core.append(data, 'Message-To_ss', value)

            # extract the domain part from all emailadresses to facet email domains
            data['email_domain_ss'] = []
            emails = data['email_ss']
            if not isinstance(emails, list):
                emails = [emails]

            for email in emails:
                domain = email.split('@')[1]
                etl_plugin_core.append(data, 'email_domain_ss', domain)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_extract_hashtags.py
================================================
import etl_plugin_core

# Extract text from filename
class enhance_extract_hashtags(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        minimallenght = 3

        # collect/copy to be analyzed text from all fields
        text = etl_plugin_core.get_text(data=data)

        data['hashtag_ss'] = [word for word in text.split() if (
            word.startswith("#") and len(word) > minimallenght)]

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_extract_law.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import etl_plugin_core


#
# get taxonomy for aggregated facets / filters
#

# example: '§ 153 Abs. 1 Satz 2' -> ['§ 153', '§ 153 Absatz 1', '§ 153 Absatz 1 Satz 2']

# todo:

def get_taxonomy(law_clause, law_code = None):

    law_clauses = [law_clause]
    
    return law_clauses


#1.a
#1(2)
#1 (2)


#
# extract law codes
#

class enhance_extract_law(etl_plugin_core.Plugin):
    
    def process(self, parameters=None, data=None):
        
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        clause_prefixes = [
            '§',
            'Article',
            'Artikel',
            'Art',
            'Section',
            'Sec',
        ]

        clause_subsections = [
            'Abschnitt',
            'Absatz',
            'Abs',
            'Sentence',
            'Satz',
            'S',
            'Halbsatz',
            'Number',
            'Nummer',
            'Nr',
            'Buchstabe',
        ]

        text = etl_plugin_core.get_text(data)


        clauses = []

        rule = '(' + '|'.join(clause_prefixes) + ')\W*((\d+\W\w(\W|\b))|(\d+\w?))(\W?(' + '|'.join(clause_subsections) + ')\W*(\d+\w?|\w(\W|\b)))*'
        for match in re.finditer(rule, text, re.IGNORECASE):
            clause = match.group(0)

            clause = clause.strip()

            clauses.append(clause)

            # if "§123" normalize to "§ 123"
            if clause[0] == '§' and not clause[1] == ' ':
                clause = '§ ' + clause[1:]

            etl_plugin_core.append(data, 'law_clause_ss', clause)

        code_matchtexts = etl_plugin_core.get_all_matchtexts(data.get('law_code_ss_matchtext_ss', []))
        code_matchtexts_with_clause = []

        preflabels = {}
        if 'law_code_ss_preflabel_and_uri_ss' in data:
            preflabels = etl_plugin_core.get_preflabels(data['law_code_ss_preflabel_and_uri_ss'])

        if len(clauses)>0 and len(code_matchtexts)>0:

            text = text.replace("\n", " ")

            for code_match_id in code_matchtexts:

                #get only matchtext (without ID/URI of matching entity)
                for code_matchtext in code_matchtexts[code_match_id]:
    
                    for clause in clauses:
                        if clause + " " + code_matchtext in text or code_matchtext + " " + clause in text:
                            
                            code_matchtexts_with_clause.append(code_matchtext)
                            
                            # if "§123" normalize to "§ 123"
                            if clause[0] == '§' and not clause[1] == ' ':
                                clause = '§ ' + clause[1:]
    
                            law_code_preflabel = code_match_id
                            if code_match_id in preflabels:
                                law_code_clause_normalized = clause + " " + preflabels[code_match_id]
                            else:
                                law_code_clause_normalized = clause + " " + code_match_id
     
                            etl_plugin_core.append(data, 'law_code_clause_ss', law_code_clause_normalized)

        if len(code_matchtexts)>0:
            
            blacklist = []
            listfile = open('/etc/opensemanticsearch/blacklist/enhance_extract_law/blacklist-lawcode-if-no-clause')
            for line in listfile:
                line = line.strip()
                if line and not line.startswith("#"):
                    blacklist.append(line)
            listfile.close()

            if not isinstance(data['law_code_ss_matchtext_ss'], list):
                data['law_code_ss_matchtext_ss'] = [data['law_code_ss_matchtext_ss']]

            blacklisted_code_ids = []
            for code_match_id in code_matchtexts:
                for code_matchtext in code_matchtexts[code_match_id]:
                    if code_matchtext in blacklist:
                        if code_matchtext not in code_matchtexts_with_clause:
                            blacklisted_code_ids.append(code_match_id)
                            data['law_code_ss_matchtext_ss'].remove(code_match_id + "\t" + code_matchtext)

            code_matchtexts = etl_plugin_core.get_all_matchtexts(data.get('law_code_ss_matchtext_ss', []))

            if not isinstance(data['law_code_ss'], list):
                data['law_code_ss'] = [data['law_code_ss']]
            if not isinstance(data['law_code_ss_preflabel_and_uri_ss'], list):
                data['law_code_ss_preflabel_and_uri_ss'] = [data['law_code_ss_preflabel_and_uri_ss']]

            for blacklisted_code_id in blacklisted_code_ids:
                if blacklisted_code_id not in code_matchtexts:
                    data['law_code_ss'].remove(preflabels[blacklisted_code_id])
                    data['law_code_ss_preflabel_and_uri_ss'].remove(preflabels[blacklisted_code_id] + ' <' + blacklisted_code_id + '>')

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_extract_money.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import etl_plugin_core
from numerizer import numerize

#
# extract money
#

class enhance_extract_money(etl_plugin_core.Plugin):

    # todo: all other currency signs from Wikidata
    currency_signs = ['$', '€']

    def process(self, parameters=None, data=None):

        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        moneys = set(data.get('money_ss', []))

        text = etl_plugin_core.get_text(data)
        text = text.replace("\n", " ")

        # convert written numbers like "one" and "two million" to integer like "1" and "2000000"
        if 'language_s' in data:
            if data['language_s'] == "en":
                text = numerize(text)

        currencies_escaped = []

        # currency signs
        for currency in self.currency_signs:
            currencies_escaped.append(re.escape(currency))

        # currency labels
        matched_currency_labels = etl_plugin_core.get_all_matchtexts(data.get('currency_ss_matchtext_ss', []))
        for currency_id in matched_currency_labels:
            #get only matchtext (without ID/URI of matching entity)
            for matchtext in matched_currency_labels[currency_id]:
                currencies_escaped.append(re.escape(matchtext))

        regex_part_number = '\d+((\.|\,)\d+)*'
        regex_part_currencies = '(' + '|'.join(currencies_escaped) + ')'

        rule = regex_part_number + '\s?' + regex_part_currencies
        for match in re.finditer(rule, text, re.IGNORECASE):
            moneys.add(match.group(0))

        rule = regex_part_currencies + '\s?' + regex_part_number
        for match in re.finditer(rule, text, re.IGNORECASE):
            moneys.add(match.group(0))

        data['money_ss'] = list(moneys)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_extract_phone.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import etl_plugin_core

#
# normalize phone number (remove all non-numeric chars except leading +)
# so same number is used for aggregations/facet filters, even if written in different formats (with or without space(s) and hyphen(s))
#

def normalize_phonenumber(phone):
    chars = ['+','0','1','2','3','4','5','6','7','8','9']
    phone_normalized = ''
    for char in phone:
        if char in chars:
            # only first +
            if char == '+':
                if not phone_normalized:
                    phone_normalized = '+'
            else:
                phone_normalized += char

    return phone_normalized


#
# extract phone number(s)
#

class enhance_extract_phone(object):
    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        # collect/copy to be analyzed text from all fields
        text = etl_plugin_core.get_text(data=data)

        for match in re.finditer('[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]', text, re.IGNORECASE):
            value = match.group(0)
            etl_plugin_core.append(data, 'phone_ss', value)


        # if extracted phone number(s), normalize to format that can be used for aggregation/filters

        if 'phone_ss' in data:

            phones = data['phone_ss']
            if not isinstance(phones, list):
                phones = [phones]

            for phone in phones:
                phone_normalized = normalize_phonenumber(phone)
                etl_plugin_core.append(data, 'phone_normalized_ss', phone_normalized)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_extract_text_tika_server.py
================================================
import os
import tempfile
import sys
import time
import requests


def in_parsers(parser, parsers):

    for value in parsers:
        if isinstance(value, list):
            for subvalue in value:
                if subvalue == parser:
                    return True
        else:
            if value == parser:
                return True

    return False


# Extract text from file(name)
class enhance_extract_text_tika_server(object):

    mapping = {
        'Content-Type': 'content_type_ss',
        'dc:creator': 'author_ss',
        'Content-Encoding': 'Content-Encoding_ss',
        'dc:title': 'title_txt',
        'dc:subject': 'subject_ss',
    }

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        tika_log_path = tempfile.mkdtemp(prefix="tika-python-")
        os.environ['TIKA_LOG_PATH'] = tika_log_path

        os.environ['TIKA_CLIENT_ONLY'] = 'True'

        import tika
        from tika import parser

        tika.TikaClientOnly = True

        headers = {}

        do_ocr = parameters.get('ocr', False)
        
        do_ocr_pdf_tika = parameters.get('ocr_pdf_tika', True)
        do_ocr_pdf = False
        if 'plugins' in parameters:
            if 'enhance_pdf_ocr' in parameters['plugins'] and do_ocr_pdf_tika:
                do_ocr_pdf = True

        # if only OCR for PDF enabled (enhance_pdf_ocr as fallback and OCR by tika enabled) but not OCR for image files,
        # run OCR only if file ending .pdf so disabled OCR for other file types
        if do_ocr_pdf and not do_ocr:
            contenttype = data.get('content_type_ss', None)
            if isinstance(contenttype, list):
                contenttype = contenttype[0]

            if contenttype == 'application/pdf' or filename.lower().endswith('.pdf'):
                do_ocr_pdf = True
            else:
                do_ocr_pdf = False

        if 'ocr_lang' in parameters:
            headers['X-Tika-OCRLanguage'] = parameters['ocr_lang']
        
        if do_ocr or do_ocr_pdf:

            if os.getenv('OPEN_SEMANTIC_ETL_TIKA_SERVER'):
                tika_server = os.getenv('OPEN_SEMANTIC_ETL_TIKA_SERVER')
            elif 'tika_server' in parameters:
                tika_server = parameters['tika_server']
            else:
                tika_server = 'http://localhost:9998'

            # OCR embedded images in PDF, if not disabled or has to be done by other plugin
            if do_ocr_pdf:
                headers['X-Tika-PDFextractInlineImages'] = 'true'
            else:
                headers['X-Tika-PDFextractInlineImages'] = 'false'

            # set OCR status in indexed document
            data['etl_enhance_extract_text_tika_server_ocr_enabled_b'] = True
            # OCR is enabled, so was done by this Tika call, no images left to OCR
            data['etl_count_images_yet_no_ocr_i'] = 0
        
        else:
            # OCR (yet) disabled, so use the Tika instance using the fake tesseract so we only get OCR results if in cache
            # else we get OCR status [Image (No OCR yet)] in content, so we know that there are images to OCR for later steps

            if os.getenv('OPEN_SEMANTIC_ETL_TIKA_SERVER_FAKECACHE'):
                tika_server = os.getenv('OPEN_SEMANTIC_ETL_TIKA_SERVER_FAKECACHE')
            elif 'tika_server_fake_ocr' in parameters:
                tika_server = parameters['tika_server_fake_ocr']
            else:
                tika_server = 'http://localhost:9999'

            headers['X-Tika-PDFextractInlineImages'] = 'true'

            # set OCR status in indexed document, so next stage knows that yet no OCR
            data['etl_enhance_extract_text_tika_server_ocr_enabled_b'] = False

        #
        # Parse on Apache Tika Server by python-tika
        #
        if verbose:
            print("Parsing by Tika Server on {} with additional headers {}".format(tika_server, headers))

        retries = 0
        retrytime = 1
        # wait time until next retry will be doubled until reaching maximum of 120 seconds (2 minutes) until next retry
        retrytime_max = 120
        no_connection = True

        while no_connection:
            try:
                if retries > 0:
                    print(
                        'Retrying to connect to Tika server in {} second(s).'.format(retrytime))
                    time.sleep(retrytime)
                    retrytime = retrytime * 2
                    if retrytime > retrytime_max:
                        retrytime = retrytime_max

                parsed = parser.from_file(
                    filename=filename,
                    serverEndpoint=tika_server,
                    headers=headers,
                    requestOptions={'timeout': 60000})

                no_connection = False

            except requests.exceptions.ConnectionError as e:
                retries += 1
                sys.stderr.write(
                    "Connection to Tika server (will retry in {} seconds) failed. Exception: {}\n".format(retrytime, e))

        if parsed['content']:
            data['content_txt'] = parsed['content']

        tika_exception = False
        for tika_field in parsed["metadata"]:

            # there is a field name with exceptions, so copy fieldname to failed plugins
            if 'exception' in tika_field.lower():
                tika_exception = True
                parameters['etl_tika_exception'] = True
                if 'etl_error_plugins_ss' not in data:
                    data['etl_error_plugins_ss'] = []
                data['etl_error_plugins_ss'].append(tika_field)

            # copy Tika fields to (mapped) data fields
            if tika_field in self.mapping:
                data[self.mapping[tika_field]] = parsed['metadata'][tika_field]
            else:
                data[tika_field + '_ss'] = parsed['metadata'][tika_field]

        #
        # anaylze and (re)set OCR status to prevent (re)process unnecessary tasks of later stage(s)
        #
        contenttype = data.get('content_type_ss', None)
        if isinstance(contenttype, list):
            contenttype = contenttype[0]

        ocr_status_known = False

        # file was PDF and OCR for PDF enabled, so we know status
        if do_ocr_pdf:
            ocr_status_known = True

        # all OCR cases enabled, so we know status
        if do_ocr and do_ocr_pdf:
            ocr_status_known = True

        # if no kind of OCR done now, we know status because fake tesseract wrapper
        if not do_ocr and not do_ocr_pdf:
            ocr_status_known = True
        
        # if OCR for images done but content type is PDF and OCR of PDF by Tika is disabled
        # (because using other plugin for that) we do not know status for PDF,
        # since Tika runned without inline OCR for PDF
        if do_ocr and not do_ocr_pdf:
            if not contenttype == 'application/pdf':
                ocr_status_known = True

        if ocr_status_known:
            
            # Tika made an tesseract OCR call (if OCR (yet) off, by fake Tesseract CLI wrapper)
            # so there is really something to OCR?
            if not in_parsers('org.apache.tika.parser.ocr.TesseractOCRParser', data['X-TIKA:Parsed-By_ss']):
                # since Tika did not call (fake or cached) tesseract (wrapper), nothing to OCR in this file,
    
                if verbose:
                    print('Tika OCR parser not used, so nothing to OCR in later stages, too')
                
                # so set all OCR plugin status and OCR configs to done,
                # so filter_file_not_modifield in later stage task will prevent reprocessing
                # because of only this yet not runned plugins or OCR configs
                data['etl_enhance_extract_text_tika_server_ocr_enabled_b'] = True
                data['etl_count_images_yet_no_ocr_i'] = 0
    
                if not tika_exception:
                    parameters['etl_nothing_for_ocr'] = True
                    data['etl_enhance_ocr_descew_b'] = True
                    data['etl_enhance_pdf_ocr_b'] = True
    
            else:
                # OCR parser used by Tika, so there was something to OCR
    
                # If in this case the fake tesseract wrapper could get all results from cache,
                # no additional Tika-Server run with OCR enabled needed
                # So set Tika-Server OCR status of tika-server to done
    
                if not do_ocr and 'content_txt' in data:
    
                    if verbose:
                        print("Tika OCR parser was used, so there is something to OCR")
    
                    # how many images yet not OCRd because no result from cache
                    # so we got fake OCR result "[Image (no OCR yet)]"
                    count_images_yet_no_ocr = data['content_txt'].count('[Image (no OCR yet)]')
                    data['etl_count_images_yet_no_ocr_i'] = count_images_yet_no_ocr
    
                    # got all Tika-Server Tesseract OCR results from cache,
                    # so no additional OCR tasks for later stage
                    if count_images_yet_no_ocr == 0:
                        if verbose:
                            print('But could get all OCR results in this stage from OCR cache')
                        # therefore set status like OCR related config
                        # yet runned, so on next stage filter_file_not_modified
                        # wont process document again only because of OCR
                        # (but not reset status of other plugins,
                        # since maybe additional image in changed file)
                        data['etl_enhance_extract_text_tika_server_ocr_enabled_b'] = True
                        data['etl_count_images_yet_no_ocr_i'] = 0

                        # if not a (maybe changed) PDF, set enhance_pdf_ocr to done, too,
                        # so no reprocessing because this additional plugin on later stage
                        if not contenttype == 'application/pdf':
                            data['etl_enhance_pdf_ocr_b'] = True

        tika_log_file = tika_log_path + os.path.sep + 'tika.log'
        if os.path.isfile(tika_log_file):
            os.remove(tika_log_file)

        os.rmdir(tika_log_path)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_file_mtime.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os.path
import datetime

#
# Add file modification time
#


class enhance_file_mtime(object):
    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        # get modification time from file
        file_mtime = os.path.getmtime(filename)

        # convert mtime to Lucene format
        file_mtime_masked = datetime.datetime.fromtimestamp(
            file_mtime).strftime("%Y-%m-%dT%H:%M:%SZ")

        if verbose:
            print("File modification time: {}".format(file_mtime_masked))

        data['file_modified_dt'] = file_mtime_masked

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_file_size.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os.path

#
# add file size
#


class enhance_file_size(object):
    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        # get filesize
        file_size = os.path.getsize(filename)

        if verbose:
            print("File size: {}".format(file_size))

        data['file_size_i'] = str(file_size)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_html.py
================================================
#
# Extracts text within configured HTML tags / XML tags
#

from lxml import etree


class enhance_html(object):

    def elements2data(self, element, data, path=None, recursive=True):

        if self.verbose:
            print("Extracting element {}".format(element.tag))

        if path:
            path += "/" + element.tag
        else:
            path = element.tag

        fieldname = path + '_ss'

        text = element.text

        if text:
            text = text.strip()

        if text:
            if fieldname in data:
                data[fieldname].append(text)
            else:
                data[fieldname] = [text]

        if recursive:
            for child in element:
                data = self.elements2data(
                    element=child, path=path, data=data, recursive=True)

        return data

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        self.verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                self.verbose = True

        filename = parameters['filename']

        if 'content_type_ss' in data:
            mimetype = data['content_type_ss']
        else:
            mimetype = parameters['content_type_ss']

        # if connector returns a list, use only first value (which is the only entry of the list)
        if isinstance(mimetype, list):
            mimetype = mimetype[0]

        if mimetype.startswith('application/xhtml+xml'):

            html_extract_tags = []
            if 'html_extract_tags' in parameters:
                html_extract_tags = parameters['html_extract_tags']

            html_extract_tags_and_children = []
            if 'html_extract_tags_and_children' in parameters:
                html_extract_tags_and_children = parameters['html_extract_tags_and_children']

            parser = etree.HTMLParser()

            et = etree.parse(filename, parser)

            for xpath in html_extract_tags:
                for el in et.xpath(xpath):
                    self.elements2data(element=el, data=data, recursive=False)

            for xpath in html_extract_tags_and_children:
                for el in et.xpath(xpath):
                    self.elements2data(element=el, data=data)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_mapping_id.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-


#
# Map paths or domains
#

class enhance_mapping_id(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        if 'mappings' in parameters:
            parameters['id'] = mapping(
                value=parameters['id'], mappings=parameters['mappings'])

        return parameters, data


# Change value with best/deepest mapping
def mapping(value, mappings=None):
    if mapping is None:
        mappings = {}

    max_match_len = -1

    # check all mappings for matching and use the best
    for map_from, map_to in mappings.items():

        # map from matching value?
        if value.startswith(map_from):

            # if from string longer (deeper path), this is the better matching
            match_len = len(map_from)

            if match_len > max_match_len:
                max_match_len = match_len
                best_match_map_from = map_from
                best_match_map_to = map_to

    # if there is a match, replace first occurance of value with mapping
    if max_match_len >= 0:
        value = value.replace(best_match_map_from, best_match_map_to, 1)

    return value


# Change mapped value to origin value
def mapping_reverse(value, mappings=None):
    if mapping is None:
        mappings = {}

    max_match_len = -1

    # check all mappings for matching and use the best
    for map_from, map_to in mappings.items():

        # map from matching value?
        if value.startswith(map_to):

            # if from string longer (deeper path), this is the better matching
            match_len = len(map_to)

            if match_len > max_match_len:
                max_match_len = match_len
                best_match_map_from = map_from
                best_match_map_to = map_to

    # if there is a match, replace first occurance of value with reverse mapping
    if max_match_len >= 0:
        value = value.replace(best_match_map_to, best_match_map_from, 1)

    return value


================================================
FILE: src/opensemanticetl/enhance_mimetype.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import magic


#
# Get MimeType (Which kind of file is this?)
#
class enhance_mimetype(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        mimetype = None

        m = magic.open(magic.MAGIC_MIME)
        m.load()
        mimetype = m.file(filename)
        m.close()

        if verbose:
            print("Detected MimeType: {}".format(mimetype))

        data['content_type_magic_s'] = mimetype

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_multilingual.py
================================================
#
# Multilinguality
#
# Copy content language specific dynamic fields for language specific analysis like stemming, grammar or synonyms
#
# Language has been detected before by plugin enhance_detect_language using Apache Tika / OpenNLP
#


class enhance_multilingual(object):

    verbose = False

    # languages that are defined in index schema for language specific analysis and used if autodetected as documents language
    languages = ['en', 'fr', 'de', 'es', 'hu', 'pt',
                 'nl', 'ro', 'ru', 'it', 'cz', 'ar', 'fa']
    languages_hunspell = ['hu']

    # languages for language specific analysis even if not the autodetected document language
    languages_force = []
    languages_force_hunspell = []

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        if 'verbose' in parameters:
            self.verbose = parameters['verbose']

        if 'languages' in parameters:
            self.languages = parameters['languages']

        if 'languages_hunspell' in parameters:
            self.languages_hunspell = parameters['languages_hunspell']

        if 'languages_force' in parameters:
            self.languages_force = parameters['languages_force']

        if 'languages_force_hunspell' in parameters:
            self.languages_force_hunspell = parameters['languages_force_hunspell']

        if 'languages_exclude_fields' in parameters:
            self.exclude_fields = parameters['languages_exclude_fields']

        if 'languages_exclude_fields_map' in parameters:
            self.exclude_fields_map = parameters['languages_exclude_fields_map']

        language = data.get('language_s', None)

        #
        # exclude fields like technical metadata
        #
    
        exclude_prefix = []
    
        listfile = open('/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-prefix')
        for line in listfile:
            line = line.strip()
            if line and not line.startswith("#"):
                exclude_prefix.append(line)
        listfile.close()
    
        # suffixes of non-text fields like nubers
        exclude_suffix = []
    
        listfile = open('/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-suffix')
        for line in listfile:
            line = line.strip()
            if line and not line.startswith("#"):
                exclude_suffix.append(line)
        listfile.close()
    
        # full fieldnames
        exclude_fields = []
        listfile = open('/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname')
        for line in listfile:
            line = line.strip()
            if line and not line.startswith("#"):
                exclude_fields.append(line)
        listfile.close()
    
        exclude_fields_map = {}

        language_fields = ['_text_']
        language_specific_data = {}

        # language specific analysis for recognized language of document
        # if language support of detected language in index schema
        if language in self.languages:
            language_fields.append("text_txt_" + language)

        if language in self.languages_hunspell:
            language_fields.append("text_txt_hunspell_" + language)

        # fields for language specific analysis by forced languages even if other language or false recognized language
        for language_force in self.languages_force:

            language_field = "text_txt_" + language_force

            if not language_field in language_fields:
                language_fields.append(language_field)

        for language_force in self.languages_force_hunspell:

            language_field = "text_txt_hunspell_" + language_force

            if not language_field in language_fields:
                language_fields.append(language_field)

        # copy each data field to language specific field with suffix _txt_$language
        for fieldname in data:

            exclude = False

            # do not copy excluded fields
            for exclude_field in exclude_fields:
                if fieldname == exclude_field:
                    exclude = True

            for prefix in exclude_prefix:
                if fieldname.startswith(prefix):
                    exclude = True

            for suffix in exclude_suffix:
                if fieldname.endswith(suffix):
                    exclude = True

            if not exclude and data[fieldname]:

                # copy field to default field with added suffixes for language dependent stemming/analysis
                for language_field in language_fields:

                    excluded_by_mapping = False

                    if language_field in exclude_fields_map:
                        if fieldname in exclude_fields_map[language_field]:
                            excluded_by_mapping = True
                            if self.verbose:
                                print("Multilinguality: Excluding field {} to be copied to {} by config of exclude_field_map".format(
                                    fieldname, language_field))

                    if not excluded_by_mapping:
                        if self.verbose:
                            print("Multilinguality: Add {} to {}".format(
                                fieldname, language_field))

                        if not language_field in language_specific_data:
                            language_specific_data[language_field] = []

                        if isinstance(data[fieldname], list):
                            language_specific_data[language_field].extend(
                                data[fieldname])
                        else:
                            language_specific_data[language_field].append(
                                data[fieldname])

        # append language specific fields to data
        for key in language_specific_data:
            data[key] = language_specific_data[key]

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_ner_spacy.py
================================================
import etl
import requests
import json
import os
import sys
import time

#
# SpaCy Named Entity Recognizer (NER)
#

# Appends classified (Persons, Locations, Organizations) entities (names/words) to mapped facets/fields

class enhance_ner_spacy(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        if 'spacy_ner_mapping' in parameters:
            mapping = parameters['spacy_ner_mapping']
        else:
            mapping = {
                'ORG': 'organization_ss',
                'NORP': 'organization_ss',
                'orgName': 'organization_ss',
                'ORGANIZATION': 'organization_ss',
                'PER': 'person_ss',
                'PERSON': 'person_ss',
                'persName': 'person_ss',
                'GPE': 'location_ss',
                'LOC': 'location_ss',
                'placeName': 'location_ss',
                'FACILITY': 'location_ss',
                'PRODUCT': 'product_ss',
                'EVENT': 'event_ss',
                'LAW': 'law_ss',
                'DATE': 'date_ss',
                'TIME': 'time_ss',
                'MONEY': 'money_ss',
                'WORK_OF_ART': 'work_of_art_ss',
            }

        # default classifier
        classifier = 'en_core_web_sm'

        if 'spacy_ner_classifier_default' in parameters:
            classifier = parameters['spacy_ner_classifier_default']

        # set language specific classifier, if configured and document language detected
        if 'spacy_ner_classifiers' in parameters and 'language_s' in data:
            # is a language specific classifier there for the detected language?
            if data['language_s'] in parameters['spacy_ner_classifiers']:
                classifier = parameters['spacy_ner_classifiers'][data['language_s']]

        # if standard classifier configured to None and no classifier for detected language, exit the plugin
        if not classifier:
            return parameters, data

        if verbose:
            print("Using SpaCY NER language / classifier: {}".format(classifier))

        analyse_fields = ['title_txt', 'content_txt',
                          'description_txt', 'ocr_t']

        text = ''
        for field in analyse_fields:
            if field in data:
                text = "{}{}\n".format(text, data[field])

        # classify/tag with class each word of the content

        url = "http://localhost:8080/ent"
        if os.getenv('OPEN_SEMANTIC_ETL_SPACY_SERVER'):
            url = os.getenv('OPEN_SEMANTIC_ETL_SPACY_SERVER') + '/ent'

        headers = {'content-type': 'application/json'}
        d = {'text': text, 'model': classifier}

        retries = 0
        retrytime = 1
        # wait time until next retry will be doubled until reaching maximum of 120 seconds (2 minutes) until next retry
        retrytime_max = 120
        no_connection = True

        while no_connection:
            try:
                if retries > 0:
                    print(
                        'Retrying to connect to Spacy services in {} second(s).'.format(retrytime))
                    time.sleep(retrytime)
                    retrytime = retrytime * 2
                    if retrytime > retrytime_max:
                        retrytime = retrytime_max

                response = requests.post(
                    url, data=json.dumps(d), headers=headers)
                # if bad status code, raise exception
                response.raise_for_status()

                no_connection = False

            except requests.exceptions.ConnectionError as e:
                retries += 1
                sys.stderr.write(
                    "Connection to Spacy services (will retry in {} seconds) failed. Exception: {}\n".format(retrytime, e))

        r = response.json()

        for ent in r:

            entity_class = ent['label']
            # get entity string from returned start and end value
            entity = text[int(ent['start']): int(ent['end'])]

            # strip whitespaces from begin and end
            entity = entity.strip()

            # after strip exclude empty entities
            if not entity:
                continue

            # if class of entity is mapped to a facet/field, append the entity to this facet/field

            if entity_class in mapping:

                if verbose:
                    print("NER classified word(s)/name {} to {}. Appending to mapped facet {}".format(
                        entity, entity_class, mapping[entity_class]))

                etl.append(data, mapping[entity_class], entity)

            else:
                if verbose:
                    print("Since Named Entity Recognition (NER) class {} not mapped to a field/facet, ignore entity/word(s): {}".format(entity_class, entity))

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_ner_stanford.py
================================================
import etl
from nltk.tag.stanford import StanfordNERTagger


#
# Stanford Named Entitiy Recognizer (NER)
#

# Appends classified (Persons, Locations, Organizations) entities (names/words) to mapped facets/fields

class enhance_ner_stanford(object):

    # compound words of same class to multi word entities (result is a split by class changes instead of split on single words/tokens)
    def multi_word_entities(self, entities):

        multi_word_entities = []
        multi_word_entity = ""
        last_entity_class = ""

        i = 0

        for entity, entity_class in entities:

            i += 1

            class_change = False

            # new entity class different from last words which had been joined?
            if last_entity_class:
                if entity_class != last_entity_class:
                    class_change = True

            # if new class add last values to dictionary and begin new multi word entity
            if class_change:
                multi_word_entities.append(
                    (multi_word_entity, last_entity_class))
                multi_word_entity = ""

            # add new word to multi word entity
            if multi_word_entity:
                multi_word_entity += " " + entity
            else:
                multi_word_entity = entity

            # if last entity, no next class change, so add now
            if i == len(entities):
                multi_word_entities.append((multi_word_entity, entity_class))

            last_entity_class = entity_class

        return multi_word_entities

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        if 'stanford_ner_mapping' in parameters:
            mapping = parameters['stanford_ner_mapping']
        else:
            # todo: extend mapping for models with more classes like dates
            mapping = {
                'PERSON': 'person_ss',
                'LOCATION': 'location_ss',
                'ORGANIZATION': 'organization_ss',
                'I-ORG': 'organization_ss',
                'I-PER': 'person_ss',
                'I-LOC': 'location_ss',
                'ORG': 'organization_ss',
                'PER': 'person_ss',
                'LOC': 'location_ss',
                'PERS': 'person_ss',
                'LUG': 'location_ss',
                'MONEY': 'money_ss',
            }

        # default classifier
        classifier = 'english.all.3class.distsim.crf.ser.gz'

        if 'stanford_ner_classifier_default' in parameters:
            classifier = parameters['stanford_ner_classifier_default']

        # set language specific classifier, if configured and document language detected
        if 'stanford_ner_classifiers' in parameters and 'language_s' in data:
            # is a language speciic cassifier there for the detected language?
            if data['language_s'] in parameters['stanford_ner_classifiers']:
                classifier = parameters['stanford_ner_classifiers'][data['language_s']]

        # if standard classifier configured to None and no classifier for detected language, exit the plugin
        if not classifier:
            return parameters, data

        kwargs = {}

        if 'stanford_ner_java_options' in parameters:
            kwargs['java_options'] = parameters['stanford_ner_java_options']

        if 'stanford_ner_path_to_jar' in parameters:
            kwargs['path_to_jar'] = parameters['stanford_ner_path_to_jar']

        analyse_fields = ['title_txt', 'content_txt',
                          'description_txt', 'ocr_t', 'ocr_descew_t']

        text = ''
        for field in analyse_fields:
            if field in data:
                text = "{}{}\n".format(text, data[field])

        # classify/tag with class each word of the content
        st = StanfordNERTagger(classifier, encoding='utf8',
                               verbose=verbose, **kwargs)
        entities = st.tag(text.split())

        # compound words of same class to multi word entities (result is a split by class changes instead of split on single words/tokens)
        entities = self.multi_word_entities(entities)

        # if class of entity is mapped to a facet/field, append the entity to this facet/field
        for entity, entity_class in entities:

            if entity_class in mapping:

                if verbose:
                    print("NER classified word(s)/name {} to {}. Appending to mapped facet {}".format(
                        entity, entity_class, mapping[entity_class]))

                etl.append(data, mapping[entity_class], entity)

            else:
                if verbose:
                    print("Since Named Entity Recognition (NER) class {} not mapped to a field/facet, ignore entity/word(s): {}".format(entity_class, entity))

        # mark the document, that it was analyzed by this plugin yet
        data['enhance_ner_stanford_b'] = "true"

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_ocr.py
================================================
from tesseract_cache import tesseract_cache


#
# If image add ocr text
#
class enhance_ocr(object):

    # how to find uris which are not enriched yet?
    # (if not enhanced on indexing but later)

    # this plugin needs to read the field id as a
    # parameters to enrich unenriched docs
    fields = ['id', 'content_type']

    # query to find documents, that were not enriched by this plugin yet
    # (since we marked documents which were OCRd with ocr_b = true
    query = "content_type: image/* AND NOT enhance_ocr_b:true"

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = parameters.get('verbose', False)

        filename = parameters['filename']

        if 'content_type_ss' in data:
            mimetype = data['content_type_ss']
        else:
            mimetype = parameters['content_type_ss']

        # if connector returns a list, use only first
        # value (which is the only entry of the list)
        if isinstance(mimetype, list):
            mimetype = mimetype[0]

        lang = parameters.get('ocr_lang', 'eng')

        if "image" in mimetype.lower():
            if verbose:
                print("Mimetype seems image ({}), starting OCR"
                      .format(mimetype))

            ocr_txt = tesseract_cache.get_ocr_text(filename=filename, lang=lang, cache_dir=parameters.get("ocr_cache"))

            if ocr_txt:
                data['ocr_t'] = ocr_txt

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_path.py
================================================
import os.path

#
# Build and add path facets from filename
#

class enhance_path(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        docid = parameters['id']

        filename_extension = os.path.splitext(docid)[1][1:].lower()
        if filename_extension:
            data['filename_extension_s'] = filename_extension

        if 'facet_path_strip_prefix' in parameters:
            facet_path_strip_prefix = parameters['facet_path_strip_prefix']
        else:
            facet_path_strip_prefix = ['file://', 'http://', 'https://']

        # if begins with unwanted path prefix strip it
        if facet_path_strip_prefix:
            for prefix in facet_path_strip_prefix:
                if docid.startswith(prefix):
                    docid = docid.replace(prefix, '', 1)
                    break

        # replace backslash (i.e. windows filenames) with unix path seperator
        docid = docid.replace("\\", '/')

        # replace # (i.e. uri) with unix path seperator
        docid = docid.replace("#", '/')

        # if more than one /
        docid = docid.replace("//", '/')

        # split paths
        path = docid.split('/')

        # it's only a domain
        if (len(path) == 1) or (len(path) == 2 and docid.endswith('/')):
            data['path0_s'] = path[0]

        else:
            # it's a path

            # if leading / on unix paths, split leads to first element empty, so delete it
            if not path[0]:
                del path[0]

            i = 0
            for subpath in path:

                if i == len(path) - 1:
                    # last element, so basename/pure filename without path
                    if subpath:  # if not ending / so empty last part after split
                        data['path_basename_s'] = subpath
                else:
                    # not last path element (=filename), so part of path, not the filename at the end
                    data['path' + str(i) + '_s'] = subpath
                    i += 1

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_pdf_ocr.py
================================================
import os.path
import sys
import subprocess
import hashlib
import tempfile
import json

import etl_plugin_core
from tesseract_cache import tesseract_cache


# Extract text from all extracted images from pdf
# if splitpages is off, return one txt instead of page based list of texts

def pdfimages2text(filename, lang='eng', verbose=False,
                   pdf_ocr=True,
                   cache=None):
    ocr_txt = {}
    if cache is not None:
        try:
            return load_cache(filename, cache, lang, pdf_ocr)
        except (FileNotFoundError, KeyError):
            if verbose:
                print('Not in PDF OCR cache, starting OCR for {}'.format(filename))

    ocr_temp_dirname = tempfile.mkdtemp(prefix="opensemanticetl_pdf_ocr_")

    # Extract all images of the pdf to tempdir with commandline tool
    # "pdfimages" from poppler pdf toolbox
    # -j = export as JPEG
    # -p = write page name in image filename
    result = subprocess.call(
        ['pdfimages', '-p', '-j', filename,
         ocr_temp_dirname + os.path.sep + 'image'])

    if result != 0:
        sys.stderr.write(
            "Error: Extracting images from PDF failed for {} {}"
            .format(filename, result))
        return {}, {}

    images = os.listdir(ocr_temp_dirname)
    images.sort()

    for image in images:

        imagefilename = ocr_temp_dirname + os.path.sep + image

        if pdf_ocr:

            try:
                result = tesseract_cache.get_ocr_text(filename=imagefilename, lang=lang, cache_dir=cache)

                if result:
                    # extract page number from extracted image
                    # filename (image-pagenumber-imagenumber.jpg)
                    pagenumber = int(image.split('-')[1])

                    append_page(ocr_txt, pagenumber, result)
            except BaseException as e:
                sys.stderr.write("Exception while OCR of PDF: {} - "
                                 "maybe corrupt image: {} - exception: {}\n"
                                 .format(filename, imagefilename, e))

        os.remove(imagefilename)

    os.rmdir(ocr_temp_dirname)
    return ocr_txt


def load_cache(filename, cache, lang='eng',
               pdf_ocr=True):
    pdffile = open(filename, 'rb')
    md5hash = hashlib.md5(pdffile.read()).hexdigest()
    pdffile.close()
    ocr_cache_filename = cache + os.path.sep + \
        "{}-{}.json".format(lang, md5hash)
    with open(ocr_cache_filename) as f:
        dct = json.load(f)
        ocr_txt = None
        if pdf_ocr:
            ocr_txt = dict(enumerate(dct["ocr_txt"], 1))
        return ocr_txt


def append_page(dct, n, page):
    if n in dct:
        dct[n] += '\n' + page
    else:
        dct[n] = page

#
# Process plugin
#
# check if content type PDF, if so start enrich pdf process for OCR
#

class enhance_pdf_ocr(etl_plugin_core.Plugin):

    # process plugin, if one of the filters matches
    filter_filename_suffixes = ['.pdf']
    filter_mimetype_prefixes = ['application/pdf']

    # how to find uris which are not enriched yet?
    # (if not enhanced on indexing but later)

    # this plugin needs to read the field id as a parameters
    # to enrich unenriched docs
    fields = ['id', 'content_type']

    # query to find documents, that were not enriched by this plugin yet
    # (since we marked documents which were OCRd with ocr_b = true
    query = ("(content_type:application/pdf*) "
             "AND NOT (etl_enhance_pdf_ocr_b:true)")

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = parameters.get('verbose', False)

        # no further processing, if plugin filters like for content type do not match
        if self.filter(parameters, data):
            return parameters, data
    
        filename = parameters['filename']

        # is OCR of embedded images by Tika enabled or disabled by config?
        ocr_pdf_tika = parameters.get('ocr_pdf_tika', True)

        # was there a Tika exception?
        tika_exception = parameters.get('etl_tika_exception', False)
        if 'etl_error_plugins_ss' in data:
            if 'enhance_extract_text_tika_server' in data['etl_error_plugins_ss']:
                tika_exception = True


        # OCR is done by Apache Tika plugin
        # If standard OCR by Tika is disabled or Tika Exception, do it here
        pdf_ocr = False

        # Do not run if no images (detected by Tika plugin)
        nothing_for_ocr = parameters.get('etl_nothing_for_ocr', False)

        if nothing_for_ocr:

            if verbose:
                print('Not running OCR for PDF, since no image(s) detected by Apache Tika')
            
            pdf_ocr = False

        elif tika_exception or ocr_pdf_tika == False:
            pdf_ocr = True

        if pdf_ocr:
    
            if verbose:
                print('Mimetype is PDF or file ending is .pdf, running OCR of embedded images')

                if not ocr_pdf_tika:
                    print ('OCR of embedded images in PDF by Apache Tika is disabled, so doing OCR for PDF by plugin enhance_pdf_ocr')
                elif tika_exception:
                    print ('Because of Apache Tika exception, adding / trying fallback OCR for PDF by plugin enhance_pdf_ocr')

            lang = parameters.get('ocr_lang', 'eng')
    
            ocr_txt = {}

            try:
                ocr_txt = pdfimages2text(
                    filename=filename, lang=lang, verbose=verbose,
                    pdf_ocr=pdf_ocr,
                    cache=parameters.get("ocr_cache"))
            except BaseException as e:
                sys.stderr.write(
                    "Exception while OCR the PDF {} - {}\n".format(filename, e))
        
            parameters['enhance_pdf_ocr'] = ocr_txt

            # create text field ocr_t with all OCR results of all pages
            pages_content = [value for (key, value) in sorted(ocr_txt.items())]
            data['ocr_t'] = "\n".join(pages_content)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_pdf_page.py
================================================
import os
import sys
import subprocess
import tempfile
import hashlib

import etl_plugin_core
from etl import ETL

#
# by split to pages (so we have links to pages instead of documents) and get text from OCR from previous running plugin enhance_pdf_ocr and run plugins for splitting results into paragraphs and sentences
#


class enhance_pdf_page(etl_plugin_core.Plugin):

    # process plugin, if one of the filters matches
    filter_filename_suffixes = ['.pdf']
    filter_mimetype_prefixes = ['application/pdf']

    # how to find uris which are not enriched yet?
    # (if not enhanced on indexing but later)

    # this plugin needs to read the field id as a parameters to enrich unenriched docs
    fields = ['id', 'content_type']

    # query to find documents, that were not enriched by this plugin yet
    # (since we marked documents which were OCRd with ocr_b = true
    query = "content_type: application\/pdf* AND NOT enhance_pdf_page_b:true"

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        # no further processing, if plugin filters like for content type do not match
        if self.filter(parameters, data):
            return parameters, data

        if verbose:
            print('Mimetype or filename suffix is PDF, extracting single pages for segmentation')

        if 'id' in data:
            docid = data['id']
        else:
            docid = parameters['id']

        filename = parameters['filename']

        # defaults, if pdfinfo will not detect them
        pages = 1
        title = 'No title'
        author = None

        # get pagecount with pdfinfo command line tool
        pdfinfo = subprocess.check_output(
            ['pdfinfo', '-enc', 'UTF-8', filename])

        # decode
        pdfinfo = pdfinfo.decode(encoding='UTF-8')

        # get the count of pages from pdfinfo result
        # its a text with a line per parameter
        for line in pdfinfo.splitlines():
            line = line.strip()
            # we want only the line with the pagecount
            if line.startswith('Pages:'):
                pages = int(line.split()[1])

            if line.startswith('Title:'):
                title = line.replace("Title:", '', 1)
                title = title.strip()

            if line.startswith('Author:'):
                author = line.replace("Author:", '', 1)
                author = author.strip()

        etl = ETL()

        # export and index each page
        for pagenumber in range(1, pages + 1):

            if verbose:
                print("Extracting PDF page {} of {}".format(pagenumber, pages))
            # generate temporary filename
            md5hash = hashlib.md5(filename.encode('utf-8')).hexdigest()
            temp_filename = tempfile.gettempdir() + os.path.sep + \
                "opensemanticetl_pdftotext_" + md5hash + "_" + str(pagenumber)

            # call pdftotext to write the text of page into tempfile
            try:
                result = subprocess.check_call(['pdftotext', '-enc', 'UTF-8', '-f', str(
                    pagenumber), '-l', str(pagenumber), filename, temp_filename])
            except BaseException as e:
                sys.stderr.write(
                    "Exception extracting text from PDF page {}: {}\n".format(pagenumber, e))

            # read text from tempfile
            f = open(temp_filename, "r", encoding="utf-8")
            text = f.read()
            os.remove(temp_filename)

            partdocid = docid + '#page=' + str(pagenumber)

            partparameters = parameters.copy()
            partparameters['plugins'] = ['enhance_path', 'enhance_detect_language_tika_server',
                                         'enhance_entity_linking', 'enhance_multilingual']

            if 'enhance_ner_spacy' in parameters['plugins']:
                partparameters['plugins'].append('enhance_ner_spacy')
            if 'enhance_ner_stanford' in parameters['plugins']:
                partparameters['plugins'].append('enhance_ner_stanford')

            pagedata = {}
            pagedata['id'] = partdocid

            pagedata['page_i'] = pagenumber
            pagedata['pages_i'] = pages
            pagedata['container_s'] = docid
            pagedata['title_txt'] = title

            if author:
                pagedata['author_ss'] = author

            pagedata['content_type_group_ss'] = "Page"
            pagedata['content_type_ss'] = "PDF page"
            pagedata['content_txt'] = text

            if verbose:
                print("Indexing extracted page {}".format(pagenumber))

            # index page
            try:
                partparameters, pagedata = etl.process(
                    partparameters, pagedata)

            except BaseException as e:
                sys.stderr.write(
                    "Exception adding PDF page {} : {}".format(pagenumber, e))

        data['pages_i'] = pages

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_pdf_page_preview.py
================================================
import sys
import subprocess
from pathlib import Path
import hashlib

import etl_plugin_core


# generate single page PDF for each page of the full PDF for preview so client has not to load full pdf for previewing a page

class enhance_pdf_page_preview(etl_plugin_core.Plugin):
    
    # process plugin, if one of the filters matches
    filter_filename_suffixes = ['.pdf']
    filter_mimetype_prefixes = ['application/pdf']


    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        # no further processing, if plugin filters like for content type do not match
        if self.filter(parameters, data):
            return parameters, data

        if verbose:
            print('Mimetype or filename suffix is PDF, extracting single pages for preview')

        if 'id' in data:
            docid = data['id']
        else:
            docid = parameters['id']

        filename = parameters['filename']

        thumbnail_dir = '/var/opensemanticsearch/media/thumbnails'

        # generate thumbnail directory
        md5hash = hashlib.md5(docid.encode('utf-8')).hexdigest()

        if not thumbnail_dir.endswith('/'):
            thumbnail_dir += '/'

        thumbnail_subdir = md5hash

        Path(thumbnail_dir + thumbnail_subdir).mkdir(parents=True, exist_ok=True)

        if verbose:
            print("Generating single page PDF for previews from {} for {} to {}".format(
                filename, docid, thumbnail_dir + thumbnail_subdir))

        # call pdftk burst
        try:
            result = subprocess.check_call(
                ['pdftk', filename, 'burst', 'output', thumbnail_dir + thumbnail_subdir + '/%d.pdf'])
            data['etl_thumbnails_s'] = thumbnail_subdir
        except BaseException as e:
            sys.stderr.write(
                "Exception while genarating single page PDFs by pdftk burst\n")

        return parameters, data
    

================================================
FILE: src/opensemanticetl/enhance_pst.py
================================================
import sys
import hashlib
import tempfile
import os
import shutil
import subprocess

import etl_plugin_core
from etl_file import Connector_File

#
# Extract emails from Outlook PST file
#

class enhance_pst(etl_plugin_core.Plugin):
    # process plugin, if one of the filters matches
    filter_filename_suffixes = ['.pst']
    filter_mimetype_prefixes = ['application/vnd.ms-outlook-pst']

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        # no further processing, if plugin filters like for content type do not match
        if self.filter(parameters, data):
            return parameters, data

        if verbose:
            print("Mimetype or file ending seems Outlook PST file, starting extraction of emails")


        pstfilename = parameters['filename']

        # we build temp dirname ourselfes instead of using system_temp_dirname so we can use configurable / external tempdirs

        if 'tmp' in parameters:
            system_temp_dirname = parameters['tmp']
            if not os.path.exists(system_temp_dirname):
                os.mkdir(system_temp_dirname)
        else:
            system_temp_dirname = tempfile.gettempdir()

        h = hashlib.md5(parameters['id'].encode('UTF-8'))
        temp_dirname = system_temp_dirname + os.path.sep + \
            "opensemanticetl_enhancer_pst_" + \
            str(os.getpid()) + "_" + h.hexdigest()

        if not os.path.exists(temp_dirname):
            os.mkdir(temp_dirname)

        # start external PST extractor / converter
        result = subprocess.call(
            ['readpst', '-S', '-D', '-o', temp_dirname, pstfilename])

        if not result == 0:
            sys.stderr.write(
                "Error: readpst failed for {}".format(pstfilename))

        # prepare document processing
        connector = Connector_File()
        connector.verbose = verbose
        connector.config = parameters.copy()

        # only set container if not yet set by a ZIP or PST before (if this PST is inside another ZIP or PST)
        if not 'container' in connector.config:
            connector.config['container'] = pstfilename

        for dirName, subdirList, fileList in os.walk(temp_dirname):

            if verbose:
                print('Scanning directory: %s' % dirName)

            for fileName in fileList:
                if verbose:
                    print('Scanning file: %s' % fileName)

                try:
                    # replace temp dirname from indexed id
                    contained_dirname = dirName.replace(temp_dirname, '', 1)

                    # build a virtual filename pointing to original PST file

                    if contained_dirname:
                        contained_dirname = contained_dirname + os.path.sep
                    else:
                        contained_dirname = os.path.sep

                    connector.config['id'] = parameters['id'] + \
                        contained_dirname + fileName

                    contained_filename = dirName + os.path.sep + fileName

                    # E-mails filenames are pure number
                    # Attachment file names are number-filename
                    # if temp_filename without - in filename, its a mail file
                    # rename to suffix .eml so Tika will extract more metadata like from and to
                    if not '-' in fileName:
                        os.rename(contained_filename,
                                  contained_filename + '.eml')
                        contained_filename += '.eml'
                        connector.config['id'] += '.eml'

                    try:
                        connector.index_file(filename=contained_filename)

                    except KeyboardInterrupt:
                        raise KeyboardInterrupt

                    except BaseException as e:
                        sys.stderr.write("Exception while indexing contained content {} from {} : {}\n".format(
                            fileName, connector.config['container'], e.args[0]))

                    os.remove(contained_filename)

                except BaseException as e:
                    sys.stderr.write(
                        "Exception while indexing file {} : {}\n".format(fileName, e.args[0]))

        shutil.rmtree(temp_dirname)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_rdf.py
================================================
import sys
import logging
import rdflib

import etl_plugin_core

# define used ontologies / standards / properties
skos = rdflib.Namespace('http://www.w3.org/2004/02/skos/core#')
owl = rdflib.Namespace('http://www.w3.org/2002/07/owl#')

import etl
from etl import ETL


# Import RDF graph file granular, not only as a whole single file:
# for every entity (subject) own document with properties (predicates) as facets and its objects as values

class enhance_rdf(etl_plugin_core.Plugin):

    def __init__(self, verbose=False):

        self.verbose = verbose

        self.labelProperties = (rdflib.term.URIRef(u'http://www.w3.org/2004/02/skos/core#prefLabel'), rdflib.term.URIRef(u'http://www.w3.org/2000/01/rdf-schema#label'),
                                rdflib.term.URIRef(u'http://www.w3.org/2004/02/skos/core#altLabel'), rdflib.term.URIRef(u'http://www.w3.org/2004/02/skos/core#hiddenLabel'))


    #
    # get all labels, alternate labels / synonyms for the URI/subject, if not there, use subject (=URI) as default
    #

    def get_labels(self, subject):

        labels = []

        # append RDFS.label

        # get all labels for this obj
        for label in self.graph.objects(subject=subject, predicate=rdflib.RDFS.label):
            labels.append(str(label))

        #
        # append SKOS labels
        #

        # append SKOS prefLabel
        skos = rdflib.Namespace('http://www.w3.org/2004/02/skos/core#')
        for label in self.graph.objects(subject=subject, predicate=skos['prefLabel']):
            labels.append(str(label))

        # append SKOS altLabels
        for label in self.graph.objects(subject=subject, predicate=skos['altLabel']):
            labels.append(str(label))

        # append SKOS hiddenLabels
        for label in self.graph.objects(subject=subject, predicate=skos['hiddenLabel']):
            labels.append(str(label))

        return labels

    #
    # Get indexable full text(s) / label(s) instead of URI references
    #

    def get_values(self, obj):

        values = []

        # since we want full text search we want not to use ID/URI but all labels for indexing
        # if type not literal but URI reference, add label(s)

        if type(obj) == rdflib.URIRef:

            # get labels of this object, therefore it is the subject parameter for getlabels()
            values = self.get_labels(subject=obj)

            if not values:

                if self.verbose:
                    print("No label for this object, using URI {}".format(obj))

                values = str(obj)

        elif type(obj) == rdflib.term.Literal:
            values = str(obj)

        # if no values or labels, use the object / URI
        if not values:
            if self.verbose:
                print("No label or URI for this object, using object {}".format(obj))
                print("Data type of RDF object: {}".format(type(obj)))

            values = str(obj)

        return values

    # best/preferred label as title
    def get_preferred_label(self, subject, lang='en'):

        preferred_label = self.graph.preferredLabel(
            subject=subject, lang=lang, labelProperties=self.labelProperties)

        # if no label in preferred language, try with english, if not preferred lang is english yet)
        if not preferred_label and not lang == 'en':

            preferred_label = self.graph.preferredLabel(
                subject=subject, lang='en', labelProperties=self.labelProperties)

        # use label from some other language
        if not preferred_label:

            preferred_label = self.graph.preferredLabel(
                subject=subject, labelProperties=self.labelProperties)

        # if no label, use URI
        if preferred_label:
            # since return is tuple with type and label take only the label
            preferred_label = preferred_label[0][1]
        else:
            preferred_label = subject

        return str(preferred_label)

    #
    # ETL knowledge graph to full text search index
    #

    # Index each entity / subject with all its properties/predicates as facets and objects (dereference URIs by their labels) as values

    def etl_graph(self, parameters):

        if self.verbose:
            print("Graph has {} triples.".format(len(self.graph)))

        count_triple = 0
        count_subjects = 0

        part_parameters = {}
        part_parameters['plugins'] = []
        part_parameters['export'] = parameters['export']

        property2facet = {}
        if 'property2facet' in parameters:
            property2facet = parameters['property2facet']

        etl_processor = ETL()
        etl_processor.verbose = self.verbose

        class_properties = []
        class_properties.append(rdflib.term.URIRef(
            u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'))
        class_properties.append(rdflib.term.URIRef(
            u'http://www.wikidata.org/prop/direct/P31'))
        # since there can be multiple triples/values for same property in/from different graphs or graph describes existing other file/document,
        # do not overwrite document but add value to existent document & values of the facet/field/property
        part_parameters['add'] = True

        # use SPARQL query with distinct to get subjects only once
        res = self.graph.query(
                """SELECT DISTINCT ?subject
			WHERE {
			?subject ?predicate ?object .
			}""")

        for row in res:

            count_subjects += 1

            if self.verbose:
                print("Importing entity / subject {}".format(count_subjects))

            # get subject of the concept from first column
            subj = row[0]

            if self.verbose:
                print("Processing RDF subject {}".format(subj))

            part_data = {}

            part_data['content_type_group_ss'] = 'Knowledge graph'
            # subject as URI/ID
            part_parameters['id'] = str(subj)

            preferred_label = self.get_preferred_label(subject=subj)
            part_data['title_txt'] = preferred_label

            count_subject_triple = 0

            # get all triples for this subject
            for pred, obj in self.graph.predicate_objects(subject=subj):

                count_triple += 1
                count_subject_triple += 1

                if self.verbose:
                    print("Importing subjects triple {}".format(
                        count_subject_triple))
                    print("Predicate / property: {}".format(pred))
                    print("Object / value: {}".format(obj))

                try:

                    # if class add preferredlabel of this entity to facet of its class (RDF rdf:type or Wikidata "instance of" (Property:P31)),
                    # so its name (label) will be available in entities view and as filter for faceted search

                    if pred in class_properties:
                        class_facet = str(obj)
                        # map class to facet, if mapping for class exist
                        if class_facet in property2facet:
                            class_facet = property2facet[class_facet]
                            if class_facet in parameters['facets']:
                                part_data['content_type_ss'] = 'Knowledge graph class {}'.format(
                                    parameters['facets'][class_facet]['label'])
                                etl.append(data=part_data, facet=class_facet, values=preferred_label)

                    #
                    # Predicate/property to facet/field
                    #

                    # set Solr datatype strings so facets not available yet in Solr schema can be inserted automatically (dynamic fields) with right datatype

                    facet = str(pred) + '_ss'
                    facet_uri = facet + '_uri_ss'
                    facet_preferred_label_and_uri = facet + '_preflabel_and_uri_ss'

                    if self.verbose:
                        print("Facet: {}".format(facet))

                    #
                    # get values or labels of this object
                    #

                    values = self.get_values(obj=obj)
                    if self.verbose:
                        print("Values: {}".format(values))

                    # insert or append value (object of triple) to data
                    etl.append(data=part_data, facet=facet, values=values)

                    # if object is reference/URI append URI
                    if type(obj) == rdflib.URIRef:

                        uri = str(obj)

                        etl.append(data=part_data, facet=facet_uri, values=uri)

                        # append mixed field with preferred label and URI of the object for disambiguation of different Entities/IDs/URIs with same names/labels in faceted search
                        preferredlabel_and_uri = "{} <{}>".format(
                            self.get_preferred_label(subject=obj), str(obj))

                    else:
                        preferredlabel_and_uri = self.get_preferred_label(
                            subject=obj)

                    etl.append(
                        data=part_data, facet=facet_preferred_label_and_uri, values=preferredlabel_and_uri)

                except KeyboardInterrupt:
                    raise KeyboardInterrupt

                except BaseException as e:
                    sys.stderr.write("Exception while triple {} of subject {}: {}\n".format(
                        count_subject_triple, subj, e))

            # index subject
            etl_processor.process(part_parameters, part_data)

    def etl_graph_file(self, docid, filename, parameters=None):
        if parameters is None:
            parameters = {}

        self.graph = rdflib.Graph()
        self.graph.parse(filename)

        self.etl_graph(parameters=parameters)

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                self.verbose = True

        # get parameters
        docid = parameters['id']
        filename = parameters['filename']

        mimetype = ''
        if 'content_type_ss' in data:
            mimetype = data['content_type_ss']
        elif 'content_type_ss' in parameters:
            mimetype = parameters['content_type_ss']

        # if connector returns a list, use only first value (which is the only entry of the list)
        if isinstance(mimetype, list):
            mimetype = mimetype[0]

        # todo: add other formats like turtle
        # if mimetype is graph, call graph import
        if mimetype.lower() == "application/rdf+xml":

            self.etl_graph_file(docid, filename, parameters=parameters)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_rdf_annotations_by_http_request.py
================================================
import os
import sys
import hashlib
import urllib
import rdflib
from rdflib import URIRef

# Do templating of metaserver url for id


def metaserver_url(metaserver, docid):

    metaurl = metaserver

    metaurl = metaurl.replace('[uri]', urllib.parse.quote_plus(docid))

    h = hashlib.md5(docid.encode("utf-8"))
    metaurl = metaurl.replace(
        '[uri_md5]', urllib.parse.quote_plus(h.hexdigest()))

    return metaurl


# get the modification date of meta data
# todo: check all metaservers, not only the last one and return latest date

def getmeta_modified(metaservers, docid, verbose=False):

    if isinstance(metaservers, str):
        metaserver = metaservers
    else:
        for server in metaservers:
            metaserver = server

    metaurl = metaserver_url(metaserver, docid)

    moddate = False

    if verbose:
        print("Getting Meta from {}".format(metaurl))

    try:
        g = rdflib.Graph()
        result = g.parse(metaurl)

        # if semantic mediawiki modification date field, take this as date

        for subj, pred, obj in g.triples((None, URIRef("http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate"), None)):

            # todo only if later than previos, if more than one (f.e. more than one metaserver)
            moddate = str(obj)

        if verbose:
            print("Extracted modification date: {}".format(moddate))

        if verbose:
            if not moddate:
                print("No modification date for metadata")

    except BaseException as e:
        sys.stderr.write(
            "Exception while getting metadata modification time: {}\n".format(e.args[0]))

    return moddate


# Get tagging and annotation from metadata server
def getmeta_rdf_from_server(metaserver, data, property2facet, docid, verbose=False):

    moddate = False

    metaurl = metaserver_url(metaserver, docid)

    if verbose:
        print("Getting Meta from {}".format(metaurl))

    g = rdflib.Graph()
    result = g.parse(metaurl)

    # Print infos
    if verbose:
        print("Meta graph has {} statements.".format(len(g)))
        for subj, pred, obj in g:

            try:
                print("{} : {}".format(pred, obj.toPython))
            except BaseException as e:
                sys.stderr.write(
                    "Exception while printing triple: {}\n".format(e.args[0]))

    # make solr iteral for each rdf tripple contained in configurated properties
    for facet in property2facet:

        # if this predicat is configured as facet, add literal with pred as facetname and object as value
        try:
            if verbose:
                print('Checking Facet {}'.format(facet))

            facetRef = URIRef(facet)

            for subj, pred, obj in g.triples((None, facetRef, None)):
                try:

                    # add the facet with object as value
                    solr_facet = property2facet[facet]

                    if verbose:
                        print("Adding Solr facet {} with the object {}".format(
                            solr_facet, obj))

                    if solr_facet in data:
                        data[solr_facet].append(obj.toPython())
                    else:
                        data[solr_facet] = [obj.toPython()]

                except BaseException as e:
                    sys.stderr.write(
                        "Exception while checking predicate {}{}\n".format(pred, e.args[0]))

        except BaseException as e:
            sys.stderr.write(
                "Exception while checking a part of metadata graph: {}\n".format(e.args[0]))

    # if semantic mediawiki modification date field, take this as date
    moddateRef = URIRef(
        "http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate")
    if (None, moddateRef, None) in g:
        for subj, pred, obj in g.triples((None, moddateRef, None)):
            moddate = obj.toPython()

        # todo: transform date format to date and in exporter date to Solr date string format
        #data['meta_modified_dt'] = str(moddate)

        if verbose:
            print("Extracted modification date: {}".format(moddate))

    elif verbose:
        print("No semantic mediawiki modification date")

    return data


# Get tagging and annotation from metadata server

class enhance_rdf_annotations_by_http_request(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        # get parameters
        docid = parameters['id']

        metaserver = parameters['metaserver']
        if os.getenv('OPEN_SEMANTIC_ETL_METADATA_SERVER'):
            metaserver = os.getenv('OPEN_SEMANTIC_ETL_METADATA_SERVER')

        property2facet = parameters['property2facet']

        if isinstance(metaserver, str):
            # get metadata
            metaserver=[metaserver]
 
        for server in metaserver:
            # get and add metadata
            data = getmeta_rdf_from_server(
                metaserver=server, data=data, property2facet=property2facet, docid=docid, verbose=verbose)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_regex.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import re
import etl_plugin_core


def regex2facet(data, text, regex, group, facet, verbose=False):

    if verbose:
        print("Checking regex {} for facet {}".format(regex, facet))

    matches = re.finditer(regex, text, re.IGNORECASE)

    if matches:
        for match in matches:

            try:
                value = match.group(group)
                if verbose:
                    print("Found regex {} with value {} for facet {}".format(
                        regex, value, facet))

                etl_plugin_core.append(data, facet, value)

            except BaseException as e:
                print("Exception while adding value {} from regex {} and group {} to facet {}:".format(
                    value, regex, group, facet))
                print(e.args[0])


# opens a tab with regexes and facets
def readregexesfromfile(data, text, filename, verbose=False):
    listfile = open(filename)

    # search all the lines
    for line in listfile:
        try:
            line = line.strip()

            # ignore empty lines and comment lines (starting with #)
            if line and not line.startswith("#"):
                facet = 'tag_ss'
                columns = line.split("\t")

                regex = columns[0]

                if len(columns) > 1:
                    facet = columns[1]

                if len(columns) > 2:
                    group = int(columns[2])
                else:
                    group = 0

                regex2facet(data=data, text=text, regex=regex,
                            group=group, facet=facet, verbose=verbose)

        except BaseException as e:
            print("Exception while checking line {} of regexlist {}:".format(
                line, filename))
            print(e.args[0])

    listfile.close()


#
# add to configured facet, if entry in list is in text
#

class enhance_regex(object):
    def process(self, parameters=None, data=None):

        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        regexlists = {}

        if 'regex_lists' in parameters:
            regexlists = parameters['regex_lists']

        # collect/copy to be analyzed text from all fields
        text = etl_plugin_core.get_text(data=data)

        for regexlistfile in regexlists:

            try:

                readregexesfromfile(data=data, text=text,
                                    filename=regexlistfile, verbose=verbose)

            except BaseException as e:
                print("Exception while checking regex list {}:".format(regexlistfile))
                print(e.args[0])

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_sentence_segmentation.py
================================================
import json
import os
import requests
import sys
import time

from etl import ETL

#
# split text to sentences
#


class enhance_sentence_segmentation(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        if 'id' in data:
            docid = data['id']
        else:
            docid = parameters['id']

        # default classifier
        classifier = 'en_core_web_sm'

        if 'spacy_ner_classifier_default' in parameters:
            classifier = parameters['spacy_ner_classifier_default']

        # set language specific classifier, if configured and document language detected
        if 'spacy_ner_classifiers' in parameters and 'language_s' in data:
            # is a language speciic cassifier there for the detected language?
            if data['language_s'] in parameters['spacy_ner_classifiers']:
                classifier = parameters['spacy_ner_classifiers'][data['language_s']]

                analyse_fields = ['content_txt', 'ocr_t', 'ocr_descew_t']

        text = ''
        for field in analyse_fields:
            if field in data:
                text = "{}{}\n".format(text, data[field])

        # extract sentences from text
        url = "http://localhost:8080/sents"
        if os.getenv('OPEN_SEMANTIC_ETL_SPACY_SERVER'):
            url = os.getenv('OPEN_SEMANTIC_ETL_SPACY_SERVER') + '/sents'

        headers = {'content-type': 'application/json'}
        d = {'text': text, 'model': classifier}

        retries = 0
        retrytime = 1
        # wait time until next retry will be doubled until reaching maximum of 120 seconds (2 minutes) until next retry
        retrytime_max = 120
        no_connection = True

        while no_connection:
            try:
                if retries > 0:
                    print(
                        'Retrying to connect to Spacy services in {} second(s).'.format(retrytime))
                    time.sleep(retrytime)
                    retrytime = retrytime * 2
                    if retrytime > retrytime_max:
                        retrytime = retrytime_max

                response = requests.post(url, data=json.dumps(d), headers=headers)

                # if bad status code, raise exception
                response.raise_for_status()

                no_connection = False

            except requests.exceptions.ConnectionError as e:
                retries += 1
                sys.stderr.write(
                    "Connection to Spacy services (will retry in {} seconds) failed. Exception: {}\n".format(retrytime, e))

        sentences = response.json()

        etl = ETL()

        sentencenumber = 0

        for sentence in sentences:

            sentencenumber += 1

            partdocid = docid + '#sentence' + str(sentencenumber)

            partparameters = parameters.copy()
            partparameters['plugins'] = ['enhance_path', 'enhance_detect_language_tika_server',
                                         'enhance_entity_linking', 'enhance_multilingual']

            if 'enhance_ner_spacy' in parameters['plugins']:
                partparameters['plugins'].append('enhance_ner_spacy')
            if 'enhance_ner_stanford' in parameters['plugins']:
                partparameters['plugins'].append('enhance_ner_stanford')

            sentencedata = {}
            sentencedata['id'] = partdocid

            sentencedata['container_s'] = docid

            if 'author_ss' in data:
                sentencedata['author_ss'] = data['author_ss']

            sentencedata['content_type_group_ss'] = "Sentence"
            sentencedata['content_type_ss'] = "Sentence"
            sentencedata['content_txt'] = sentence

            # index sentence
            try:
                partparameters, sentencedata = etl.process(
                    partparameters, sentencedata)

            except BaseException as e:
                sys.stderr.write(
                    "Exception adding sentence {} : {}".format(sentencenumber, e))

        data['sentences_i'] = sentencenumber

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_warc.py
================================================
import hashlib
import tempfile
import os
import sys
import shutil
import time

from warcio.archiveiterator import ArchiveIterator

import etl_plugin_core
from etl_file import Connector_File


class enhance_warc(etl_plugin_core.Plugin):

    # process plugin, if one of the filters matches
    filter_filename_suffixes = ['.warc', '.warc.gz']
    filter_mimetype_prefixes = ['application/warc']

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        # no further processing, if plugin filters like for content type do not match
        if self.filter(parameters, data):
            return parameters, data

        warcfilename = parameters['filename']

        # create temp dir where to unwarc the archive
        if 'tmp' in parameters:
            system_temp_dirname = parameters['tmp']
            if not os.path.exists(system_temp_dirname):
                os.mkdir(system_temp_dirname)
        else:
            system_temp_dirname = tempfile.gettempdir()

        # we build temp dirname ourselfes instead of using system_temp_dirname so we can use configurable / external tempdirs
        h = hashlib.md5(parameters['id'].encode('UTF-8'))
        temp_dirname = system_temp_dirname + os.path.sep + \
            "opensemanticetl_enhancer_warc_" + h.hexdigest()

        if os.path.exists(temp_dirname) == False:
            os.mkdir(temp_dirname)

        # prepare document processing
        connector = Connector_File()
        connector.verbose = verbose
        connector.config = parameters.copy()

        # only set container if not yet set by a zip before (if this zip is inside another zip)
        if not 'container' in connector.config:
            connector.config['container'] = warcfilename

        i = 0

        with open(warcfilename, 'rb') as stream:
            for record in ArchiveIterator(stream):
                i += 1

                if record.rec_type == 'response':

                    print(record.rec_headers)

                    # write WARC record content to tempfile
                    tempfilename = temp_dirname + \
                        os.path.sep + 'warcrecord' + str(i)
                    tmpfile = open(tempfilename, 'wb')
                    tmpfile.write(record.content_stream().read())
                    tmpfile.close()

                    # set last modification time of the file to WARC-Date
                    try:
                        last_modified = time.mktime(time.strptime(
                            record.rec_headers.get_header('WARC-Date'), '%Y-%m-%dT%H:%M:%SZ'))
                        os.utime(tempfilename, (last_modified, last_modified))
                    except BaseException as e:
                        sys.stderr.write("Exception while reading filedate to warc content {} from {} : {}\n".format(
                            tempfilename, connector.config['container'], e))

                    # set id (URL and WARC Record ID)
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    if not uri.endswith('/'):
                        uri += '/'
                    connector.config['id'] = uri + record.rec_headers.get_header('WARC-Record-ID')

                    # index the extracted file
                    try:

                        connector.index_file(filename=tempfilename)

                    except KeyboardInterrupt:
                        raise KeyboardInterrupt

                    except BaseException as e:
                        sys.stderr.write("Exception while indexing warc content {} from {} : {}\n".format(
                            tempfilename, connector.config['container'], e))

                    os.remove(tempfilename)

        shutil.rmtree(temp_dirname)

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_xml.py
================================================
import xml.etree.ElementTree as ElementTree
import os.path
import sys


class enhance_xml(object):

    def elements2data(self, element, data, path="xml"):

        path += "/" + element.tag

        fieldname = path + '_ss'

        text = element.text.strip()

        if text:
            if fieldname in data:
                data[fieldname].append(text)
            else:
                data[fieldname] = [text]

        for child in element:
            data = self.elements2data(element=child, path=path, data=data)

        return data

    # get xml filename by mapping configuration
    def get_xml_filename(self, filename, mapping):

        dirname = os.path.dirname(filename)
        basename = os.path.basename(filename)

        xmlfilename = mapping

        xmlfilename = xmlfilename.replace('%DIRNAME%', dirname)
        xmlfilename = xmlfilename.replace('%BASENAME%', dirname)

        if not os.path.isfile(xmlfilename):
            xmlfilename = False

        return xmlfilename

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        mapping = parameters['xml_sidecar_file_mapping']

        #
        # is there a xml sidecar file?
        #

        xmlfilename = self.get_xml_filename(filename, mapping)

        if verbose:

            if xmlfilename:

                print('XML sidecar file: {}'.format(xmlfilename))

            else:
                print("No xml sidecar file")

        #
        # read meta data from the XML sidecar file
        #
        if xmlfilename:

            if verbose:
                print("Reading XML sidecar file: {}".format(xmlfilename))
            try:

                # Parse the XML file
                parser = ElementTree.XMLParser()
                et = ElementTree.parse(xmlfilename, parser)
                root = et.getroot()

                for child in root:
                    self.elements2data(element=child, path=root.tag, data=data)

            except BaseException as e:
                sys.stderr.write(
                    "Exception while parsing XML {} {}".format(xmlfilename, e))

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_xmp.py
================================================
import xml.etree.ElementTree as ElementTree
import os.path
import sys


#
# is there a xmp sidecar file?
#

def get_xmp_filename(filename):

    xmpfilename = False

    # some xmp sidecar filenames are based on the original filename without extensions like .jpg or .jpeg
    filenamewithoutextension = '.' . join(filename.split('.')[:-1])

    # check if a xmp sidecar file exists
    if os.path.isfile(filename + ".xmp"):
        xmpfilename = filename + ".xmp"
    elif os.path.isfile(filename + ".XMP"):
        xmpfilename = filename + ".XMP"
    elif os.path.isfile(filenamewithoutextension + ".xmp"):
        xmpfilename = filenamewithoutextension + ".xmp"
    elif os.path.isfile(filenamewithoutextension + ".XMP"):
        xmpfilename = filenamewithoutextension + ".XMP"

    return xmpfilename


# Build path facets from filename

class enhance_xmp(object):
    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        #
        # is there a xmp sidecar file?
        #
        xmpfilename = get_xmp_filename(filename)

        if not xmpfilename:
            if verbose:
                print("No xmp sidecar file")

        #
        # read meta data of the xmp sidecar file (= xml + rdf)
        #
        if xmpfilename:

            creator = False
            headline = False
            creator = False
            location = False
            tags = []

            if verbose:
                print("Reading xmp sidecar file {}".format(xmpfilename))
            try:

                # Parse the xmp file with utf 8 encoding
                parser = ElementTree.XMLParser(encoding="utf-8")
                et = ElementTree.parse(xmpfilename, parser)
                root = et.getroot()

                # get author
                try:
                    creator = root.findtext(
                        ".//{http://purl.org/dc/elements/1.1/}creator")

                    if creator:
                        data['author_ss'] = creator

                except BaseException as e:
                    sys.stderr.write("Exception while parsing creator from xmp {} {}".format(
                        xmpfilename, e.args[0]))

                # get headline
                try:
                    headline = root.findtext(
                        ".//{http://ns.adobe.com/photoshop/1.0/}Headline")

                    if headline:
                        data['title_txt'] = headline

                except BaseException as e:
                    sys.stderr.write("Exception while parsing headline from xmp {} {}".format(
                        xmpfilename, e.args[0]))

                # get location
                try:
                    location = root.findtext(
                        ".//{http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/}Location")

                    if location:

                        if 'locations_ss' in data:
                            data['locations_ss'].append(location)
                        else:
                            data['locations_ss'] = [location]

                except BaseException as e:
                    sys.stderr.write("Exception while parsing location from xmp {} {}".format(
                        xmpfilename, e.args[0]))

                # get tags (named "subject")
                try:
                    for tag in root.findall(".//{http://purl.org/dc/elements/1.1/}subject/{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Bag/{http://www.w3.org/1999/02/22-rdf-syntax-ns#}li"):
                        try:
                            if 'tag_ss' in data:
                                data['tag_ss'].append(tag.text)
                            else:
                                data['tag_ss'] = [tag.text]

                        except BaseException as e:
                            sys.stderr.write("Exception while parsing a tag from xmp {} {}".format(
                                xmpfilename, e.args[0]))
                except BaseException as e:
                    sys.stderr.write("Exception while parsing tags from xmp {} {}".format(
                        xmpfilename, e.args[0]))

            except BaseException as e:
                sys.stderr.write("Exception while parsing xmp {} {}".format(
                    xmpfilename, e.args[0]))

        return parameters, data


================================================
FILE: src/opensemanticetl/enhance_zip.py
================================================
import zipfile
import sys
import hashlib
import tempfile
import os
import shutil
from etl_file import Connector_File


class enhance_zip(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = parameters['filename']

        # if the processed file was extracted from a zip (parameter container was set), write container setting in data, so the link of the id/content can be set to the zip file
        if 'container' in parameters:
            if not 'container_s' in data:
                data['container_s'] = parameters['container']

        # if this file is a zip file, unzip it
        if zipfile.is_zipfile(filename):
            self.unzip_and_index_files(
                zipfilename=filename, parameters=parameters, verbose=verbose)

        return parameters, data

    # unzip all content and index each file with literal filename of the zip file in field container
    def unzip_and_index_files(self, zipfilename, parameters=None, verbose=False):
        if parameters is None:
            parameters = {}

        # create temp dir where to unzip the archive

        if 'tmp' in parameters:
            system_temp_dirname = parameters['tmp']
            if not os.path.exists(system_temp_dirname):
                os.mkdir(system_temp_dirname)
        else:
            system_temp_dirname = tempfile.gettempdir()

        # we build temp dirname ourselfes instead of using system_temp_dirname so we can use configurable / external tempdirs
        h = hashlib.md5(parameters['id'].encode('UTF-8'))
        temp_dirname = system_temp_dirname + os.path.sep + \
            "opensemanticetl_enhancer_zip_" + h.hexdigest()

        if os.path.exists(temp_dirname) == False:
            os.mkdir(temp_dirname)

        # unzip the files
        my_zip = zipfile.ZipFile(zipfilename)
        my_zip.extractall(temp_dirname)
        my_zip.close()

        # prepare document processing
        connector = Connector_File()
        connector.verbose = verbose
        connector.config = parameters.copy()

        # only set container if not yet set by a zip before (if this zip is inside another zip)
        if not 'container' in connector.config:
            connector.config['container'] = zipfilename

        # walk trough all unzipped directories / files and index all files
        for dirName, subdirList, fileList in os.walk(temp_dirname):

            if verbose:
                print('Scanning directory: %s' % dirName)

            for fileName in fileList:
                if verbose:
                    print('Scanning file: %s' % fileName)

                try:
                    # replace temp dirname from indexed id
                    zipped_dirname = dirName.replace(temp_dirname, '', 1)

                    # build a virtual filename pointing to original zip file

                    if zipped_dirname:
                        zipped_dirname = zipped_dirname + os.path.sep
                    else:
                        zipped_dirname = os.path.sep

                    connector.config['id'] = parameters['id'] + \
                        zipped_dirname + fileName

                    unziped_filename = dirName + os.path.sep + fileName

                    try:

                        connector.index_file(filename=unziped_filename)

                    except KeyboardInterrupt:
                        raise KeyboardInterrupt

                    except BaseException as e:
                        sys.stderr.write("Exception while indexing zipped content {} from {} : {}\n".format(
                            fileName, connector.config['container'], e))

                    os.remove(unziped_filename)

                except BaseException as e:
                    sys.stderr.write(
                        "Exception while indexing file {} : {}\n".format(fileName, e))

        shutil.rmtree(temp_dirname)


================================================
FILE: src/opensemanticetl/etl.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import datetime
import importlib
import os
import sys

import filter_blacklist

#
# Extract Transform Load (ETL):
#

# Runs the configured plugins with parameters from configs it reads
#
# Then exports data like content, data enrichment or analytics results generated by the plugins
# to index or database


class ETL(object):

    def __init__(self, plugins=(), verbose=False):

        self.verbose = verbose

        self.config = {}

        self.config['plugins'] = list(plugins)

        self.set_configdefaults()

    def set_configdefaults(self):

        #
        # Standard config
        #
        # Do not edit config here! Overwrite options in /etc/opensemanticsearch/etl or connector configs
        #

        self.config['plugins'] = ['enhance_extract_text_tika_server',
                                  'enhance_detect_language_tika_server']
        self.config['export'] = 'export_solr'
        self.config['regex_lists'] = []

        self.config['raise_pluginexception'] = False

    def init_exporter(self):

        exporter = self.config['export']

        module = importlib.import_module(exporter)
        objectreference = getattr(module, exporter)
        self.exporter = objectreference(self.config)

    def read_configfile(self, configfile):
        result = False

        if os.path.isfile(configfile):
            config = self.config
            file = open(configfile, "r")
            exec(file.read(), locals())
            file.close()
            self.config = config

            result = True

        # if another exporter
        self.init_exporter()

    def is_plugin_blacklisted_for_contenttype(self, plugin, parameters, data):

        blacklisted = False

        # is there a content type yet?
        if 'content_type_ss' in data:
            content_types = data['content_type_ss']
        elif 'content_type_ss' in parameters:
            content_types = parameters['content_type_ss']
        else:
            content_types = None

        # if content type check the plugins' blacklists
        if content_types:

            if not isinstance(content_types, list):
                content_types = [content_types]

            for content_type in content_types:
                # Do not try to blacklist by content type if none was determined.
                if not content_type:
                    continue

                # directory where the plugins' blacklist are
                blacklistdir = '/etc/opensemanticsearch/blacklist/' + plugin + '/'

                filename = blacklistdir + 'blacklist-contenttype'
                if os.path.isfile(filename):
                    if filter_blacklist.is_in_list(filename=filename, value=content_type):
                        blacklisted = True

                if not blacklisted:
                    filename = blacklistdir + 'blacklist-contenttype-prefix'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type, match="prefix"):
                            blacklisted = True

                if not blacklisted:
                    filename = blacklistdir + 'blacklist-contenttype-suffix'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type, match="suffix"):
                            blacklisted = True

                if not blacklisted:
                    filename = blacklistdir + 'blacklist-contenttype-regex'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type, match="regex"):
                            blacklisted = True

                # check whitelists for plugin, if blacklisted but should not
                if blacklisted:
                    filename = blacklistdir + 'whitelist-contenttype'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type):
                            blacklisted = False

                if blacklisted:
                    filename = blacklistdir + 'whitelist-contenttype-prefix'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type, match="prefix"):
                            blacklisted = False

                if blacklisted:
                    filename = blacklistdir + 'whitelist-contenttype-suffix'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type, match="suffix"):
                            blacklisted = False

                if blacklisted:
                    filename = blacklistdir + 'whitelist-contenttype-regex'
                    if os.path.isfile(filename):
                        if filter_blacklist.is_in_list(filename=filename, value=content_type, match="regex"):
                            blacklisted = False

        return blacklisted

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}
        
        time_start = datetime.datetime.now()

        if 'plugins' in parameters:
            plugins = sort_plugins(parameters['plugins'])
        else:
            plugins = sort_plugins(self.config['plugins'])

        data['etl_error_plugins_ss'] = []
        data['etl_error_txt'] = []

        for plugin in plugins:

            data['etl_error_' + plugin + '_txt'] = []

            # if content_type / plugin combination blacklisted, continue with next plugin
            if self.is_plugin_blacklisted_for_contenttype(plugin, parameters, data):

                if self.verbose:
                    print(
                        "Not starting plugin {} because this plugin is blacklisted for the contenttype".format(plugin))

                # mark plugin as blacklisted
                data['etl_' + plugin + '_blacklisted_b'] = True

                continue

            # start plugin
            if self.verbose:
                print("Starting plugin {}".format(plugin))

            time_plugin_start = datetime.datetime.now()
            try:
                module = importlib.import_module(plugin)

                objectreference = getattr(module, plugin, False)

                # if object oriented programming, run instance of object and call its "process" function
                if objectreference:
                    enhancer = objectreference()

                    parameters, data = enhancer.process(
                        parameters=parameters, data=data)

                else:  # else call "process"-function
                    functionreference = getattr(module, 'process', False)

                    if functionreference:

                        parameters, data = functionreference(parameters, data)

                    else:
                        sys.stderr.write(
                            "Exception while data enrichment with plugin {}: Module implements wether object \"{}\" nor function \"process\"\n".format(plugin, plugin))

            # if exception because user interrupted processing by keyboard, respect this and abort
            except KeyboardInterrupt:
                raise KeyboardInterrupt

            # else dont break because fail of a plugin
            # (maybe other plugins or data extraction will success),
            # only error message
            except BaseException as e:

                error_message(
                    docid=parameters['id'], data=data, plugin=plugin, e=e)

                if self.config['raise_pluginexception']:
                    raise

            time_plugin_end = datetime.datetime.now()
            time_plugin_delta = time_plugin_end - time_plugin_start
            data['etl_' + plugin + '_time_millis_i'] = int(time_plugin_delta.total_seconds() * 1000)

            # mark plugin as run
            data['etl_' + plugin + '_b'] = True

            # Abort plugin chain if plugin set parameters['break'] to True
            # (used for example by blacklist or exclusion plugins)
            abort = parameters.get('break', False)

            if abort:
                break

        time_end = datetime.datetime.now()
        time_delta = time_end - time_start
        data['etl_time_millis_i'] = int(time_delta.total_seconds() * 1000)

        # if processing aborted (f.e. by blacklist filter or file modification time did not change)
        abort = parameters.get('break', False)
        if not abort:

            if 'export' in parameters:
                exporter = parameters['export']
            else:
                exporter = self.config['export']

            if exporter:
                # export results (data) to db/storage/index
                module = importlib.import_module(exporter)
                objectreference = getattr(module, exporter)
                self.exporter = objectreference(self.config)

                try:

                    parameters, data = self.exporter.process(
                        parameters=parameters, data=data)

                # if exception because user interrupted processing by keyboard, respect this and abbort
                except KeyboardInterrupt:
                    raise KeyboardInterrupt
                except BaseException as e:
                    sys.stderr.write(
                        "Error while exporting to index or database: {}\n".format(parameters['id']))
                    raise e

        return parameters, data

    def commit(self):

        if self.verbose:
            print("Committing cached or open transactions to index")

        self.exporter.commit()


# append values (i.e. from an enhancer) to data structure
def append(data, facet, values):

    # if facet there yet, append/extend the values, else set values to facet
    if facet in data:

        # if new value(s) single value instead of list convert to list
        if not isinstance(values, list):
            values = [values]

        # if facet in data single value instead of list convert to list
        if not isinstance(data[facet], list):
            data[facet] = [data[facet]]

        # add new values to this list
        data[facet].extend(values)

        # dedupe data in facet
        data[facet] = list(set(data[facet]))

        # if only one value, it has not to be a list
        if len(data[facet]) == 1:
            data[facet] = data[facet][0]

    else:
        data[facet] = values


# Append errors to data/index and print error message
# so we have a log and can see something went wrong within search engine and / or filter for that

def error_message(docid, data, plugin, e):

    try:

        errormessage = "{}".format(e)

        # add error status and message to data to be indexed

        if 'etl_error_txt' in data:
            data['etl_error_txt'].append(errormessage)
        else:
            data['etl_error_txt'] = [errormessage]

        if 'etl_error_plugins_ss' in data:
            data['etl_error_plugins_ss'].append(plugin)
        else:
            data['etl_error_plugins_ss'] = [plugin]

        data['etl_error_' + plugin + '_txt'] = errormessage

        sys.stderr.write(
            "Exception while data enrichment of {} with plugin {}: {}\n".format(docid, plugin, e))

    except:

        sys.stderr.write(
            "Exception while generating error message for exception while processing plugin {} for file {}\n".format(
                plugin, docid))


#
# sort added plugins because of dependencies
#

def sort_plugins(plugins):

    # OCR has to be done before language detection, since content maybe only scanned text within images
    if "enhance_detect_language_tika_server" in plugins and "enhance_pdf_ocr" in plugins:
        if plugins.index("enhance_pdf_ocr") > plugins.index("enhance_detect_language_tika_server"):
            # remove after
            plugins.remove("enhance_pdf_ocr")
            # add before
            plugins.insert(plugins.index(
                "enhance_detect_language_tika_server"), "enhance_pdf_ocr")

    # manual annotations should be found by fulltext search too
    # (automatic entity linking does by including the text or synonym)
    # so read before generating the default search fields like _text_ or text_txt_languageX by enhance_multilingual
    if "enhance_rdf_annotations_by_http_request" in plugins and "enhance_multilingual" in plugins:
        if plugins.index("enhance_rdf_annotations_by_http_request") > plugins.index("enhance_multilingual"):
            # remove after
            plugins.remove(
                "enhance_rdf_annotations_by_http_request")
            # add before
            plugins.insert(plugins.index(
                "enhance_multilingual"), "enhance_rdf_annotations_by_http_request")

    if "enhance_annotations" in plugins and "enhance_multilingual" in plugins:
        if plugins.index("enhance_annotations") > plugins.index("enhance_multilingual"):
            # remove after
            plugins.remove(
                "enhance_annotations")
            # add before
            plugins.insert(plugins.index(
                "enhance_multilingual"), "enhance_annotations")

    return plugins


================================================
FILE: src/opensemanticetl/etl_delete.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import importlib

from etl import ETL
import enhance_mapping_id

class Delete(ETL):
    def __init__(self, verbose=False, quiet=True):

        ETL.__init__(self, verbose=verbose)

        self.quiet = quiet

        self.set_configdefaults()

        self.read_configfiles()

        # read on what DB or search server software our index is
        export = self.config['export']

        # call delete function of the configured exporter
        module = importlib.import_module(export)
        objectreference = getattr(module, export)
        self.connector = objectreference()

    def set_configdefaults(self):
        #
        # Standard config
        #
        # Do not edit config here! Overwrite options in /etc/etl/ or /etc/opensemanticsearch/connector-files
        #

        ETL.set_configdefaults(self)

        self.config['force'] = False

    def read_configfiles(self):
        #
        # include configs
        #

        # Windows style filenames
        self.read_configfile('conf\\opensemanticsearch-etl')
        self.read_configfile('conf\\opensemanticsearch-connector-files')

        # Linux style filenames
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/connector-files')

    def delete(self, uri):

        if 'mappings' in self.config:
            uri = enhance_mapping_id.mapping(value=uri, mappings=self.config['mappings'])
        
        if self.verbose:
            print("Deleting from index {}".format(uri))

        self.connector.delete(parameters=self.config, docid=uri)

    def empty(self):

        if self.verbose:
            print("Deleting all documents from index")

        self.connector.delete(parameters=self.config, query="*:*")


#
# Read command line arguments and start
#

# if running (not imported to use its functions), run main function
if __name__ == "__main__":

    from optparse import OptionParser

    # get uri or filename from args

    parser = OptionParser("etl-delete [options] URI(s)")
    parser.add_option("-e", "--empty", dest="empty", action="store_true",
                      default=False,
                      help="Empty the index (delete all documents in index)")
    parser.add_option("-v", "--verbose", dest="verbose", action="store_true",
                      default=None, help="Print debug messages")
    parser.add_option("-c", "--config", dest="config", default=False,
                      help="Config file")

    (options, args) = parser.parse_args()

    if not options.empty and len(args) < 1:
        parser.error("No URI given")

    connector = Delete()

    connector.read_configfile('/etc/etl/config')

    # add optional config parameters
    if options.config:
        connector.read_configfile(options.config)

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    if options.empty:
        print(
            "This will delete the whole index, are you sure ? Then enter \"yes\"")
        descision = input()
        if descision == "yes" or "Yes" or "YES":
            connector.empty()

    # index each filename
    for uri in args:
        connector.delete(uri)


================================================
FILE: src/opensemanticetl/etl_enrich.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

from etl import ETL
import pysolr
import export_solr
import importlib
import threading

# Todo: Abstraction of querying data to a function of output plugin
# so this will work for other index or database than Solr, too

# Todo: Yet problem, if you run only enrichment plugins (i.e. OCR) without container plugins (i.e. mailbox extractor or ZIP archive extractor), if you want to run this plugin with a container plugin:
# because the container is marked as done in first run (but was not extract because container/extraction plugin not active) so next run with this continer_plugins/extractors wont enrich them anymore

# Since we use enrichment of queries only for OCR after indexing in Open Semantic Desktop Search where we know we call the OCR plugin with all container plugins, fixing or maybe making this perfect later (maybe better classification/management of container plugins or defining in plugins, if they need access to file)


class ETL_Enrich(ETL):

    def __init__(self, plugins=(), verbose=False):

        ETL.__init__(self, plugins=list(plugins), verbose=verbose)

        self.read_configfile('/etc/etl/config')
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/enhancer-rdf')

        self.fields = self.getfieldnames_from_plugins()

        # init exporter	(todo: exporter as extended PySolr)
        self.export_solr = export_solr.export_solr()

        # init PySolr
        solr_uri = self.config['solr']
        if not solr_uri.endswith('/'):
            solr_uri += '/'
        solr_uri += self.config['index']

        self.solr = pysolr.Solr(solr_uri)

        self.threads_max = None

        # if not set explicit, autodetection of count of CPUs for amount of threads
        if not self.threads_max:
            import multiprocessing
            self.threads_max = multiprocessing.cpu_count()
            if self.verbose:
                print("Setting threads to count of CPUs: " +
                      str(self.threads_max))

        self.rows_per_step = 100
        if self.rows_per_step < self.threads_max * 2:
            self.rows_per_step = self.threads_max * 2

        self.work_in_progress = []
        self.delete_from_work_in_progress_lock = threading.Lock()

        self.delete_from_work_in_progress_after_commit = []
        self.work_in_progress_lock = threading.Lock()

        self.e_job_done = threading.Event()

    #
    # get all the fields needed by all plugins for analysis
    #

    def getfieldnames_from_plugins(self):

        # the field id is needed for every enrichment
        fields = ['id']

        # read all fieldnames, the plugins need to analyze
        for plugin in self.config['plugins']:

            module = importlib.import_module(plugin)
            objectreference = getattr(module, plugin, False)
            if objectreference:
                modulefields = getattr(objectreference, 'fields', False)
            if modulefields:
                for field in modulefields:
                    # only if not there yet from other plugin
                    if not field in fields:
                        fields.append(field)

        return fields

    #
    # Start ETL process / run of all set plugins
    #

    def enrich_document(self, docid):

        try:

            if self.verbose:
                print("Enriching {}".format((docid)))

            parameters = self.config.copy()

            #
            # read data from analyzed fields and add to parameters
            #

            # id is only field, so we do not have to get it again from index or database
            if len(self.fields) == 1:
                parameters['id'] = docid

            # if more than id needed add fields from DB/index to parameters since that data is analysed by the plugins
            else:

                data = self.export_solr.get_data(
                    docid=docid, fields=self.fields)

                # add  to by analysed data of the first and only result to ETL/Enrichment parameters
                parameters.update(data)

            filename = docid
            # if exist delete protocol prefix file://
            if filename.startswith("file://"):
                filename = filename.replace("file://", '', 1)

            parameters['filename'] = filename

            parameters['verbose'] = self.verbose

            if self.verbose:
                print("Parameters:")
                print(parameters)

            # set markers that enriched by this plugins
            data = {}
            for plugin in self.config['plugins']:
                data['etl_' + plugin + '_b'] = True

            # start ETL / Enrichment process
            parameters, data = self.process(parameters=parameters, data=data)

        finally:

            # remove blacklisting/locking for this document, since enrichment process is now done
            self.work_in_progress_lock.acquire()

            if docid in self.work_in_progress:
                self.delete_from_work_in_progress_lock.acquire()

                self.delete_from_work_in_progress_after_commit.append(docid)

                self.delete_from_work_in_progress_lock.release()

            self.work_in_progress_lock.release()

            # set event, so main thread wakes up and knows next job/thread can be started
            self.e_job_done.set()

    #
    # get query from plugin and start enrichment process for this query
    #

    # not usable for plugin chains (i.e. extract containers like ZIP files and than OCR contents)! Use enrich_query with a compounded query instead.
    def enrich(self):

        for plugin in self.config['plugins']:

            query = "*:* AND NOT (etl_{}_b:true)".format(plugin)

            # check if plugin has own more special query and if so, use this
            module = importlib.import_module(plugin)
            objectreference = getattr(module, plugin, False)
            if objectreference:
                query = getattr(objectreference, 'query', query)

            if self.verbose:
                print("Data enrichment query: {}".format(query))

            # enrich
            self.enrich_query(query)

    # get ids from query
    # get fields from plugins
    # run enrichment chain with this parameters
    def enrich_query(self, query):

        counter = 0

        solrparameters = {
            'fl': 'id',
            'rows': self.rows_per_step,
        }

        # have to proceed all documents matching this query:

        # - all yet not enriched (query for content type AND not plugin_b: true) results of enriched content type (plugin query)
        # OR
        # - container file type like ZIP archive or PST mailbox
        # - AND not yet enriched by all set plugins
        #
        # - but both cases not if no file but content of a container file (for subfiles the enrichment plugins will be runned by the run of the container plugin)

        # query matching i.e. the contenttype of to be enriched files (for example if doing OCR we do only have to process images, not all documents)
        running_plugin_query = '(' + query + ')'

        # not, if all set plugins yet runned on this document
        all_plugin_query = []
        for plugin in self.config['plugins']:
            all_plugin_query.append("(etl_{}_b:true)".format(plugin))
        all_plugin_query = '(' + ' AND '.join(all_plugin_query) + ')'

        # matching container content types like archive files
        # since our content types like f.e. images can be in containers like for example ZIP archives, so we should enrich them too)
        container_query = '(content_type:application\/zip OR id:*.zip OR content_type:application\/vnd.ms-outlook-pst OR id:*.pst)'
        # Todo for more performance:
        # distinct container_s from results with query for only needed content types instead of working trough all containers

        query = running_plugin_query + \
            ' OR (' + container_query + ' AND NOT ' + all_plugin_query + ')'

        # not try to enrich (not existent in filesystem) subfiles inside container files like ZIP archives or extracted mail attachments
        # since for subfiles the enrichment plugins will be runned by the run of the container plugin)
        query = '(' + query + ') AND NOT (container_s:*)'

        if self.verbose:
            print("Enrichment of matches the following query:")
            print(query)

        results = self.solr.search(query, **solrparameters)

        while len(results) > 0:

            for result in results:

                docid = result['id']

                if self.threads_max == 1:

                    # no threading, do it directly in this process
                    self.enrich_document(docid=docid)
                    counter += 1

                else:

                    #
                    # Manage threading
                    #

                    # If doc id blacklistet (work in progress in running threads) don't start thread for docid,
                    # since new search result can include some same documents again which not yet ready enriched because work goes on in a thread from step before.
                    # So continue with next result.
                    if docid in self.work_in_progress:
                        continue

                    # wait for a job done if yet maximum threads (+1 because do not count this main thread) running
                    while threading.active_count() >= self.threads_max + 1:
                        # wait for event that signals that a thread/job finished (set in enrich_document() at end)
                        # use a timeout if racing condition (setting in finished job before clearing this event here)
                        self.e_job_done.wait(1)

                    # blacklist id of document work in progres
                    self.work_in_progress_lock.acquire()
                    self.work_in_progress.append(docid)
                    self.work_in_progress_lock.release()

                    # start enrichment of this document in new thread
                    thread = threading.Thread(
                        target=self.enrich_document, args=(docid, ))
                    self.e_job_done.clear()
                    thread.start()

                    counter += 1

            # do commit, so next query wont find documents again, which were processed but not available for searcher yet
            self.commit()

            #
            # delete done IDs from blacklist
            #

            self.delete_from_work_in_progress_lock.acquire()

            while len(self.delete_from_work_in_progress_after_commit) > 0:
                docid = self.delete_from_work_in_progress_after_commit.pop()

                self.work_in_progress_lock.acquire()
                self.work_in_progress.remove(docid)
                self.work_in_progress_lock.release()

            self.delete_from_work_in_progress_lock.release()

            #
            # If last step (fewer search results than a step manages), wait for all threads to be done
            # before starting new search / next step (which will only find the documents again, that are work in progress in running threads and not done/marked as ready yet) to prevent unnecessary search load
            #

            if len(results) < self.rows_per_step:
                # wait until all started threads done before continue (commit and end),

                while threading.active_count() > 1:
                    # wait for event that signals that a thread/job finished (set in enrich_document() at end)
                    # use a timeout if racing condition (setting in finished job before clearing this event here)
                    self.e_job_done.wait(1)
                    self.e_job_done.clear()

                # commit last results, so very last enrichments have not wait the next autocommit time (if set up)
                self.commit()

            # are there (more) not yet enriched documents in search index for a next step?
            results = self.solr.search(query, **solrparameters)

        # if self.verbose:
        print("Enriched {} documents".format(counter))


# todo: export to Solr by update

if __name__ == "__main__":

    # get uri or filename from args
    from optparse import OptionParser

    parser = OptionParser("etl-enrich [options] --plugins pluginname")

    parser.add_option("-p", "--plugins", dest="plugins",
                      default=False, help="Plugins (comma separated)")
    parser.add_option("-c", "--config", dest="config",
                      default=False, help="Config file")
    parser.add_option("-q", "--query", dest="query",
                      default=False, help="Query")
    parser.add_option("-o", "--outputfile", dest="outputfile", default=False,
                      help="Output file (if exporter set to a file format)")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")
    (options, args) = parser.parse_args()

    etl = ETL_Enrich()

    if options.config:
        etl.read_configfile(options.config)
        etl.fields = etl.getfieldnames_from_plugins()

    if not options.verbose or options.verbose:
        etl.verbose = options.verbose

    # set (or if config overwrite) plugin config
    if options.plugins:
        etl.config['plugins'] = options.plugins.split(',')
        etl.fields = etl.getfieldnames_from_plugins()

    if options.outputfile:
        etl.config['outputfile'] = options.outputfile

    # if query, enrich IDs matching this query
    if options.query:

        etl.enrich_query(options.query)

    # if no query and no id's as argument, use default query from plugins
    elif len(args) == 0:

        etl.enrich()

    # if not query but IDs
    else:

        for uri in args:

            # todo: if not local file, download to temp file if a plugin needs parameter filename

            etl.enrich_document(docid=uri)

        etl.commit()


================================================
FILE: src/opensemanticetl/etl_file.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import os.path
import sys

from etl import ETL


class Connector_File(ETL):

    def __init__(self, verbose=False, quiet=True):

        ETL.__init__(self, verbose=verbose)

        self.quiet = quiet

        self.set_configdefaults()

        self.read_configfiles()

    def set_configdefaults(self):
        #
        # Standard config
        #
        # Do not edit config here!
        # Overwrite options in /etc/etl/
        # or /etc/opensemanticsearch/connector-files
        #

        ETL.set_configdefaults(self)

        self.config['force'] = False

        # filename to URI mapping
        self.config['mappings'] = {"/": "file:///"}

        self.config['facet_path_strip_prefix'] = [
            "file://",
            "http://www.", "https://www.", "http://", "https://"]

        self.config['plugins'] = [
            'enhance_mapping_id',
            'filter_blacklist',
            'filter_file_not_modified',
            'enhance_extract_text_tika_server',
            'enhance_detect_language_tika_server',
            'enhance_contenttype_group',
            'enhance_pst',
            'enhance_csv',
            'enhance_file_mtime',
            'enhance_path',
            'enhance_extract_hashtags',
            'enhance_warc',
            'enhance_zip',
            'clean_title',
            'enhance_multilingual',
        ]

        self.config['blacklist'] = [
            "/etc/opensemanticsearch/blacklist/blacklist-url"]
        self.config['blacklist_prefix'] = [
            "/etc/opensemanticsearch/blacklist/blacklist-url-prefix"]
        self.config['blacklist_suffix'] = [
            "/etc/opensemanticsearch/blacklist/blacklist-url-suffix"]
        self.config['blacklist_regex'] = [
            "/etc/opensemanticsearch/blacklist/blacklist-url-regex"]
        self.config['whitelist'] = [
            "/etc/opensemanticsearch/blacklist/whitelist-url"]
        self.config['whitelist_prefix'] = [
            "/etc/opensemanticsearch/blacklist/whitelist-url-prefix"]
        self.config['whitelist_suffix'] = [
            "/etc/opensemanticsearch/blacklist/whitelist-url-suffix"]
        self.config['whitelist_regex'] = [
            "/etc/opensemanticsearch/blacklist/whitelist-url-regex"]

    def read_configfiles(self):
        #
        # include configs
        #

        # Windows style filenames
        self.read_configfile('conf\\opensemanticsearch-etl')
        self.read_configfile('conf\\opensemanticsearch-enhancer-rdf')
        self.read_configfile('conf\\opensemanticsearch-connector-files')

        # Linux style filenames
        self.read_configfile('/etc/etl/config')
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/etl-webadmin')
        self.read_configfile('/etc/opensemanticsearch/etl-custom')
        self.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
        self.read_configfile('/etc/opensemanticsearch/facets')
        self.read_configfile('/etc/opensemanticsearch/connector-files')
        self.read_configfile('/etc/opensemanticsearch/connector-files-custom')

    # clean filename (convert filename given as URI to filesystem)
    def clean_filename(self, filename):

        # if exist delete prefix file://

        if filename.startswith("file://"):
            filename = filename.replace("file://", '', 1)

        return filename

    # index directory or file
    def index(self, filename):

        # clean filename (convert filename given as URI to filesystem)
        filename = self.clean_filename(filename)

        # if singe file start to index it
        if os.path.isfile(filename):

            self.index_file(filename=filename)

            result = True

        # if directory walkthrough
        elif os.path.isdir(filename):

            self.index_dir(rootDir=filename)

            result = True

        # else error message
        else:

            result = False

            sys.stderr.write(
                "No such file or directory: {}\n".format(filename))

        return result

    # walk through all subdirectories and call index_file for each file
    def index_dir(self, rootDir, followlinks=False):

        for dirName, subdirList, fileList in os.walk(rootDir,
                                                     followlinks=followlinks):

            if self.verbose:
                print("Scanning directory: {}".format(dirName))

            for fileName in fileList:
                if self.verbose:
                    print("Scanning file: {}".format(fileName))

                try:

                    fullname = dirName
                    if not fullname.endswith(os.path.sep):
                        fullname += os.path.sep
                    fullname += fileName

                    self.index_file(filename=fullname)

                except KeyboardInterrupt:
                    raise KeyboardInterrupt
                except BaseException as e:
                    try:
                        sys.stderr.write(
                            "Exception while processing file {}{}{} : {}\n"
                            .format(dirName, os.path.sep, fileName, e))
                    except BaseException:
                        sys.stderr.write(
                            "Exception while processing a file and exception "
                            "while printing error message (maybe problem with"
                            " encoding of filename on console or converting "
                            "the exception to string?)\n")

    # Index a file
    def index_file(self, filename, additional_plugins=()):

        # clean filename (convert filename given as URI to filesystem)
        filename = self.clean_filename(filename)

        # fresh parameters / chain for each file (so processing one file will
        # not change config/parameters for next, if directory or multiple
        # files, which would happen if given by reference)
        parameters = self.config.copy()
        if additional_plugins:
            parameters['plugins'].extend(additional_plugins)

        if self.verbose:
            parameters['verbose'] = True

        data = {}

        # add this connector name to ETL status
        data['etl_file_b'] = True

        if 'id' not in parameters:
            parameters['id'] = filename

        parameters['filename'] = filename

        parameters, data = self.process(parameters=parameters, data=data)

        return parameters, data

#
# Read command line arguments and start
#


# if running (not imported to use its functions), run main function
if __name__ == "__main__":

    from argparse import ArgumentParser

    # get uri or filename and (optional) parameters from args

    def key_val(s):
        return s.split("=")

    parser = ArgumentParser("etl-file")
    parser.add_argument("-q", "--quiet",
                        action="store_true",
                        default=None,
                        help="Don\'t print status (filenames) while indexing")
    parser.add_argument("-v", "--verbose", dest="verbose",
                        action="store_true",
                        default=None, help="Print debug messages")
    parser.add_argument("-f", "--force", dest="force", action="store_true",
                        default=None,
                        help="Force (re)indexing, even if no changes")
    parser.add_argument("-c", "--config",
                        help="Config file")
    parser.add_argument("-p", "--plugins",
                        type=lambda s: s.split(","),
                        help="Plugin chain to use instead configured "
                        "plugins (comma separated and in order)")
    parser.add_argument("-a", "--additional-plugins",
                        dest="additional_plugins",
                        type=lambda s: s.split(","),
                        help="Plugins to add to default/configured plugins"
                        " (comma separated and in order)")
    parser.add_argument("-w", "--outputfile", dest="outputfile",
                        help="Output file")
    parser.add_argument("--param", action="append",
                        type=key_val,
                        help="Set a config parameter (key=value). "
                        "Can be specified multiple times")
    parser.add_argument("args", nargs="+", help="Input files")

    options = {key: val for key, val in vars(parser.parse_args()).items()
               if val is not None}

    args = options.pop("args")

    connector = Connector_File()

    # add optional config parameters
    config = options.pop("config", None)
    if config:
        connector.read_configfile(config)

    plugins = options.pop("plugins", []) + \
        options.pop("additional_plugins", [])

    # set (or if config overwrite) plugin config
    if plugins:
        connector.config['plugins'] = plugins

    connector.config.update(dict(options.pop("param", {})))

    connector.config.update(options)

    # index each filename
    for filename in args:
        connector.index(filename)

    # commit changes, if not yet done automatically by index timer
    connector.commit()

    # after file or files processed with basic/first stage config
    # if plugins or config options configured for later stage, reprocess with additional config
    additional_plugins_later = connector.config.get('additional_plugins_later', [])

    additional_plugins_later_config = connector.config.get('additional_plugins_later_config', {})

    if len(additional_plugins_later) > 0 or len(additional_plugins_later_config) > 0:

        if connector.config['verbose']:
            print("There are options configured for later stage, so (re)processing with additional plugins {} and/or config {}"
                  .format(additional_plugins_later, additional_plugins_later_config))

        for option in additional_plugins_later_config:
            connector.config[option] = additional_plugins_later_config[option]

            connector.config['plugins'].extend(additional_plugins_later)
            
        # index each filename
        for filename in args:
            connector.index(filename=filename)

        # commit changes, if not yet done automatically by index timer
        connector.commit()


================================================
FILE: src/opensemanticetl/etl_filedirectory.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

from etl_file import Connector_File

#
# Parallel processing of files by adding each file to celery tasks
#


class Connector_Filedirectory(Connector_File):

    def __init__(self, verbose=False, quiet=False):

        Connector_File.__init__(self, verbose=verbose)

        self.quiet = quiet

        # apply filters before adding to queue, so filtered or yet indexed files not added to queue
        # adding to queue by plugin export_queue_files

        # exporter to index filenames before text extraction and other later tasks
        # will run before adding tasks to queue by export_queue_files
        # so reseted plugin status will be in index before started ETL tasks apply not modified filter
        export_to_index = self.config['export']

        self.config['plugins'] = [
            'enhance_mapping_id',
            'filter_blacklist',
            'filter_file_not_modified',
            'enhance_file_mtime',
            'enhance_path',
            'enhance_entity_linking',
            'enhance_multilingual',
            export_to_index,
            'export_queue_files',
        ]

#
# Read command line arguments and start
#


# if running (not imported to use its functions), run main function
if __name__ == "__main__":

    from optparse import OptionParser

    # get uri or filename from args

    parser = OptionParser("etl-filedirectory [options] filename")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
                      default=None, help="Don\'t print status (filenames) while indexing")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")

    (options, args) = parser.parse_args()

    if len(args) < 1:
        parser.error("No filename given")

    connector = Connector_Filedirectory()

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    if options.quiet == False or options.quiet == True:
        connector.quiet = options.quiet

    # index each filename
    for filename in args:
        connector.index(filename)


================================================
FILE: src/opensemanticetl/etl_filemonitoring.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

from argparse import ArgumentParser

import pyinotify

from tasks import index_file
from tasks import delete

from etl import ETL
from enhance_mapping_id import mapping

from move_indexed_file import move_files, move_dir


class EventHandler(pyinotify.ProcessEvent):

    def __init__(self):
        super().__init__()
        self.verbose = False
        self.config = {}

    def process_IN_CLOSE_WRITE(self, event):
        if self.verbose:
            print("Close_write: {}".format(event.pathname))

        self.index_file(filename=event.pathname)

    def process_IN_MOVED_TO(self, event):
        if self.verbose:
            print("Move: {} -> {}".format(event.src_pathname, event.pathname))

        if event.dir:
            self.move_dir(src=event.src_pathname, dest=event.pathname)
        else:
            self.move_file(src=event.src_pathname, dest=event.pathname)

    def process_IN_DELETE(self, event):

        if self.verbose:
            print("Delete {}:".format(event.pathname))

        self.delete_file(filename=event.pathname)

    #
    # write to queue
    #

    def move_file(self, src, dest):
        if self.verbose:
            print("Moving file from {} to {}".format(src, dest))
        solr_uri = self.config["solr"] + self.config["index"]
        if not solr_uri.endswith("/"):
            solr_uri += "/"
        move_files(solr_uri, moves={src: dest}, prefix="file://")

    def move_dir(self, src, dest):
        if self.verbose:
            print("Moving dir from {} to {}".format(src, dest))
        solr_uri = self.config["solr"] + self.config["index"]
        if not solr_uri.endswith("/"):
            solr_uri += "/"
        move_dir(solr_uri, src=src, dest=dest, prefix="file://")

    def index_file(self, filename):
        if self.verbose:
            print("Indexing file {}".format(filename))

        index_file.apply_async(
            kwargs={'filename': filename}, queue='open_semantic_etl_tasks', priority=5)

    def delete_file(self, filename):
        uri = filename
        if 'mappings' in self.config:
            uri = mapping(value=uri, mappings=self.config['mappings'])

        if self.verbose:
            print("Deleting from index filename {} with URL {}".format(
                filename, uri))

        delete.apply_async(kwargs={'uri': uri}, queue='open_semantic_etl_tasks', priority=6)


class Filemonitor(ETL):

    def __init__(self, verbose=False):

        ETL.__init__(self, verbose=verbose)

        self.verbose = verbose

        self.read_configfiles()

        # Watched events
        #
        # We need IN_MOVE_SELF to track moved folder paths
        # pyinotify-internally. If omitted, the os instructions
        # mv /docs/src /docs/dest; touch /docs/dest/doc.pdf
        # will produce a IN_MOVED_TO pathname=/docs/dest/ followed by
        # IN_CLOSE_WRITE pathname=/docs/src/doc.pdf
        # where we would like a IN_CLOSE_WRITE pathname=/docs/dest/doc.pdf
        self.mask = (
            pyinotify.IN_DELETE
            | pyinotify.IN_CLOSE_WRITE
            | pyinotify.IN_MOVED_TO
            | pyinotify.IN_MOVED_FROM
            | pyinotify.IN_MOVE_SELF
        )

        self.watchmanager = pyinotify.WatchManager()  # Watch Manager

        self.handler = EventHandler()

        self.notifier = pyinotify.Notifier(self.watchmanager, self.handler)

    def read_configfiles(self):
        #
        # include configs
        #

        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/connector-files')

    def add_watch(self, filename):

        self.watchmanager.add_watch(
            filename, self.mask, rec=True, auto_add=True)

    @staticmethod
    def add_watches_from_file(filename):
        listfile = open(filename)
        for line in listfile:
            filename = line.strip()
            # ignore empty lines and comment lines (starting with #)
            if filename and not filename.startswith("#"):

                filemonitor.add_watch(filename)

    def watch(self):

        self.handler.config = self.config
        self.handler.verbose = self.verbose

        self.notifier.loop()


# parse command line options
parser = ArgumentParser(description="etl-filemonitor")
parser.add_argument("-v", "--verbose", dest="verbose",
                    action="store_true", default=False,
                    help="Print debug messages")
parser.add_argument("-f", "--fromfile", dest="fromfile",
                    default=None, help="File names config")
parser.add_argument("watchfiles", nargs="*",
                    default=(), help="Files / directories to watch")
args = parser.parse_args()


filemonitor = Filemonitor(verbose=args.verbose)


# add watches for every file/dir given as command line parameter
for _filename in args.watchfiles:
    filemonitor.add_watch(_filename)


# add watches for every file/dir in list file
if args.fromfile is not None:
    filemonitor.add_watches_from_file(args.fromfile)


# start watching
filemonitor.watch()


================================================
FILE: src/opensemanticetl/etl_hypothesis.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

#
# Import annotations from Hypothesis - https://hypothes.is
#

import requests
import json
import sys

from etl import ETL

from etl_web import Connector_Web

import export_solr


class Connector_Hypothesis(ETL):

    verbose = False

    documents = True

    token = None

    api = 'https://hypothes.is/api/'

    # how many annotations to download at once / per page
    limit = 10

    # initialize Open Semantic ETL
    etl = ETL()
    etl.read_configfile('/etc/etl/config')
    etl.read_configfile('/etc/opensemanticsearch/etl')
    etl.read_configfile('/etc/opensemanticsearch/hypothesis')
    etl.verbose = verbose

    exporter = export_solr.export_solr()

    #
    # index the annotated document, if not yet in index
    #

    def etl_document(self, uri):

        result = True
        doc_mtime = self.exporter.get_lastmodified(docid=uri)

        if doc_mtime:

            if self.verbose:
                print(
                    "Annotated document in search index. No new indexing of {}".format(uri))

        else:
            # Download and Index the new or updated uri

            if self.verbose:
                print(
                    "Annotated document not in search index. Start indexing of {}".format(uri))

            try:
                etl = Connector_Web()
                etl.index(uri=uri)
            except KeyboardInterrupt:
                raise KeyboardInterrupt
            except BaseException as e:
                sys.stderr.write(
                    "Exception while getting {} : {}".format(uri, e))
                result = False
        return result

    #
    # import an annotation
    #

    def etl_annotation(self, annotation):

        parameters = {}
        parameters['plugins'] = ['enhance_multilingual']

        # since there can be multiple annotations for same URI,
        # do not overwrite but add value to existent values of the facet/field/property
        parameters['add'] = True
        data = {}

        # id/uri of the annotated document, not the annotation id
        parameters['id'] = annotation['uri']

        # first index / etl the webpage / document that has been annotated if not yet in index
        if self.documents:
            result = self.etl_document(uri=annotation['uri'])
        if not result:
            data['etl_error_hypothesis_ss'] = "Error while indexing the document that has been annotated"

        # annotation id
        data['annotation_id_ss'] = annotation['id']

        data['annotation_text_txt'] = annotation['text']

        tags = []
        if 'tags' in annotation:

            if self.verbose:
                print("Tags: {}".format(annotation['tags']))

            for tag in annotation['tags']:
                tags.append(tag)
        data['annotation_tag_ss'] = tags

        # write annotation to database or index
        self.etl.process(parameters=parameters, data=data)

    #
    # import all annotations since last imported annotation
    #

    def etl_annotations(self, last_update="", user=None, group=None, tag=None, uri=None):

        newest_update = last_update

        if not self.api.endswith('/'):
            self.api = self.api + '/'

        searchurl = '{}search?limit={}&sort=updated&order=desc'.format(
            self.api, self.limit)

        if user:
            searchurl += "&user={}".format(user)

        if group:
            searchurl += "&group={}".format(group)

        if tag:
            searchurl += "&tag={}".format(tag)

        if uri:
            searchurl += "&uri={}".format(uri)

        # Authorization
        headers = {'user-agent': 'Open Semantic Search'}

        if self.token:
            headers['Authorization'] = 'Bearer ' + self.token

        # stats
        stat_downloaded_annotations = 0
        stat_imported_annotations = 0
        stat_pages = 0

        offset = 0
        last_page = False

        while not last_page:

            searchurl_paged = searchurl + "&offset={}".format(offset)

            # Call API / download annotations
            if self.verbose:
                print("Calling hypothesis API {}".format(searchurl_paged))

            request = requests.get(searchurl_paged, headers=headers)

            result = json.loads(request.content.decode('utf-8'))

            stat_pages += 1

            if len(result['rows']) < self.limit:
                last_page = True

            # import annotations
            for annotation in result['rows']:

                stat_downloaded_annotations += 1

                if annotation['updated'] > last_update:

                    if self.verbose:
                        print(
                            "Importing new annotation {}annotations/{}".format(self.api, annotation['id']))
                        print(annotation['text'])

                    stat_imported_annotations += 1

                    # save update time from newest annotation/edit
                    if annotation['updated'] > newest_update:
                        newest_update = annotation['updated']

                    self.etl_annotation(annotation)

                else:

                    last_page = True

            offset += self.limit

        # commit to index, if yet buffered
        self.etl.commit()

        if self.verbose:
            print("Downloaded annotations: {}".format(
                stat_downloaded_annotations))
            print("Imported new annotations: {}".format(
                stat_imported_annotations))

        return newest_update


#
# Read command line arguments and start
#

# if running (not imported to use its functions), run main function
if __name__ == "__main__":

    from optparse import OptionParser

    # get uri or filename from args

    parser = OptionParser("etl-file [options] filename")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")
    parser.add_option("-a", "--api", dest="api",
                      default="https://hypothes.is/api/", help="API URL")
    parser.add_option("-p", "--token", dest="token",
                      default=None, help="API token for authorization")
    parser.add_option("-d", "--documents", dest="documents", action="store_true",
                      default=True, help="Index content of annotated document(s), too")
    parser.add_option("-f", "--force", dest="force", action="store_true",
                      default=None, help="Force (re)indexing, even if no changes")
    parser.add_option("-c", "--config", dest="config",
                      default=False, help="Config file")
    parser.add_option("-t", "--tag", dest="tag",
                      default=None, help="Filter for a tag")
    parser.add_option("-u", "--user", dest="user",
                      default=None, help="Filter for an user")
    parser.add_option("-g", "--group", dest="group",
                      default=None, help="Filter for a group")

    (options, args) = parser.parse_args()

    connector = Connector_Hypothesis()

    # add optional config parameters
    if options.config:
        connector.read_configfile(options.config)

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    connector.documents = options.documents

    if options.token:
        connector.token = options.token

    connector.api = options.api

    connector.etl_annotations(
        last_update="", user=options.user, group=options.group, tag=options.tag)


================================================
FILE: src/opensemanticetl/etl_plugin_core.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import itertools

#
# Core functions used by multiple plugins, so they can be inherit from this class
#

class Plugin(object):

    filter_filename_suffixes = []
    filter_mimetype_prefixes = []

    # filter for mimetype prefixes or filename suffixes
    def filter(self, parameters=None, data=None):

        filtered = False
        
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        filename = None
        if 'filename' in parameters:
            filename = parameters['filename']

        mimetype = None
        if 'content_type_ss' in data:
            mimetype = data['content_type_ss']
        elif 'content_type_ss' in parameters:
            mimetype = parameters['content_type_ss']

        # if connector returns a list, use only first value
        # (which is the only entry or the main content type of file)
        if isinstance(mimetype, list):
            mimetype = mimetype[0]

        # is there a filename suffix match?
        match_filename_suffix = False
        if filename:
            for suffix in self.filter_filename_suffixes:
                if filename.lower().endswith(suffix.lower()):
                    if verbose:
                        print('Filename suffix matches plugin filter(s) {}'.format(self.filter_filename_suffixes))
    
                    match_filename_suffix = True
            
        # is there a mimetype prefix match?
        match_contenttype_prefix = False
        if mimetype:
            for prefix in self.filter_mimetype_prefixes:
                if mimetype.lower().startswith(prefix.lower()):
                    if verbose:
                        print('Contenttype matches plugin filter(s) {}'.format(self.filter_mimetype_prefixes))
    
                    match_contenttype_prefix = True

        # if filter(s) configured for file suffix or mimetype prefix, set filtered if matches
        if len(self.filter_mimetype_prefixes)>0 and len(self.filter_filename_suffixes) > 0:
            if not match_filename_suffix and not match_contenttype_prefix:
                if verbose:
                    print('Wether filename suffix nor content type matches plugin filter(s) for mimetypes {} or filename suffixes {}, so no further processing of this plugin'.format(self.filter_mimetype_prefixes, self.filter_filename_suffixes))
                filtered = True
        elif len(self.filter_mimetype_prefixes)>0:
            if not match_contenttype_prefix:
                if verbose:
                    print('Contenttype does not match plugin filter(s) {}, so no further processing of this plugin'.format(self.filter_mimetype_prefixes))
                filtered = True
        elif len(self.filter_filename_suffixes)>0:
            if not match_filename_suffix:
                if verbose:
                    print('Filename suffix does not match plugin filter(s) {}, so no further processing of this plugin'.format(self.filter_filename_suffixes))
                filtered = True
            
        return filtered


def get_text(data):
    values_list = []

    #
    # exclude fields like technical metadata
    #

    exclude_fields_prefix = []

    listfile = open('/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-prefix')
    for line in listfile:
        line = line.strip()
        if line and not line.startswith("#"):
            exclude_fields_prefix.append(line)
    listfile.close()

    # suffixes of non-text fields like nubers
    exclude_fields_suffix = []

    listfile = open('/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname-suffix')
    for line in listfile:
        line = line.strip()
        if line and not line.startswith("#"):
            exclude_fields_suffix.append(line)
    listfile.close()

    # full fieldnames
    exclude_fields = []
    listfile = open('/etc/opensemanticsearch/blacklist/textanalysis/blacklist-fieldname')
    for line in listfile:
        line = line.strip()
        if line and not line.startswith("#"):
            exclude_fields.append(line)
    listfile.close()


    for field in data:

        is_blacklisted = False

        for blacklisted_prefix in exclude_fields_prefix:
            if field.startswith(blacklisted_prefix):
                is_blacklisted = True
        
        for blacklisted_suffix in exclude_fields_suffix:
            if field.endswith(blacklisted_suffix):
                is_blacklisted = True
        
        if field in exclude_fields:
            is_blacklisted = True

        if not is_blacklisted:
    
            values = data[field]
    
            if not isinstance(values, list):
                values = [values]

            values_list.append(values)


    # Flatten:
    values = itertools.chain.from_iterable(values_list)

    # Remove empty values:
    values = filter(None, values)

    # Make sure everything is a string:
    values = (
        value if isinstance(value, str) else "{}".format(value)
        for value in values
    )

    # Ensure a trailing newline:
    values = itertools.chain(values, [""])

    # Concatenate:
    return "\n".join(values)


# append values (i.e. from an enhancer) to data structure
def append(data, facet, values):

    # if facet there yet, append/extend the values, else set values to facet
    if facet in data:

        # if new value(s) single value instead of list convert to list
        if not isinstance(values, list):
            values = [values]

        # if facet in data single value instead of list convert to list
        if not isinstance(data[facet], list):
            data[facet] = [data[facet]]

        # add new values to this list
        data[facet].extend(values)

        # dedupe data in facet
        data[facet] = list(set(data[facet]))

        # if only one value, it has not to be a list
        if len(data[facet]) == 1:
            data[facet] = data[facet][0]

    else:
        data[facet] = values


#
# Get preferred label(s) from field in format pref label and uri
#
def get_preflabels(values):

    uri2preflabel_map = {}

    if values:

        if not isinstance(values, list):
            values = [values]

        for value in values:
            pos_uri = value.rfind(' <')
            uri = value[pos_uri+2:-1]
            preflabel = value[0:pos_uri]
            uri2preflabel_map[uri] = preflabel

    return uri2preflabel_map


def get_all_matchtexts(values):
    
    results = {}

    if not isinstance(values, list):
        values = [values]

    for value in values:

        #get only matchtext (without ID/URI of matching entity)
        value = value.split("\t")
        matchid = value[0]
        matchtext = value[1]
        
        if not matchid in results:
            results[matchid] = []
            
        if not matchtext in results[matchid]:
            results[matchid].append(matchtext)

    return results


================================================
FILE: src/opensemanticetl/etl_rss.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import feedparser
import sys

from etl_web import Connector_Web

import export_solr


class Connector_RSS(Connector_Web):

    def __init__(self, verbose=False, quiet=True):

        Connector_Web.__init__(self, verbose=verbose, quiet=quiet)

        self.quiet = quiet
        self.read_configfiles()

    def read_configfiles(self):
        #
        # include configs
        #

        # windows style filenames
        self.read_configfile('conf\\opensemanticsearch-connector')
        self.read_configfile('conf\\opensemanticsearch-enhancer-ocr')
        self.read_configfile('conf\\opensemanticsearch-enhancer-rdf')
        self.read_configfile('conf\\opensemanticsearch-connector-web')
        self.read_configfile('conf\\opensemanticsearch-connector-rss')

        # linux style filenames
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/etl-webadmin')
        self.read_configfile('/etc/opensemanticsearch/etl-custom')
        self.read_configfile('/etc/opensemanticsearch/enhancer-ocr')
        self.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
        self.read_configfile('/etc/opensemanticsearch/connector-web')
        self.read_configfile('/etc/opensemanticsearch/connector-rss')

    # Import Feed

    #
    # Import a RSS feed: If article has changed or not indexed, call download_and_index_to_solr()
    #
    def index(self, uri):

        result = True

        exporter = export_solr.export_solr()

        feed = feedparser.parse(uri)

        new_items = 0

        for item in feed.entries:

            articleuri = item.link

            #
            # Is new article or indexed in former runs?
            #

            doc_mtime = exporter.get_lastmodified(docid=articleuri)

            if doc_mtime:

                if self.verbose:
                    print(
                        "Article indexed before, so skip new indexing: {}".format(articleuri))

            else:
                # Download and Index the new or updated uri

                if self.verbose:
                    print("Article not in index: {}".format(articleuri))

                try:
                    partresult = Connector_Web.index(self, uri=articleuri)
                    if partresult == False:
                        result = False
                    new_items += 1

                except KeyboardInterrupt:
                    raise KeyboardInterrupt
                except BaseException as e:
                    sys.stderr.write(
                        "Exception while getting {} : {}".format(articleuri, e))

        if new_items:
            exporter.commit()

        return result

#
# If runned (not importet for functions) get parameters and start
#


if __name__ == "__main__":

    # todo: if no protocoll, use http://

    # get uri or filename from args
    from optparse import OptionParser
    parser = OptionParser("etl-rss [options] uri")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
                      default=None, help="Dont print status (filenames) while indexing")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")
    parser.add_option("-c", "--config", dest="config",
                      default=False, help="Config file")
    parser.add_option("-p", "--plugins", dest="plugins",
                      default=False, help="Plugins (comma separated)")
    parser.add_option("-w", "--outputfile", dest="outputfile",
                      default=False, help="Output file")

    (options, args) = parser.parse_args()

    if len(args) != 1:
        parser.error("No uri(s) given")

    connector = Connector_RSS()

    # add optional config parameters
    if options.config:
        connector.read_configfile(options.config)
    if options.outputfile:
        connector.config['outputfile'] = options.outputfile

    # set (or if config overwrite) plugin config
    if options.plugins:
        connector.config['plugins'] = options.plugins.split(',')

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    if options.quiet == False or options.quiet == True:
        connector.quiet = options.quiet

    for uri in args:
        connector.index(uri)


================================================
FILE: src/opensemanticetl/etl_sitemap.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import sys
import urllib.request
import xml.etree.ElementTree as ElementTree

from etl_web import Connector_Web
import tasks


class Connector_Sitemap(Connector_Web):

    def __init__(self, verbose=False, quiet=True):

        Connector_Web.__init__(self, verbose=verbose, quiet=quiet)

        self.quiet = quiet
        self.read_configfiles()
        self.queue = True

    def read_configfiles(self):
        #
        # include configs
        #

        # windows style filenames
        self.read_configfile('conf\\opensemanticsearch-connector')
        self.read_configfile('conf\\opensemanticsearch-enhancer-ocr')
        self.read_configfile('conf\\opensemanticsearch-enhancer-rdf')
        self.read_configfile('conf\\opensemanticsearch-connector-web')

        # linux style filenames
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/etl-webadmin')
        self.read_configfile('/etc/opensemanticsearch/etl-custom')
        self.read_configfile('/etc/opensemanticsearch/enhancer-ocr')
        self.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
        self.read_configfile('/etc/opensemanticsearch/connector-web')

    # Import sitemap

    # Index every URL of the sitemap

    def index(self, sitemap):

        if self.verbose or self.quiet == False:
            print("Downloading sitemap {}".format(sitemap))

        sitemap = urllib.request.urlopen(sitemap)

        et = ElementTree.parse(sitemap)

        root = et.getroot()

        # process subsitemaps if sitemapindex
        for sitemap in root.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}sitemap"):
            url = sitemap.findtext(
                '{http://www.sitemaps.org/schemas/sitemap/0.9}loc')

            if self.verbose or self.quiet == False:
                print("Processing subsitemap {}".format(url))

            self.index(url)

        #
        # get urls if urlset
        #

        urls = []

        # XML schema with namespace sitemaps.org
        for url in root.findall("{http://www.sitemaps.org/schemas/sitemap/0.9}url"):

            url = url.findtext(
                '{http://www.sitemaps.org/schemas/sitemap/0.9}loc')

            urls.append(url)

        # XML schema with namespace Google sitemaps
        for url in root.findall("{http://www.google.com/schemas/sitemap/0.84}url"):

            url = url.findtext(
                '{http://www.google.com/schemas/sitemap/0.84}loc')

            urls.append(url)

        # Queue or download and index the urls

        for url in urls:

            if self.queue:

                # add webpage to queue as Celery task
                try:

                    if self.verbose or self.quiet == False:
                        print("Adding URL to queue: {}".format(url))

                    result = tasks.index_web.apply_async(
                        kwargs={'uri': url}, queue='tasks', priority=5)

                except KeyboardInterrupt:
                    raise KeyboardInterrupt
                except BaseException as e:
                    sys.stderr.write(
                        "Exception while adding to queue {} : {}\n".format(url, e))

            else:

                # batchmode, index page after page ourselves

                try:
                    if self.verbose or self.quiet == False:
                        print("Indexing {}".format(url))

                    result = Connector_Web.index(self, uri=url)

                except KeyboardInterrupt:
                    raise KeyboardInterrupt
                except BaseException as e:
                    sys.stderr.write(
                        "Exception while indexing {} : {}\n".format(url, e))

#
# If runned (not imported for functions) get parameters and start
#


if __name__ == "__main__":

    # get uri or filename from args
    from optparse import OptionParser
    parser = OptionParser("etl-sitemap [options] uri")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
                      default=False, help="Don't print status (filenames) while indexing")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")
    parser.add_option("-b", "--batch", dest="batchmode", action="store_true",
                      default=None, help="Batch mode (Page after page instead of adding to queue)")
    parser.add_option("-c", "--config", dest="config",
                      default=False, help="Config file")
    parser.add_option("-p", "--plugins", dest="plugins",
                      default=False, help="Plugins (comma separated)")

    (options, args) = parser.parse_args()

    if len(args) != 1:
        parser.error("No sitemap uri(s) given")

    connector = Connector_Sitemap()

    # add optional config parameters
    if options.config:
        connector.read_configfile(options.config)

    # set (or if config overwrite) plugin config
    if options.plugins:
        connector.config['plugins'] = options.plugins.split(',')

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    if options.quiet == False or options.quiet == True:
        connector.quiet = options.quiet

    if options.batchmode == True:
        connector.queue = False

    for uri in args:
        connector.index(uri)


================================================
FILE: src/opensemanticetl/etl_sparql.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import os
import tempfile

from etl import ETL
from enhance_rdf import enhance_rdf

from SPARQLWrapper import SPARQLWrapper, XML, JSON


#
# download (part of) graph by SPARQL query from SPARQL endpoint to RDF file
#

def download_rdf_from_sparql_endpoint(endpoint, query):

    # read graph by construct query results from SPARQL endpoint
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(XML)
    results = sparql.query().convert()

    # crate temporary filename
    file = tempfile.NamedTemporaryFile()
    filename = file.name
    file.close()

    # export graph to RDF file
    results.serialize(destination=filename, format="xml")

    return filename


#
# Append values from SPARQL SELECT result to plain text list file
#

def sparql_select_to_list_file(endpoint, query, filename=None):

    # read graph by construct query results from SPARQL endpoint
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    if not filename:
        # crate temporary filename
        listfile = tempfile.NamedTemporaryFile(delete=False)
        filename = listfile.name
        listfile.close()

    listfile = open(filename, 'a', encoding="utf-8")

    for result in results["results"]["bindings"]:

        for variable in results["head"]["vars"]:
            if variable in result:
                if "value" in result[variable]:
                    value = result[variable]["value"]
                    value = value.strip()
                    if value:
                        listfile.write(result[variable]["value"] + "\n")

    listfile.close()

    return filename


class Connector_SPARQL(ETL):

    def __init__(self, verbose=False, quiet=True):

        ETL.__init__(self, verbose=verbose)

        self.read_configfiles()

        self.config["plugins"] = []

    def read_configfiles(self):
        #
        # include configs
        #

        # windows style filenames
        self.read_configfile('conf\\opensemanticsearch-connector')
        self.read_configfile('conf\\opensemanticsearch-enhancer-rdf')
        self.read_configfile('conf\\opensemanticsearch-connector-sparql')

        # linux style filenames
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/etl-custom')
        self.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
        self.read_configfile('/etc/opensemanticsearch/connector-sparql')

    # Import RDF from SPARQL result

    def index_rdf(self, endpoint, query):

        # download (part of) graph from endpoint to temporary rdf file
        rdffilename = download_rdf_from_sparql_endpoint(
            endpoint=endpoint, query=query)

        parameters = self.config.copy()

        # import the triples of rdf graph by RDF plugin
        enhancer = enhance_rdf()
        enhancer.etl_graph_file(
            docid=endpoint, filename=rdffilename, parameters=parameters)

        os.remove(rdffilename)

    # Import fields and values from SPARQL SELECT result

    def index_select(self, endpoint, query):

        # read graph by construct query results from SPARQL endpoint
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()

        i = 0
        for result in results["results"]["bindings"]:
            i += 1
            data = {}
            data['id'] = endpoint + "/" + query + "/" + str(i)

            for variable in results["head"]["vars"]:
                if variable in result:
                    if "value" in result[variable]:
                        data[variable] = result[variable]["value"]

            self.process(data=data)

    # Import SPARQL result

    def index(self, endpoint, query):

        if query.startswith("SELECT "):
            self.index_select(endpoint, query)
        else:
            self.index_rdf(endpoint, query)


#
# If runned (not imported for functions) get parameters and start
#

if __name__ == "__main__":

    # todo: if no protocoll, use http://

    # get uri or filename from args
    from optparse import OptionParser
    parser = OptionParser("etl-sparql [options] uri query")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")
    parser.add_option("-c", "--config", dest="config",
                      default=False, help="Config file")
    parser.add_option("-p", "--plugins", dest="plugins",
                      default=False, help="Plugins (comma separated)")

    (options, args) = parser.parse_args()

    if len(args) != 2:
        parser.error("Missing parameters endpoint URI and SPARQL query")

    connector = Connector_SPARQL()

    # add optional config parameters
    if options.config:
        connector.read_configfile(options.config)

    # set (or if config overwrite) plugin config
    if options.plugins:
        connector.config['plugins'] = options.plugins.split(',')

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    connector.index(endpoint=args[0], query=args[1])


================================================
FILE: src/opensemanticetl/etl_twitter_scraper.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import twint
import sys
from etl import ETL
from tasks import index_web

module = sys.modules["twint.storage.write"]

etl = ETL()
etl.read_configfile('/etc/opensemanticsearch/etl')
etl.read_configfile('/etc/opensemanticsearch/etl-webadmin')

etl.config['plugins'] = ['enhance_path', 'enhance_entity_linking', 'enhance_multilingual']
etl.config['facet_path_strip_prefix'] = ["http://www.", "https://www.", "http://", "https://"]


def index_tweet(obj, config):
    tweet = obj.__dict__

    parameters = {}
    parameters['id'] = tweet['link']

    data = {}
    data['content_type_ss'] = 'Tweet'
    data['content_type_group_ss'] = 'Social media post'

    data['author_ss'] = tweet['name']
    data['userid_s'] = tweet['user_id_str']
    data['username_ss'] = tweet['username']

    data['title_txt'] = tweet['tweet']
    data['content_txt'] = tweet['tweet']

    data['hashtag_ss'] = tweet['hashtags']

    if tweet['place']:
        data['location_ss'] = tweet['place']

    data['urls_ss'] = tweet['urls']

    data['mentions_ss'] = tweet['mentions']

    data['retweets_count_i'] = tweet['retweets_count']
    data['likes_count_i'] = tweet['likes_count']
    data['replies_count_i'] = tweet['replies_count']
    data['file_modified_dt'] = tweet['datestamp'] + 'T' + tweet['timestamp'] + 'Z'

    if config.Index_Linked_Webpages:
        if data['urls_ss']:
            for url in data['urls_ss']:
                index_web.apply_async(kwargs={'uri': url}, queue='open_semantic_etl_tasks', priority=5)

    try:
        etl.process(parameters, data)
    except BaseException as e:
        sys.stderr.write("Exception while indexing tweet {} : {}".format(parameters['id'], e))

# overwrite twint json export method with custom function index_tweet
module.Json = index_tweet


def index(search=None, username=None, Profile_full=False, limit=None, Index_Linked_Webpages=False):

    c = twint.Config()
    c.Hide_output = True
    c.Store_json = True
    c.Output = "tweets.json"

    if username:
        c.Username = username

    if search:
        c.Search = search

    if limit:
        c.Limit = limit

    c.Index_Linked_Webpages = Index_Linked_Webpages

    c.Profile_full = Profile_full

    if Profile_full:
        twint.run.Profile(c)
    else:
        twint.run.Search(c)

    etl.commit()


#
# If running from command line (not imported as library) get parameters and start
#

if __name__ == "__main__":

    # get uri or filename from args

    from optparse import OptionParser

    # get uri or filename from args

    parser = OptionParser("etl-twitter-scraper [options]")
    parser.add_option("-u", "--user", dest="username",
                      default=None, help="User")
    parser.add_option("-s", "--search", dest="search",
                      default=None, help="Search")
    parser.add_option("-l", "--limit", dest="limit",
                      default=None, help="Limit")

    (options, args) = parser.parse_args()

    if not options.username and not options.search:
        parser.error("No Username or search given")

    index(username=options.username, search=options.search, limit=options.limit)


================================================
FILE: src/opensemanticetl/etl_web.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import time
import urllib.request
import os
from lxml import etree
from dateutil import parser as dateparser

from etl_file import Connector_File


class Connector_Web(Connector_File):

    def __init__(self, verbose=False, quiet=True):

        Connector_File.__init__(self, verbose=verbose)

        self.quiet = quiet
        self.set_configdefaults()
        self.read_configfiles()

    def set_configdefaults(self):

        Connector_File.set_configdefaults(self)

        #
        # Standard config
        #
        # Do not edit config here! Overwrite options in /etc/opensemanticsearch/connector-web
        #

        # no filename to uri mapping
        self.config['uri_prefix_strip'] = False
        self.config['uri_prefix'] = False

        # strip in facet path
        self.config['facet_path_strip_prefix'] = ['http://www.',
                                                  'http://',
                                                  'https://www.',
                                                  'https://',
                                                  'ftp://'
                                                  ]

        self.config['plugins'] = [
            'filter_blacklist',
            'enhance_extract_text_tika_server',
            'enhance_detect_language_tika_server',
            'enhance_contenttype_group',
            'enhance_pst',
            'enhance_csv',
            'enhance_path',
            'enhance_zip',
            'enhance_warc',
            'enhance_extract_hashtags',
            'clean_title',
            'enhance_multilingual',

        ]

    def read_configfiles(self):
        #
        # include configs
        #

        # Windows style filenames
        self.read_configfile('conf\\opensemanticsearch-etl')
        self.read_configfile('conf\\opensemanticsearch-enhancer-rdf')
        self.read_configfile('conf\\opensemanticsearch-connector-web')

        # Linux style filenames
        self.read_configfile('/etc/opensemanticsearch/etl')
        self.read_configfile('/etc/opensemanticsearch/etl-webadmin')
        self.read_configfile('/etc/opensemanticsearch/etl-custom')
        self.read_configfile('/etc/opensemanticsearch/enhancer-rdf')
        self.read_configfile('/etc/opensemanticsearch/facets')
        self.read_configfile('/etc/opensemanticsearch/connector-web')
        self.read_configfile('/etc/opensemanticsearch/connector-web-custom')

    def read_mtime_from_html(self, tempfilename):
        mtime = False

        try:
            parser = etree.HTMLParser()
            tree = etree.parse(tempfilename, parser)

            try:
                mtimestring = tree.xpath(
                    "//meta[@http-equiv='last-modified']")[0].get("content")
            except:
                mtimestring = False

            try:
                mtimestring = tree.xpath(
                    "//meta[@name='last-modified']")[0].get("content")
            except:
                mtimestring = False
        except:
            mtimestring = False

        if mtimestring:

            if self.verbose:
                print("Modification time in HTML: ", mtimestring)

            try:
                mtime = time.strptime(mtimestring)
            except:
                mtime = False

            try:
                # parse datetime
                mtime = dateparser.parse(mtimestring)
                # convert datetime to time
                mtime = mtime.timetuple()

            except BaseException as e:
                print("Exception while reading last-modified from content: {}".format(e))

        if self.verbose:
            print("Extracted modification time: {}".format(mtime))

        return mtime

    def index(self, uri, last_modified=False, downloaded_file=False, downloaded_headers=None):
        if downloaded_headers is None:
            downloaded_headers = {}

        parameters = self.config.copy()

        if self.verbose:
            parameters['verbose'] = True

        data = {}

        uri = uri.strip()
        # if no protocol, add http://
        if not uri.lower().startswith("http://") and not uri.lower().startswith("https://") and not uri.lower().startswith("ftp://") and not uri.lower().startswith("ftps://"):
            uri = 'http://' + uri

        parameters['id'] = uri

        #
        # Download to tempfile, if not yet downloaded by crawler
        #

        if downloaded_file:
            tempfilename = downloaded_file
            headers = downloaded_headers

        else:

            if self.verbose:
                print("Downloading {}".format(uri))

            tempfilename, headers = urllib.request.urlretrieve(uri)

            if self.verbose:
                print("Download done")

        parameters['filename'] = tempfilename

        #
        # Modification time
        #
        mtime = False

        # get meta "last-modified" from content
        mtime = self.read_mtime_from_html(tempfilename)

        # use HTTP status modification time
        if not mtime:
            try:

                last_modified = headers['last-modified']

                if self.verbose:
                    print("HTTP Header Last-modified: {}".format(last_modified))

                mtime = dateparser.parse(last_modified)
                # convert datetime to time
                mtime = mtime.timetuple()

                if self.verbose:
                    print("Parsed date: {}".format(mtime))

            except:
                mtime = False
                print("Failed to parse HTTP header last-modified")

        # else HTTP create date
        if not mtime:
            try:
                date = headers['date']

                if self.verbose:
                    print("HTTP Header date: {}".format(date))

                mtime = dateparser.parse(date)
                # convert datetime to time
                mtime = mtime.timetuple()

                if self.verbose:
                    print("Parsed date: {}".format(mtime))

            except:
                mtime = False
                print("Failed to parse HTTP header date")

        # else now
        if not mtime:
            mtime = time.localtime()

        mtime_masked = time.strftime("%Y-%m-%dT%H:%M:%SZ", mtime)

        data['file_modified_dt'] = mtime_masked

        # Enrich data and write to search index
        parameters, data = self.process(parameters=parameters, data=data)

        os.remove(tempfilename)


#
# If runned (not importet for functions) get parameters and start
#

if __name__ == "__main__":

    # get uri or filename from args

    from optparse import OptionParser

    # get uri or filename from args

    parser = OptionParser("etl-web [options] URL")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
                      default=None, help="Do not print status (filenames) while indexing")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=None, help="Print debug messages")
    parser.add_option("-f", "--force", dest="force", action="store_true",
                      default=None, help="Force (re)indexing, even if no changes")
    parser.add_option("-c", "--config", dest="config",
                      default=False, help="Config file")
    parser.add_option("-p", "--plugins", dest="plugins",
                      default=False, help="Plugins (comma separated)")
    parser.add_option("-w", "--outputfile", dest="outputfile",
                      default=False, help="Output file")

    (options, args) = parser.parse_args()

    if len(args) != 1:
        parser.error("No URI(s) given")

    connector = Connector_Web()

    # add optional config parameters
    if options.config:
        connector.read_configfile(options.config)
    if options.outputfile:
        connector.config['outputfile'] = options.outputfile

    # set (or if config overwrite) plugin config
    if options.plugins:
        connector.config['plugins'] = options.plugins.split(',')

    if options.verbose == False or options.verbose == True:
        connector.verbose = options.verbose

    if options.quiet == False or options.quiet == True:
        connector.quiet = options.quiet

    if options.force == False or options.force == True:
        connector.config['force'] = options.force

    for uri in args:
        connector.index(uri)


================================================
FILE: src/opensemanticetl/etl_web_crawl.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import tempfile
import re

from scrapy.crawler import CrawlerProcess

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from tasks import index_web


class OpenSemanticETL_Spider(CrawlSpider):

    name = "Open Semantic ETL"

    def parse_item(self, response):

        # write downloaded body to temp file
        file = tempfile.NamedTemporaryFile(
            mode='w+b', delete=False, prefix="etl_web_crawl_")
        file.write(response.body)
        filename = file.name
        file.close()

        self.logger.info(
            'Adding ETL task for downloaded page or file from %s', response.url)

        downloaded_headers = {}
        if 'date' in response.headers:
                downloaded_headers['date'] = response.headers['date'].decode("utf-8", errors="ignore")
        if 'last-modified' in response.headers:
                downloaded_headers['last-modified'] = response.headers['last-modified'].decode("utf-8", errors="ignore")

        # add task to index the downloaded file/page by ETL web in Celery task worker
        index_web.apply_async(kwargs={'uri': response.url, 'downloaded_file': filename,
                                      'downloaded_headers': downloaded_headers}, queue='open_semantic_etl_tasks', priority=5)


def index(uri, crawler_type="PATH"):

    configfile = '/etc/opensemanticsearch/connector-web'

    # read config file
    config = {}
    exec(open(configfile).read(), locals())

    name = "Open Semantic ETL {}".format(uri)

    start_urls = [uri]

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    if crawler_type == "PATH":
        # crawl only the path
        filter_regex = re.escape(uri) + '*'
        rules = (
            Rule(LinkExtractor(allow=filter_regex, deny_extensions=config['webcrawler_deny_extensions']), callback='parse_item'),
        )
        process.crawl(OpenSemanticETL_Spider,
                      start_urls=start_urls, rules=rules, name=name)

    else:
        # crawl full domain and subdomains

        allowed_domain = uri
        # remove protocol prefix
        if allowed_domain.lower().startswith('http://www.'):
            allowed_domain = allowed_domain[11:]
        elif allowed_domain.lower().startswith('https://www.'):
            allowed_domain = allowed_domain[12:]
        elif allowed_domain.lower().startswith('http://'):
            allowed_domain = allowed_domain[7:]
        elif allowed_domain.lower().startswith('https://'):
            allowed_domain = allowed_domain[8:]

        # get only domain name without path
        allowed_domain = allowed_domain.split("/")[0]

        rules = (
            Rule(LinkExtractor(deny_extensions=config['webcrawler_deny_extensions']), callback='parse_item'),
        )
        process.crawl(OpenSemanticETL_Spider, start_urls=start_urls,
                      allowed_domains=[allowed_domain], rules=rules, name=name)

    # the start URL itselves shall be indexed, too, so add task to index the downloaded file/page by ETL web in Celery task worker
    index_web.apply_async(kwargs={'uri': uri}, queue='open_semantic_etl_tasks', priority=5)

    process.start()  # the script will block here until the crawling is finished


if __name__ == "__main__":

    # get uri or filename from args

    from optparse import OptionParser

    # get uri or filename from args

    parser = OptionParser("etl-web-crawl URL")

    (options, args) = parser.parse_args()

    if len(args) != 1:
        parser.error("No URL(s) given")

    for uri in args:
        index(uri)


================================================
FILE: src/opensemanticetl/export_elasticsearch.py
================================================
from elasticsearch import Elasticsearch


# Connect to Elastic Search

class export_elasticsearch(object):

    def __init__(self, config=None):
        if config is None:
            config = {}

        self.config = config

        if not 'index' in self.config:
            self.config['index'] = 'opensemanticsearch'

        if not 'verbose' in self.config:
            self.config['verbose'] = False

    #
    # Write data to Elastic Search
    #

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        self.config = parameters

        # post data
        self.update(parameters=parameters, data=data)

        return parameters, data

    # send the updated field data to Elastic Search
    def update(self, docid=None, data=None, parameters=None):
        if data is None:
            data = {}
        if parameters is None:
            parameters = {}

        if docid:
            parameters['id'] = docid
        else:
            docid = parameters['id']

        es = Elasticsearch()
        result = es.index(
            index=self.config['index'], doc_type='document', id=docid, body=data)

        return result

    # get last modified date for document
    def get_lastmodified(self, docid, parameters=None):
        if parameters is None:
            parameters = {}

        es = Elasticsearch()

        doc_exists = es.exists(
            index=self.config['index'], doc_type="document", id=docid)

        # if doc with id exists in index, read modification date
        if doc_exists:
            doc = es.get(index=self.config['index'], doc_type="document",
                         id=docid, _source=False, fields="file_modified_dt")
            last_modified = doc['fields']['file_modified_dt'][0]
        else:
            last_modified = None

        return last_modified

    # commits are managed by Elastic Search setup, so no explicit commit here
    def commit(self):
        return


================================================
FILE: src/opensemanticetl/export_json.py
================================================
import json


class export_json(object):

    def __init__(self, config=None):
        if config is None:
            config = {'verbose': False}

        self.config = config

    #
    # Json data
    #

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        # if outputfile write json to file
        if 'outputfile' in parameters:

            import io
            with io.open(parameters['outputfile'], 'w', encoding='utf-8') as f:
                f.write(json.dumps(data, ensure_ascii=False))
        else:  # else print json
            print(json.dumps(data))

        return parameters, data


================================================
FILE: src/opensemanticetl/export_neo4j.py
================================================
import os

from py2neo import Graph, Node, Relationship

#
# Export entities and connections to neo4j
#


class export_neo4j(object):

    def __init__(self, config=None):
        if config is None:
            config = {'verbose': False}
        self.config = config

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        if 'verbose' in parameters:
            self.config['verbose'] = parameters['verbose']

        # for this facets, do not add additional entity to connect with, but write to properties of the entity
        properties = ['content_type_ss',
                      'content_type_group_ss', 'language_ss', 'language_s']

        host = 'localhost'
        if 'neo4j_host' in parameters:
            host = parameters['neo4j_host']
        if os.getenv('OPEN_SEMANTIC_ETL_NEO4J_HOST'):
            host = os.getenv('OPEN_SEMANTIC_ETL_NEO4J_HOST')

        user = 'neo4j'
        if 'neo4j_user' in parameters:
            user = parameters['neo4j_user']

        password = 'neo4j'
        if 'neo4j_password' in parameters:
            password = parameters['neo4j_password']

        neo4j_auth = os.getenv('NEO4J_AUTH', '')
        if '/' in neo4j_auth:
            user, _, password = neo4j_auth.partition('/')

        graph = Graph(host=host, user=user, password=password)

        document_node = Node('Document', name=parameters['id'])

        if 'title' in data:
            document_node['title'] = data['title']

        # add properties from facets
        for entity_class in parameters['facets']:

            if entity_class in data:

                entity_class_label = parameters['facets'][entity_class]['label']

                if entity_class in properties:

                    document_node[entity_class_label] = data[entity_class]

        graph.merge(document_node, 'Document', 'name')

        # add / connect linked entities from facets

        for entity_class in parameters['facets']:

            if entity_class in data:

                entity_class_label = entity_class
                if parameters['facets'][entity_class]['label']:
                    entity_class_label = parameters['facets'][entity_class]['label']

                if not entity_class in properties:

                    relationship_label = entity_class_label

                    if entity_class in ['person_ss', 'organization_ss', 'location_ss']:
                        relationship_label = "Named Entity Recognition"

                    # convert to array, if single entity / not multivalued field
                    if isinstance(data[entity_class], list):
                        entities = data[entity_class]
                    else:
                        entities = [data[entity_class]]

                    for entity in entities:

                        if self.config['verbose']:
                            print("Export to Neo4j: Merging entity {} of class {}".format(
                                entity, entity_class_label))

                        # if not yet there, add the entity to graph
                        entity_node = Node(entity_class_label, name=entity)
                        graph.merge(entity_node, entity_class_label, 'name')

                        # if not yet there, add relationship to graph
                        relationship = Relationship(
                            document_node, relationship_label, entity_node)
                        graph.merge(relationship)

        return parameters, data


================================================
FILE: src/opensemanticetl/export_print.py
================================================
import pprint


class export_print(object):

    def __init__(self, config=None):
        if config is None:
            config = {'verbose': False}

        self.config = config

    #
    # Print data
    #

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        pprint.pprint(data)

        return parameters, data


================================================
FILE: src/opensemanticetl/export_queue_files.py
================================================
#
# Write filename to Celery queue for batching and parallel processing
#

from tasks import index_file


class export_queue_files(object):

    def __init__(self, config=None):
        if config is None:
            config = {'verbose': False}
        self.config = config

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        # add file to ETL queue with standard prioritization
        # but don't if only plugins not ran that should run later (which will be added to queue in step below)
        if not 'only_additional_plugins_later' in parameters:
            index_file.apply_async(
                kwargs={'filename': parameters['filename']}, queue='open_semantic_etl_tasks', priority=5)

        # add file to (lower prioritized) ETL queue with additional plugins or options which should run later after all files tasks of standard prioritized queue done
        # to run ETL of the file later again with additional plugins like OCR which need much time/resources while meantime all files are searchable by other plugins which need fewer resources
        if 'additional_plugins_later' in parameters or 'additional_plugins_later_config' in parameters:

            additional_plugins_later = parameters.get('additional_plugins_later', [])

            additional_plugins_later_config = parameters.get('additional_plugins_later_config', {})

            if len(additional_plugins_later) > 0 or len(additional_plugins_later_config) > 0:

                index_file.apply_async(kwargs={
                                       'filename': parameters['filename'], 'additional_plugins': additional_plugins_later, 'config': additional_plugins_later_config}, queue='open_semantic_etl_tasks', priority=1)

        return parameters, data


================================================
FILE: src/opensemanticetl/export_solr.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import os
import json
import requests
import sys
import time

import urllib.request
import urllib.parse

# Export data to Solr


class export_solr(object):

    def __init__(self, config=None):
        if config is None:
            config = {}

        self.config = config

        if os.getenv('OPEN_SEMANTIC_ETL_SOLR'):
            self.config['solr'] = os.getenv('OPEN_SEMANTIC_ETL_SOLR')

        if not 'solr' in self.config:
            self.config['solr'] = 'http://localhost:8983/solr/'

        if not 'index' in self.config:
            self.config['index'] = 'opensemanticsearch'

        self.solr = self.config['solr']
        self.core = self.config['index']

        if not 'verbose' in self.config:
            self.config['verbose'] = False

        self.verbose = self.config['verbose']

    #
    # Write data to Solr
    #

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        # if not there, set config defaults
        if 'verbose' in parameters:
            self.verbose = parameters['verbose']

        if self.verbose:            
            print ('Starting Exporter: Solr')

        if 'solr' in parameters:
            self.solr = parameters['solr']
            if not self.solr.endswith('/'):
                self.solr += '/'

        if 'index' in parameters:
            self.core = parameters['index']

        add = parameters.get('add', False)

        fields_set = parameters.get('fields_set', [])

        commit = parameters.get('commit', None)

        if not 'id' in data:
            data['id'] = parameters['id']

        # post data to Solr
        do_export = True

        # but do not post, if only id (which will contain no add or set commands for fields and will be seen as overwrite for whole document)
        if len(data) < 2:
            if self.verbose:
                print('Not exported to Solr because no data or only the ID.')
            do_export = False

        # and do not post, if yet posted before (exporter not exporter, but in plugin queue, f.e. in multi stage processing before adding task to queue)
        if 'etl_export_solr_b' in data:
            if self.verbose:
                print('Not exported to Solr because no data or yet exported in this ETL run, because exporter was runned as plugin.')
                do_export = False

        if do_export:
            self.update(data=data, add=add, fields_set=fields_set, commit=commit)


        return parameters, data

    # update document in index, set fields in data to updated or new values or add new/additional values
    # if no document yet, it will be added
    def update(self, data, add=False, fields_set=(), commit=None):

        update_fields = {}

        for fieldname in data:
            if fieldname == 'id':
                update_fields['id'] = data['id']
            else:
                update_fields[fieldname] = {}

                if add and not fieldname in fields_set:
                    # add value to existent values of the field
                    update_fields[fieldname]['add-distinct'] = data[fieldname]
                else:
                    # if document there with values for this fields, the existing values will be overwritten with new values
                    update_fields[fieldname]['set'] = data[fieldname]

        self.post(data=update_fields, commit=commit)

    def post(self, data=None, docid=None, commit=None):
        if data is None:
            data = {}

        solr_uri = self.solr + self.core + '/update'

        if docid:
            data['id'] = docid

        datajson = '[' + json.dumps(data) + ']'

        params = {}

        if commit:
            params['commit'] = 'true'

        if self.verbose:
            print("Sending update request to {}".format(solr_uri))
            print(datajson)


        retries = 0
        retrytime = 1
        # wait time until next retry will be doubled until reaching maximum of 120 seconds (2 minutes) until next retry
        retrytime_max = 120
        no_connection = True

        while no_connection:
            try:
                if retries > 0:
                    print('Will retry to connect to Solr in {} second(s).'.format(retrytime))
                    time.sleep(retrytime)
                    retrytime = retrytime * 2
                    if retrytime > retrytime_max:
                        retrytime = retrytime_max

                r = requests.post(solr_uri, data=datajson, params=params, headers={'Content-Type': 'application/json'})

                # if bad status code, raise exception
                r.raise_for_status()

                if retries > 0:
                    print('Successfull retried to connect Solr.')

                no_connection = False

            except KeyboardInterrupt:
                raise KeyboardInterrupt

            except requests.exceptions.ConnectionError as e:

                    retries += 1

                    sys.stderr.write("Connection to Solr failed (will retry in {} seconds). Exception: {}\n".format(retrytime, e))

            except requests.exceptions.HTTPError as e:
                if e.response.status_code == 503:

                    retries += 1

                    sys.stderr.write("Solr temporary unavailable (HTTP status code 503). Will retry in {} seconds). Exception: {}\n".format(retrytime, e))

                else:
                    no_connection = False

                    sys.stderr.write('Error while posting data to Solr: {}'.format(e))

                    raise(e)

            except BaseException as e:
                no_connection = False

                sys.stderr.write('Error while posting data to Solr: {}'.format(e))

                raise(e)


    # tag a document by adding new value to field
    def tag(self, docid=None, field=None, value=None, data=None):
        if data is None:
            data = {}

        data_merged = data.copy()

        if docid:
            data_merged['id'] = docid

        if field:
            if field in data_merged:
                # if not list, convert to list
                if not isinstance(data_merged[field], list):
                    data_merged[field] = [data_merged[field]]
                # add value to field
                data_merged[field].append(value)
            else:
                data_merged[field] = value

        result = self.update(data=data_merged, add=True)

        return result

    # search for documents with query and without tag and update them with the tag
    def update_by_query(self, query, field=None, value=None, data=None, queryparameters=None):
        if data is None:
            data = {}

        import pysolr

        count = 0

        solr = pysolr.Solr(self.solr + self.core)

        #
        # extend query: do not return documents, that are tagged
        #

        query_marked_before = ''

        if field:
            query_marked_before = field + ':"' + solr_mask(value) + '"'

        # else extract field and value from data to build query of yet tagged docs to exclude

        for fieldname in data:

            if isinstance(data[fieldname], list):

                for value in data[fieldname]:

                    if query_marked_before:
                        query_marked_before += " AND "

                    query_marked_before += fieldname + \
                        ':"' + solr_mask(value) + '"'
            else:

                value = data[fieldname]
                if query_marked_before:
                    query_marked_before += " AND "

                query_marked_before += fieldname + \
                    ':"' + solr_mask(value) + '"'

        solrparameters = {
            'fl': 'id',
            'defType': 'edismax',
            'rows': 10000000,
        }

        # add custom Solr parameters (if the same parameter, overwriting the obove defaults)
        if queryparameters:
            solrparameters.update(queryparameters)

        if query_marked_before:
            # don't extend query but use filterquery for more performance (cache) on aliases
            solrparameters["fq"] = 'NOT (' + query_marked_before + ')'

        if self.verbose:
            print("Solr query:")
            print(query)
            print("Solr parameters:")
            print(solrparameters)

        results = solr.search(query, **solrparameters)

        for result in results:
            docid = result['id']

            if self.verbose:
                print("Tagging {}".format(docid))

            self.tag(docid=docid, field=field, value=value, data=data)

            count += 1

        return count

    def get_data(self, docid, fields):

        uri = self.solr + self.core + '/get?id=' + \
            urllib.parse.quote(docid) + '&fl=' + ','.join(fields)

        request = urllib.request.urlopen(uri)
        encoding = request.info().get_content_charset('utf-8')
        data = request.read()
        request.close()

        solr_doc = json.loads(data.decode(encoding))

        data = None
        if 'doc' in solr_doc:
            data = solr_doc['doc']

        return data

    def commit(self):

        uri = self.solr + self.core + '/update?commit=true'
        if self.verbose:
            print("Committing to {}".format(uri))
        request = urllib.request.urlopen(uri)
        request.close()

    def get_lastmodified(self, docid):
        # convert mtime to solr format
        solr_doc_mtime = None

        solr_doc = self.get_data(docid=docid, fields=["file_modified_dt"])

        if solr_doc:
            if 'file_modified_dt' in solr_doc:
                solr_doc_mtime = solr_doc['file_modified_dt']

            # todo: for each plugin
#			solr_meta_mtime = False
#			if 'meta_modified_dt' in solr_doc['doc']:
#				solr_meta_mtime = solr_doc['doc']['meta_modified_dt']

        return solr_doc_mtime

    def delete(self, parameters, docid=None, query=None,):
        import pysolr

        if 'solr' in parameters:
            self.solr = parameters['solr']
            if not self.solr.endswith('/'):
                self.solr += '/'

        if 'index' in parameters:
            self.core = parameters['index']

        solr = pysolr.Solr(self.solr + self.core)

        if docid:
            result = solr.delete(id=docid)

        if query:
            result = solr.delete(q=query)

        return result

    #
    # append synonyms by Solr REST API for managed resources
    #

    def append_synonyms(self, resourceid, synonyms):

        url = self.solr + self.core + '/schema/analysis/synonyms/' + resourceid
        headers = {'content-type': 'application/json'}

        r = requests.post(url=url, data=json.dumps(synonyms), headers=headers)


def solr_mask(string_to_mask, solr_specialchars='\+-&|!(){}[]^"~*?:/'):

    masked = string_to_mask
    # mask every special char with leading \
    for char in solr_specialchars:
        masked = masked.replace(char, "\\" + char)

    return masked


================================================
FILE: src/opensemanticetl/filter_blacklist.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re


def is_in_lists(listfiles, value, match=None):

    result = False

    for listfile in listfiles:

        try:
            if is_in_list(filename=listfile, value=value, match=match):
                result = True
                break

        except BaseException as e:
            print("Exception while checking blacklist {}:".format(listfile))
            print(e.args[0])

    return result


#
# is a value in a textfile with a list
#
def is_in_list(filename, value, match=None):

    result = False
    listfile = open(filename)

    # search all the lines
    for line in listfile:
        line = line.strip()

        # ignore empty lines and comment lines (starting with #)
        if line and not line.startswith("#"):

            if match == 'prefix':
                if value.startswith(line):
                    result = True
            elif match == 'suffix':
                if value.endswith(line):
                    result = True
            elif match == 'regex':
                if re.search(line, value):
                    result = True

            else:
                if line == value:
                    result = True

            if result:

                # we dont have to check rest of list
                break

    listfile.close()

    return result


#
# add to configured facet, if entry in list is in text
#

class filter_blacklist(object):

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        blacklisted = False

        verbose = False
        if 'verbose' in parameters:
            if parameters['verbose']:
                verbose = True

        uri = parameters['id']

        # if blacklist type configurated in parameters, check this blacklists for URI

        if 'blacklist_prefix' in parameters:

            if is_in_lists(listfiles=parameters['blacklist_prefix'], value=uri, match="prefix"):
                blacklisted = True

        if not blacklisted and 'blacklist_suffix' in parameters:

            if is_in_lists(listfiles=parameters['blacklist_suffix'], value=uri, match="suffix"):
                blacklisted = True

        if not blacklisted and 'blacklist_regex' in parameters:

            if is_in_lists(listfiles=parameters['blacklist_regex'], value=uri, match="regex"):
                blacklisted = True

        if not blacklisted and 'blacklist' in parameters:

            if is_in_lists(listfiles=parameters['blacklist'], value=uri):
                blacklisted = True

        # check whitelists for URI, if blacklisted

        if blacklisted and 'whitelist_prefix' in parameters:
            if is_in_lists(listfiles=parameters['whitelist_prefix'], value=uri, match="prefix"):
                blacklisted = False

        if blacklisted and 'whitelist_suffix' in parameters:
            if is_in_lists(listfiles=parameters['whitelist_suffix'], value=uri, match="suffix"):
                blacklisted = False

        if blacklisted and 'whitelist_regex' in parameters:
            if is_in_lists(listfiles=parameters['whitelist_regex'], value=uri, match="regex"):
                blacklisted = False

        if blacklisted and 'whitelist' in parameters:
            if is_in_lists(listfiles=parameters['whitelist'], value=uri):
                blacklisted = False

        # if blacklisted and not matched whitelist, return parameter break, so no further processing
        if blacklisted:
            parameters['break'] = True

        return parameters, data


================================================
FILE: src/opensemanticetl/filter_file_not_modified.py
================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import datetime
import sys
import importlib

#
# do not index (set parameters["break"] = True), if yet former ETL with all configured plugins / yet indexed
#


class filter_file_not_modified(object):

    def __init__(self):

        self.verbose = False
        self.quiet = False

        # if a critical plugin failed in former ETL / indexed document, reindex file to retry
        self.force_reindex_if_former_etl_plugin_errors = [
            'enhance_extract_text_tika_server']

    def process(self, parameters=None, data=None):
        if parameters is None:
            parameters = {}
        if data is None:
            data = {}

        if 'verbose' in parameters:
            if parameters['verbose']:
                self.verbose = True

        if 'quiet' in parameters:
            self.quiet = parameters['quiet']

        filename = parameters['filename']

        force = False

        if 'force' in parameters:
            force = parameters['force']

        # check if file size and file date are the same in DB
        # if exist delete protocol prefix file://
        if filename.startswith("file://"):
            filename = filename.replace("file://", '', 1)

        # if relative path change to absolute path
        filename = os.path.abspath(filename)

        # get modification time from file
        file_mtime = os.path.getmtime(filename)

        # get id
        docid = parameters['id']

        export = False
        indexed_doc_mtime = None
        plugins_failed = []
        critical_plugins_failed = []
        plugins_runned = []
        plugins_not_runned = []
        additional_plugins_later_not_runned = []
        do_not_reindex_because_plugin_yet_not_processed = []

        # use abstracted function from exporter module to get last modification time of file in index
        if 'export' in parameters:
            export = parameters['export']
            module = importlib.import_module(export)
            objectreference = getattr(module, export)
            exporter = objectreference(parameters)

            # get modtime and ETL errors from document saved in index
            metadatafields = ['file_modified_dt', 'etl_error_plugins_ss']

            # get plugin status fields
            for configured_plugin in parameters['plugins']:
                if not configured_plugin == 'export_queue_files' and not configured_plugin == parameters['export']:
                    metadatafields.append('etl_' + configured_plugin + '_b')
            if 'additional_plugins_later' in parameters:
                for configured_plugin in parameters['additional_plugins_later']:
                    metadatafields.append('etl_' + configured_plugin + '_b')

            # get config option status field for OCR
            if 'ocr' in parameters:
                if parameters['ocr']:
                    metadatafields.append(
                        'etl_enhance_extract_text_tika_server_ocr_enabled_b')
            if 'additional_plugins_later_config' in parameters:
                if 'ocr' in parameters['additional_plugins_later_config']:
                    if parameters['additional_plugins_later_config']['ocr']:
                        metadatafields.append(
                            'etl_enhance_extract_text_tika_server_ocr_enabled_b')

            if 'do_not_reindex_because_plugin_yet_not_processed' in parameters:
                do_not_reindex_because_plugin_yet_not_processed=parameters['do_not_reindex_because_plugin_yet_not_processed']

            # read yet indexed metadata, if there
            indexed_metadata = exporter.get_data(
                docid=docid, fields=metadatafields)

            if indexed_metadata:
                if 'file_modified_dt' in indexed_metadata:
                    indexed_doc_mtime = indexed_metadata['file_modified_dt']
                if 'etl_error_plugins_ss' in indexed_metadata:
                    plugins_failed = indexed_metadata['etl_error_plugins_ss']

        # mask file_mtime for comparison in same format than in Lucene index
        file_mtime_masked = datetime.datetime.fromtimestamp(
            file_mtime).strftime("%Y-%m-%dT%H:%M:%SZ")

        # Is it a new file (not indexed, so the initial None different to filemtime)
        # or modified (also doc_mtime <> file_mtime of file)?

        if indexed_doc_mtime == file_mtime_masked:

            # Doc was found in index and field moddate of solr doc same as files mtime
            # so file was indexed before and is unchanged

            # all now configured plugins processed in former ETL/their analysis is in index?
            for configured_plugin in parameters['plugins']:
                if not configured_plugin == 'export_queue_files' and not configured_plugin == parameters['export']:
                    plugin_runned = indexed_metadata.get('etl_' + configured_plugin + '_b', False)
                    if plugin_runned:
                        plugins_runned.append(configured_plugin)
                    else:
                        if not configured_plugin in do_not_reindex_because_plugin_yet_not_processed:
                            plugins_not_runned.append(configured_plugin)

            if 'additional_plugins_later' in parameters:
                for configured_plugin in parameters['additional_plugins_later']:
                    plugin_runned = indexed_metadata.get('etl_' + configured_plugin + '_b', False)
                    if plugin_runned:
                        plugins_runned.append(configured_plugin)
                    else:
                        if not configured_plugin in do_not_reindex_because_plugin_yet_not_processed:
                            additional_plugins_later_not_runned.append(configured_plugin)

            # Tika OCR was enabled in former ETL/their analysis is in index?
            if 'ocr' in parameters:
                if parameters['ocr']:
                    plugin_runned = indexed_metadata.get('etl_enhance_extract_text_tika_server_ocr_enabled_b', False)
                    if plugin_runned:
                        plugins_runned.append('enhance_extract_text_tika_server_ocr_enabled')
                    else:
                        if not configured_plugin in do_not_reindex_because_plugin_yet_not_processed:
                            plugins_not_runned.append('enhance_extract_text_tika_server_ocr_enabled')

            if 'additional_plugins_later_config' in parameters:
                if 'ocr' in parameters['additional_plugins_later_config']:
                    if parameters['additional_plugins_later_config']['ocr']:
                        plugin_runned = indexed_metadata.get('etl_enhance_extract_text_tika_server_ocr_enabled_b', False)
                        if plugin_runned:
                            plugins_runned.append('enhance_extract_text_tika_server_ocr_enabled')
                        else:
                            if not configured_plugin in do_not_reindex_because_plugin_yet_not_processed:
                                additional_plugins_later_not_runned.append('enhance_extract_text_tika_server_ocr_enabled')

            for critical_plugin in self.force_reindex_if_former_etl_plugin_errors:
                if critical_plugin in plugins_failed:
                    critical_plugins_failed.append(critical_plugin)

            if len(plugins_not_runned) > 0 or len(additional_plugins_later_not_runned) > 0:

                doindex = True

                # print status
                if self.verbose or self.quiet == False:
                    try:
                        print('Repeating indexing of unchanged file because (additional configured) plugin(s) or options {} not ran yet: {}'.format(
                            plugins_not_runned + additional_plugins_later_not_runned, filename))
                    except:
                        sys.stderr.write(
                            "Repeating indexing of unchanged file because former fail of critical plugin, but exception while printing message (problem with encoding of filename or console? Is console set to old ASCII standard instead of UTF-8?)")

                if len(plugins_not_runned) == 0:
                    parameters['only_additional_plugins_later'] = True

            # a critical plugin failed in former ETL
            elif len(critical_plugins_failed) > 0:

                doindex = True

                # print status
                if self.verbose or self.quiet == False:
                    try:
                        print('Repeating indexing of unchanged file because critical plugin(s) {} failed in former run: {}'.format(
                            critical_plugins_failed, filename))
                    except:
                        sys.stderr.write(
                            "Repeating indexing of unchanged file because critical plugin(s) failed in former run, but exception while printing message (problem with encoding of filename or console? Is console set to old ASCII standard instead of UTF-8?)")

            # If force option, do further processing even if unchanged
            elif force:

                doindex = True

                # print status
                if self.verbose or self.quiet == False:
                    try:
                        print(
                            'Forced indexing of unchanged file: {}'.format(filename))
                    except:
                        sys.stderr.write(
                            "Forced indexing of unchanged file but exception while printing message (problem with encoding of filename or console? Is console set to old ASCII standard instead of UTF-8?)")

            else:

                doindex = False

                # print status
                if self.verbose:
                    try:
                        print("Not indexing unchanged file: {}".format(filename))
                    except:
                        sys.stderr.write(
                            "Not indexing unchanged file but exception while printing message (problem with encoding of filename or console?)")

        else:  # doc not found in index or other/old modification time in index

            doindex = True

            # print status, if new document
            if self.verbose or self.quiet == False:

                if indexed_doc_mtime == None:
                    try:
                        print("Indexing new file: {}".format(filename))
                    except:
                        sys.stderr.write(
                            "Indexing new file but exception while printing message (problem with encoding of filename or console?)")
                else:
                    try:
                        print('Indexing modified file: {}'.format(filename))
                    except:
                        sys.stderr.write(
                            "Indexing modified file. Exception while printing filename (problem with encoding of filename or console?)\n")

        # if not modified and no critical ETL errors, stop ETL process, because all done on last run
        if not doindex:
            parameters['break'] = True
        else:
            # reset plugin status of plugins of next stage
            # so reprocessing of updated data works by tasks in later stages,
            # which else would have plugin status processed
            # from first/last processing of old version of content

            commit = False
            if len(plugins_runned) > 0:
                
                for runned_plugin in plugins_runned:
                    if not runned_plugin in [parameters['export'], 'enhance_mapping_id', 'filter_blacklist', 'filter_file_not_modified']:
                        data['etl_' + runned_plugin + '_b'] = False
                        commit = True

            # immediately commit (else Solr autocommit after some time) of etl status reset(s) in exporter before adding new ETL tasks which need the status for plugin filter_file_not_modified
            if commit:
                parameters['commit'] = True

        return parameters, data


================================================
FILE: src/opensemanticetl/move_indexed_file.py
================================================
#!/usr/bin/env python3

import urllib.request
import urllib.parse
import json
from itertools import starmap


def move_files(host: str, moves: dict, prefix=""):
    """Moves files within the index (not physically).

    Example of usage:
    host = "http://solr:8983/solr/opensemanticsearch/"
    move_files(host, {"/b2": "/book2",
                      "/b1": "/folder/book1"}, prefix="file://")


    :host: Url to the solr instance
    :moves: A dict of the form {src: dest, ...}, where src is
            the source path and dest is the destination path.
    """
    src = moves.keys()
    indexed_data = get_files(host, map(append_prefix(prefix), src))
    # In the following we have to remap the destination path to
    # the individual metadata entries, since the ordering of
    # the query result and query may differ:
    moved_data = starmap(change_path(prefix),
                         zip(indexed_data,
                             map(dict_map(moves),
                                 map(extract_path,
                                     indexed_data))))
    request_payload = prepare_payload(
        moved_data, (d["id"] for d in indexed_data))
    post(host, request_payload)


def move_dir(host: str, src: str, dest: str, prefix=""):
    """Moves directories within the index (not physically).

    Example of usage:
    host = "http://solr:8983/solr/opensemanticsearch/"
    move_dir(host, src=/docs/a/, dest=/docs/b, prefix="file://")


    :host: Url to the solr instance
    :src: Source directory
    :dest: Destination directory
    """
    indexed_data = get_files_in_dir(host, src)
    moved_data = map(change_dir(prefix, src=src, dest=dest),
                     indexed_data)
    request_payload = prepare_payload(
        moved_data, (d["id"] for d in indexed_data))
    post(host, request_payload)


def change_path(prefix: str):
    """Returns a mapping function to be used with starmap
    """
    def change(data: dict, dest: str) -> dict:
        """Creates a modified version of data

        :data: The indexed metadata of the moved file
        :dest: The destination path
        """
        dest_components = dest.strip("/").split("/")
        return _change_path(data, dest_components, prefix=prefix)
    return change


def change_dir(prefix: str, src: str, dest: str):
    """Returns a mapping function to be used with map
    """
    dest_components = dest.strip("/").split("/")
    src_path_components = src.strip("/").split("/")

    def change(data: dict) -> dict:
        """Creates a modified version of data

        :data: The indexed metadata of the moved file
        :dest: The destination path
        """
        indexed_components = extract_path_components(data)
        # Attention: zip consumes the generator up to the number
        # of elements in indexed_components. If you switch the two
        # arguments of zip, an additional element will be consumed
        # from indexed_components, as zip will perform a next on
        # its first argument to see if the iterable is exhausted.
        for idx_component, src_component in zip(src_path_components,
                                                indexed_components):
            if idx_component != src_component:
                raise ValueError(
                    "Path component of file and input file differs: '"
                    + idx_component + "' <-> '" + src_component + "'")
        return _change_path(data, dest_components + list(indexed_components),
                            prefix=prefix)
    return change


def _change_path(data: dict, dest_components: tuple, prefix: str = "") -> dict:
    """Creates a modified version of data

    :data: The indexed metadata of the moved file
    :dest_components: The destination path split into components
    """
    moved_data = data.copy()
    del moved_data["_version_"]
    moved_data["id"] = prefix + "/" + "/".join(dest_components)
    *dest_dir_components, base_name = dest_components
    moved_data.update({"path{}_s".format(i): component
                       for i, component in enumerate(dest_dir_components)})
    moved_data["path_basename_s"] = base_name
    n = len(dest_dir_components)
    while True:
        if moved_data.pop("path{}_s".format(n), None) is None:
            break
        n += 1
    return moved_data


def prepare_payload(adds, delete_ids):
    """Takes metadata to be added to the index and ids to be deleted
    from the index. Creates the corresponding solr json request payload
    """
    payload = {DuplicateKey("add"): {"doc": doc} for doc in adds}
    payload["delete"] = [
        {"id": id_} for id_ in delete_ids]
    return payload


class DuplicateKey(str):
    """Allows dicts having multiple identical keys"""

    def __hash__(self):
        return id(self)


def extract_path(data: dict) -> str:
    """Extracts the path of the metadata
    """
    return "/" + "/".join(extract_path_components(data))


def extract_path_components(data: dict):
    """Extracts the path of the metadata in form of a
    generator yielding the components of the path
    """
    i = 0
    while True:
        component = data.get("path{}_s".format(i))
        if component is None:
            break
        yield component
        i += 1
    yield data["path_basename_s"]


def dict_map(mapping: dict):
    """Converts a dict into a function (for usage with map)"""
    def _map(s):
        return mapping[s]
    return _map


def append_prefix(prefix: str):
    """A mapping function to be used with :map:"""
    def append(s: str):
        return prefix + s
    return append


def get_files(host: str, ids: list) -> list:
    """Queries solr, searches for files whose id is in :ids:"""
    return get(host,
               "(" + ", ".join(
                   map('id:"{}"'.format, ids)) + ")"
               )


def get_files_in_dir(host: str, path: str) -> list:
    """Queries solr, searches for files in the folder :path:"""
    path_components = path.strip("/").split("/")
    return get(host,
               " AND ".join(
                   starmap(
                       'path{}_s:"{}"'.format, enumerate(path_components)
                   )))


def get(host: str, query: str) -> list:
    return sum(get_pages(host, query), [])


def get_pages(host: str, query: str, limit=50):
    """An iterator over the pages of a solr request response"""
    start = 0
    n_docs = limit
    query_url_template = host + "select?start={}&rows={}&q={}".format(
        "{}", limit, urllib.parse.quote(query))
    while start < n_docs:
        response = urllib.request.urlopen(
            query_url_template.format(start))
        data = json.loads(response.read().decode())["response"]
        n_docs = data["numFound"]
        start += limit
        yield data["docs"]


def post(host: str, data: dict):
    request = urllib.request.Request(
        host + "update/json?commit=true",
        data=json.dumps(data).encode(),
        headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)


================================================
FILE: src/opensemanticetl/requirements.txt
================================================
celery
feedparser
lxml
numerizer
py2neo
pycurl
pyinotify
pysolr
python-dateutil
requests
rdflib
scrapy
SPARQLWrapper
tika
twint
warcio


================================================
FILE: src/opensemanticetl/tasks.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

#
# Queue tasks for batch processing and parallel processing
#

import os
import time
from celery import Celery
from kombu import Queue, Exchange

# ETL connectors
from etl import ETL
from etl_delete import Delete
from etl_file import Connector_File
from etl_web import Connector_Web
from etl_rss import Connector_RSS


verbose = True
quiet = False

broker = 'amqp://localhost'
if os.getenv('OPEN_SEMANTIC_ETL_MQ_BROKER'):
    broker = os.getenv('OPEN_SEMANTIC_ETL_MQ_BROKER')

app = Celery('etl.tasks', broker=broker)

app.conf.task_queues = [Queue('open_semantic_etl_tasks', Exchange(
    'open_semantic_etl_tasks'), routing_key='open_semantic_etl_tasks', queue_arguments={'x-max-priority': 100})]

app.conf.worker_max_tasks_per_child = 1
app.conf.worker_prefetch_multiplier = 1
app.conf.task_acks_late = True


# Max parallel tasks (Default: Use as many parallel ETL tasks as CPUs available).
# Warning: Some tools called by ETL plugins use multithreading, too,
# so used CPUs/threads can be more than that setting!

if os.getenv('OPEN_SEMANTIC_ETL_CONCURRENCY'):
    app.conf.worker_concurrency = int(os.getenv('OPEN_SEMANTIC_ETL_CONCURRENCY'))


etl_delete = Delete()
etl_web = Connector_Web()
etl_rss = Connector_RSS()


#
# Delete document with URI from index
#

@app.task(name='etl.delete')
def delete(uri):
    etl_delete.delete(uri=uri)


#
# Index a file
#

@app.task(name='etl.index_file')
def index_file(filename, additional_plugins=(), wait=0, commit=False, config=None):

    if wait:
        time.sleep(wait)

    etl_file = Connector_File()

    # set alternate config options (will overwrite config options from config file)
    if config:
        for option in config:
            etl_file.config[option] = config[option]

    etl_file.index_file(filename=filename,
                        additional_plugins=additional_plugins)

    if commit:
        etl_file.commit()

#
# Index file directory
#


@app.task(name='etl.index_filedirectory')
def index_filedirectory(filename, config=None):

    from etl_filedirectory import Connector_Filedirectory

    etl_filedirectory = Connector_Filedirectory()

    # set alternate config options (will overwrite config options from config file)
    if config:
        for option in config:
            etl_filedirectory.config[option] = config[option]

    result = etl_filedirectory.index(filename)
    etl_filedirectory.commit()

    return result


#
# Index a webpage
#
@app.task(name='etl.index_web')
def index_web(uri, wait=0, downloaded_file=False, downloaded_headers=None):

    if wait:
        time.sleep(wait)

    result = etl_web.index(uri, downloaded_file=downloaded_file,
                           downloaded_headers=downloaded_headers)

    return result


#
# Index full website
#

@app.task(name='etl.index_web_crawl')
def index_web_crawl(uri, crawler_type="PATH"):

    import etl_web_crawl

    etl_web_crawl.index(uri, crawler_type)


#
# Index webpages from sitemap
#

@app.task(name='etl.index_sitemap')
def index_sitemap(uri):

    from etl_sitemap import Connector_Sitemap

    connector_sitemap = Connector_Sitemap()

    result = connector_sitemap.index(uri)

    return result


#
# Index RSS Feed
#

@app.task(name='etl.index_rss')
def index_rss(uri):

    result = etl_rss.index(uri)

    return result


#
# Enrich with / run plugins
#

@app.task(name='etl.enrich')
def enrich(plugins, uri, wait=0):

    if wait:
        time.sleep(wait)

    etl = ETL()
    etl.read_configfile('/etc/opensemanticsearch/etl')
    etl.read_configfile('/etc/opensemanticsearch/enhancer-rdf')

    etl.config['plugins'] = plugins.split(',')

    filename = uri

    # if exist delete protocoll prefix file://
    if filename.startswith("file://"):
        filename = filename.replace("file://", '', 1)

    parameters = etl.config.copy()

    parameters['id'] = uri
    parameters['filename'] = filename

    parameters, data = etl.process(parameters=parameters, data={})

    return data


@app.task(name='etl.index_twitter_scraper')
def index_twitter_scraper(search=None, username=None, Profile_full=False, limit=None, Index_Linked_Webpages=False):

    import opensemanticetl.etl_twitter_scraper

    opensemanticetl.etl_twitter_scraper.index(username=username, search=search, limit=limit, Profile_full=Profile_full, Index_Linked_Webpages=Index_Linked_Webpages)


#
# Read command line arguments and start
#

# if running (not imported to use its functions), run main function
if __name__ == "__main__":

    from optparse import OptionParser

    parser = OptionParser("etl-tasks [options]")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
                      default=False, help="Don\'t print status (filenames) while indexing")
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=False, help="Print debug messages")

    (options, args) = parser.parse_args()

    if options.verbose == False or options.verbose == True:
        verbose = options.verbose
        etl_delete.verbose = options.verbose
        etl_web.verbose = options.verbose
        etl_rss.verbose = options.verbose

    if options.quiet == False or options.quiet == True:
        quiet = options.quiet

    app.worker_main(['worker'])


================================================
FILE: src/opensemanticetl/test_enhance_detect_language_tika_server.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

import enhance_detect_language_tika_server

class Test_enhance_detect_language_tika_server(unittest.TestCase):

    def test(self):

        enhancer = enhance_detect_language_tika_server.enhance_detect_language_tika_server()

        # English
        parameters, data = enhancer.process(data={'content_txt': 'This sentence is written in english language.'})
        self.assertEqual(data['language_s'], 'en')

        # German
        parameters, data = enhancer.process(data={'content_txt': 'Dies ist ein Satz in der Sprache Deutsch.'})
        self.assertEqual(data['language_s'], 'de')

       
if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_extract_email.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

import enhance_extract_email

class Test_enhance_extract_email(unittest.TestCase):

    def test(self):

        enhancer = enhance_extract_email.enhance_extract_email()

        data = {}
        data['content_txt'] = "one@localnet.localdomain at begin and two@localnet2.localdomain in the middle and end of the line three@localnet3.localdomain\na_underscore@localnet.localdomain and some.points.here@localnet.localdomain"

        parameters, data = enhancer.process(data=data)

        self.assertTrue('one@localnet.localdomain' in data['email_ss'])
        self.assertTrue('two@localnet2.localdomain' in data['email_ss'])
        self.assertTrue('three@localnet3.localdomain' in data['email_ss'])
        self.assertTrue('a_underscore@localnet.localdomain' in data['email_ss'])
        self.assertTrue('some.points.here@localnet.localdomain' in data['email_ss'])

        self.assertTrue('localnet.localdomain' in data['email_domain_ss'])
        self.assertTrue('localnet2.localdomain' in data['email_domain_ss'])
        self.assertTrue('localnet3.localdomain' in data['email_domain_ss'])
       
if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_extract_law.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

from etl import ETL

class Test_enhance_extract_law(unittest.TestCase):

    def test(self):

        etl = ETL()
        etl.config['plugins'] = ['enhance_entity_linking', 'enhance_extract_law']
        etl.config['raise_pluginexception'] = True
        data = {}
        data['content_txt'] = "\n".join([
            "abc § 888 xyz"
            "abc § 987 b xyz"
            "§12",
            "§ 123",
            "§345a",
            "§456 b",
            "§ 567 c",
            "BGB § 153 Abs. 1 Satz 2",
            "§ 52 Absatz 1 Nummer 2 Buchstabe c STGB",
            "§ 444 CC"
        ])


        # run ETL of test.pdf with configured plugins and PDF OCR (result of etl_file.py)
        parameters, data = etl.process(parameters={'id': 'test_enhance_extract_law'}, data=data)

        self.assertTrue('§ 888' in data['law_clause_ss'])
        self.assertTrue('§ 987 b' in data['law_clause_ss'])
        self.assertTrue('§ 12' in data['law_clause_ss'])
        self.assertTrue('§ 123' in data['law_clause_ss'])
        self.assertTrue('§ 345a' in data['law_clause_ss'])
        self.assertTrue('§ 456 b' in data['law_clause_ss'])
        self.assertTrue('§ 567 c' in data['law_clause_ss'])

        self.assertTrue('§ 153 Abs. 1 Satz 2' in data['law_clause_ss'])
        self.assertTrue('§ 52 Absatz 1 Nummer 2 Buchstabe c' in data['law_clause_ss'])
        
        self.assertTrue('Strafgesetzbuch' in data['law_code_ss'])
        self.assertTrue('Bürgerliches Gesetzbuch' in data['law_code_ss'])
        
        self.assertTrue('Swiss Civil Code' in data['law_code_ss'])


    def test_blacklist(self):

        etl = ETL()
        etl.config['plugins'] = ['enhance_entity_linking', 'enhance_extract_law']
        etl.config['raise_pluginexception'] = True
        data = {}
        data['content_txt'] = "\n".join([
            "No clause for law code alias CC"
        ])

        parameters, data = etl.process(parameters={'id': 'test_enhance_extract_law'}, data=data)
        
        self.assertFalse('Swiss Civil Code' in data['law_code_ss'])

        data['content_txt'] = "\n".join([
            "No clause for blacklisted law code alias CC but not blacklisted label of this alias: Swiss Civil Code"
        ])

        parameters, data = etl.process(parameters={'id': 'test_enhance_extract_law'}, data=data)
        
        self.assertTrue('Swiss Civil Code' in data['law_code_ss'])


if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_extract_money.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

from etl import ETL

class Test_enhance_extract_money(unittest.TestCase):

    def test(self):

        etl = ETL()
        etl.config['plugins'] = ['enhance_entity_linking', 'enhance_extract_money']
        etl.config['raise_pluginexception'] = True
        data = {}
        data['content_txt'] = "\n".join([
            "abc $ 123 xyz",
            "abc $ 124,000 xyz",
            "abc 234 $ xyz",
            "abc 235,000 $ xyz",
            "abc 236,99 $ xyz",
            "abc $1234 xyz",
            "abc 2345$ xyz",
            "4444 dollar",
            "44444 USD",
            "444 €",
            "445.000 €",
            "450,99 €",
            "4444 EUR",
            "46.000 EUR",
            "47.000,99 EUR",
            "44,22 EURO",
            "if ambiguous like $ 77 € for more completeness we want to extract both possible variants",
        ])


        parameters, data = etl.process(parameters={'id': 'test_enhance_extract_money'}, data=data)

        self.assertTrue('$ 123' in data['money_ss'])
        self.assertTrue('$ 124,000' in data['money_ss'])
        self.assertTrue('234 $' in data['money_ss'])
        self.assertTrue('235,000 $' in data['money_ss'])
        self.assertTrue('236,99 $' in data['money_ss'])
        self.assertTrue('$1234' in data['money_ss'])
        self.assertTrue('2345$' in data['money_ss'])
        self.assertTrue('4444 dollar' in data['money_ss'])
        self.assertTrue('44444 USD' in data['money_ss'])
        self.assertTrue('444 €' in data['money_ss'])
        self.assertTrue('445.000 €' in data['money_ss'])
        self.assertTrue('450,99 €' in data['money_ss'])
        self.assertTrue('4444 EUR' in data['money_ss'])
        self.assertTrue('46.000 EUR' in data['money_ss'])
        self.assertTrue('47.000,99 EUR' in data['money_ss'])
        self.assertTrue('44,22 EURO' in data['money_ss'])
        self.assertTrue('$ 77' in data['money_ss'])
        self.assertTrue('77 €' in data['money_ss'])

    def test_numerizer(self):

        etl = ETL()
        etl.config['plugins'] = ['enhance_entity_linking', 'enhance_extract_money']
        etl.config['raise_pluginexception'] = True
        data = { 'language_s': 'en' }
        data['content_txt'] = "\n".join([
            "So two million two hundred and fifty thousand and seven $ were given to them",
            "We got twenty one thousand four hundred and seventy three dollars from someone",
        ])

        parameters, data = etl.process(parameters={'id': 'test_enhance_extract_money_numerize'}, data=data)

        self.assertTrue('2250007 $' in data['money_ss'])
        self.assertTrue('21473 dollars' in data['money_ss'])

if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_extract_text_tika_server.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest
import os

import enhance_extract_text_tika_server

class TestEnhanceExtractTextTikaServer(unittest.TestCase):

    # delete OCR cache entries for the images used in this test class
    def delete_ocr_cache_entries(self):
        filenames = [
            '/var/cache/tesseract/eng-4c6bf51d4455e1cb58b7d8dd20fb8846f15a3d2c884dc8859802ed689f74ae7a-e96c4b1545a83d86d05f7fbb12ade96d.txt',
            '/var/cache/tesseract/eng-526959d31f4e6b1947bb00c3a02959ef008ce19b9487d95b3df0656159f55a7a-e96c4b1545a83d86d05f7fbb12ade96d.txt',
            '/var/cache/tesseract/eng-c93c49c9dfc9764a4307c2757eb378b2d8cd00f3007ac450605b83f23ecda900-e96c4b1545a83d86d05f7fbb12ade96d.txt',
            '/var/cache/tesseract/eng-ebce8ee4ea7d3d24fe9384212d944adeb58e8f18be15ec06103454f7eade70f5-e96c4b1545a83d86d05f7fbb12ade96d.txt'
        ]
        for filename in filenames:
            if os.path.exists(filename):
                os.remove(filename)

    def setUp(self):
        self.delete_ocr_cache_entries()
    def tearDown(self):
        self.delete_ocr_cache_entries()

    def test_text_extraction_pdf(self):

        enhancer = enhance_extract_text_tika_server.enhance_extract_text_tika_server()

        parameters = {'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/test.pdf'}

        parameters, data = enhancer.process(parameters=parameters)

        # check extracted content type
        self.assertTrue(data['content_type_ss'] == 'application/pdf'
                        or sorted(data['content_type_ss']) == ['application/pdf', 'image/jpeg', 'image/png'])

        # check extracted title
        self.assertEqual(data['title_txt'], 'TestPDFtitle')

        # check extracted content of PDF text
        self.assertTrue('TestPDFContent1 on TestPDFPage1' in data['content_txt'])
        self.assertTrue('TestPDFContent2 on TestPDFPage2' in data['content_txt'])

        # check disabled OCR of embedded images in PDF
        self.assertFalse('TestPDFOCRImage1Content1' in data['content_txt'])
        self.assertFalse('TestPDFOCRImage1Content2' in data['content_txt'])
        self.assertFalse('TestPDFOCRImage2Content1' in data['content_txt'])
        self.assertFalse('TestPDFOCRImage2Content2' in data['content_txt'])

    def test_text_extraction_pdf_ocr(self):

        enhancer = enhance_extract_text_tika_server.enhance_extract_text_tika_server()

        parameters = {'ocr': True, 'plugins': ['enhance_pdf_ocr'],
                      'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/test.pdf'}

        parameters, data = enhancer.process(parameters=parameters)

        # check extracted content type
        self.assertTrue(sorted(data['content_type_ss']) == ['application/pdf', 'image/jpeg', 'image/png'])

        # check extracted title
        self.assertEqual(data['title_txt'], 'TestPDFtitle')

        # check extracted content of PDF text
        self.assertTrue('TestPDFContent1 on TestPDFPage1' in data['content_txt'])
        self.assertTrue('TestPDFContent2 on TestPDFPage2' in data['content_txt'])

        # check OCR of embedded images in PDF
        self.assertTrue('TestPDFOCRImage1Content1' in data['content_txt'])
        self.assertTrue('TestPDFOCRImage1Content2' in data['content_txt'])
        self.assertTrue('TestPDFOCRImage2Content1' in data['content_txt'])
        self.assertTrue('TestPDFOCRImage2Content2' in data['content_txt'])

    def test_text_extraction_pdf_ocr_cache(self):

        # add text (changed for this test) to ocr cache, so we can prove that the cache was used
        file = open('/var/cache/tesseract/eng-c93c49c9dfc9764a4307c2757eb378b2d8cd00f3007ac450605b83f23ecda900-e96c4b1545a83d86d05f7fbb12ade96d.txt', "w")
        file.write("TestPDFOCRCacheImage1Content1\n\nTestPDFOCRCacheImage1Content2")
        file.close()

        file = open('/var/cache/tesseract/eng-526959d31f4e6b1947bb00c3a02959ef008ce19b9487d95b3df0656159f55a7a-e96c4b1545a83d86d05f7fbb12ade96d.txt', "w")
        file.write("TestPDFOCRCacheImage2Content1\n\nTestPDFOCRCacheImage2Content2")
        file.close()

        enhancer = enhance_extract_text_tika_server.enhance_extract_text_tika_server()

        parameters = {'ocr': True, 'plugins': ['enhance_pdf_ocr'],
                      'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/test.pdf'}

        parameters, data = enhancer.process(parameters=parameters)

        # check extracted content type
        self.assertTrue(sorted(data['content_type_ss']) == ['application/pdf', 'image/jpeg', 'image/png'])

        # check extracted title
        self.assertEqual(data['title_txt'], 'TestPDFtitle')

        # check extracted content of PDF text
        self.assertTrue('TestPDFContent1 on TestPDFPage1' in data['content_txt'])
        self.assertTrue('TestPDFContent2 on TestPDFPage2' in data['content_txt'])

        # check OCR of embedded images in PDF
        self.assertTrue('TestPDFOCRCacheImage1Content1' in data['content_txt'])
        self.assertTrue('TestPDFOCRCacheImage1Content2' in data['content_txt'])
        self.assertTrue('TestPDFOCRCacheImage2Content1' in data['content_txt'])
        self.assertTrue('TestPDFOCRCacheImage2Content2' in data['content_txt'])

    def test_ocr_png(self):

        enhancer = enhance_extract_text_tika_server.enhance_extract_text_tika_server()

        parameters = {'ocr': True,
                      'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/Test_OCR_Image1.png'}

        parameters, data = enhancer.process(parameters=parameters)

        # check extracted content type
        self.assertEqual(data['content_type_ss'], 'image/png')

        # check OCR
        self.assertTrue('TestOCRImage1Content1' in data['content_txt'])
        self.assertTrue('TestOCRImage1Content2' in data['content_txt'])

    def test_ocr_jpg(self):

        enhancer = enhance_extract_text_tika_server.enhance_extract_text_tika_server()

        parameters = {'ocr': True,
                      'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/Test_OCR_Image2.jpg'}

        parameters, data = enhancer.process(parameters=parameters)

        # check extracted content type
        self.assertEqual(data['content_type_ss'], 'image/jpeg')

        # check OCR
        self.assertTrue('TestOCRImage2Content1' in data['content_txt'])
        self.assertTrue('TestOCRImage2Content2' in data['content_txt'])

    def test_disabled_ocr_png(self):

        enhancer = enhance_extract_text_tika_server.enhance_extract_text_tika_server()

        parameters = {'ocr': False,
                      'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/Test_OCR_Image1.png'}

        parameters, data = enhancer.process(parameters=parameters)

        # check extracted content type
        self.assertEqual(data['content_type_ss'], 'image/png')

        # check disabled OCR
        self.assertFalse('TestOCRImage1Content1' in data['content_txt'])
        self.assertFalse('TestOCRImage1Content2' in data['content_txt'])

        # check if Fake tesseract wrapper returned status
        self.assertTrue('[Image (no OCR yet)]' in data['content_txt'])
       

if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_mapping_id.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

import enhance_mapping_id

class Test_enhance_mapping_id(unittest.TestCase):

    def test(self):

        enhancer = enhance_mapping_id.enhance_mapping_id()

        mappings = {
                       "/": "file:///",
                       "/testdir1/": "file:///deep1testdir1/",
                       "/testdir1/testdir2/": "file:///deep2testdir1/deep2testdir2/",
        }
        
        docid = '/test'
        parameters, data = enhancer.process(parameters={'id': docid, 'mappings': mappings})
        self.assertEqual(parameters['id'], 'file:///test')

        docid = '/testdir1/test'
        parameters, data = enhancer.process(parameters={'id': docid, 'mappings': mappings})
        self.assertEqual(parameters['id'], 'file:///deep1testdir1/test')

        docid = '/testdir1/testdir2/test'
        parameters, data = enhancer.process(parameters={'id': docid, 'mappings': mappings})
        self.assertEqual(parameters['id'], 'file:///deep2testdir1/deep2testdir2/test')


    def test_reverse(self):

        mappings = {
                       "/": "file:///",
                       "/testdir1/": "file:///deep1testdir1/",
                       "/testdir1/testdir2/": "file:///deep2testdir1/deep2testdir2/",
        }
        
        docid = 'file:///test'
        reversed_value = enhance_mapping_id.mapping_reverse (docid, mappings)
        self.assertEqual(reversed_value, '/test')

        docid = 'file:///deep1testdir1/test'
        reversed_value = enhance_mapping_id.mapping_reverse (docid, mappings)
        self.assertEqual(reversed_value, '/testdir1/test')

        docid = 'file:///deep2testdir1/deep2testdir2/test'
        reversed_value = enhance_mapping_id.mapping_reverse (docid, mappings)
        self.assertEqual(reversed_value, '/testdir1/testdir2/test')

       
if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_ner_spacy.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

import enhance_ner_spacy

config = {
    'spacy_ner_classifiers': {
        'de': 'de_core_news_sm',
        'en': 'en_core_web_md'
    }
}

class Test_enhance_ner_spacy(unittest.TestCase):

    def test_en(self):

        enhancer = enhance_ner_spacy.enhance_ner_spacy()

        parameters = config.copy()
        data = {
            'language_s': 'en',
            'content_txt': "Some years ago, Mr. Barack Obama, a member of Democratic Party, was president of the USA."
        }

        parameters, data = enhancer.process(parameters=parameters, data=data)

        self.assertTrue('Barack Obama' in data['person_ss'])
        self.assertTrue('Democratic Party' in data['organization_ss'])
        self.assertTrue('USA' in data['location_ss'])


    def test_de(self):

        enhancer = enhance_ner_spacy.enhance_ner_spacy()

        parameters = config.copy()
        data = {
            'language_s': 'de',
            'content_txt': "Der Text ist über Frau Dr. Angela Merkel. Sie ist Mitglied in der CDU. Sie lebt in Deutschland."
        }

        parameters, data = enhancer.process(parameters=parameters, data=data)

        self.assertTrue('Angela Merkel' in data['person_ss'])
        self.assertTrue('CDU' in data['organization_ss'])
        self.assertTrue('Deutschland' in data['location_ss'])

       
if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_path.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

import enhance_path

class Test_enhance_path(unittest.TestCase):

    def test(self):

        enhancer = enhance_path.enhance_path()


        docid = '/home/user/test.pdf'
        parameters, data = enhancer.process(parameters={'id': docid})

        self.assertEqual(data['path0_s'], 'home')
        self.assertEqual(data['path1_s'], 'user')
        self.assertEqual(data['path_basename_s'], 'test.pdf')
        self.assertEqual(data['filename_extension_s'], 'pdf')


        docid = '/home/user/test_without_filename_extension'
        parameters, data = enhancer.process(parameters={'id': docid})

        self.assertFalse('filename_extension_s' in data)


        docid = '/home/user/test.PDF'
        parameters, data = enhancer.process(parameters={'id': docid})

        self.assertEqual(data['filename_extension_s'], 'pdf')


if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_pdf_ocr.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest
import os

import enhance_pdf_ocr

class Test_enhance_pdf_ocr(unittest.TestCase):

    # check OCR of embedded images in PDF
    def test_pdf_ocr(self):
        
        enhancer = enhance_pdf_ocr.enhance_pdf_ocr()

        parameters = {'ocr_pdf_tika': False, 'filename': os.path.dirname(os.path.realpath(__file__)) + '/testdata/test.pdf', 'ocr_cache': '/var/cache/tesseract', 'content_type_ss': 'application/pdf', 'plugins':[]}

        parameters, data = enhancer.process(parameters=parameters)

        self.assertTrue('TestPDFOCRImage1Content1' in data['ocr_t'])
        self.assertTrue('TestPDFOCRImage1Content2' in data['ocr_t'])
        self.assertTrue('TestPDFOCRImage2Content1' in data['ocr_t'])
        self.assertTrue('TestPDFOCRImage2Content2' in data['ocr_t'])


if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_regex.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest

import enhance_regex

class Test_enhance_regex(unittest.TestCase):

    def test(self):

        enhancer = enhance_regex.enhance_regex()

        parameters = {}
        parameters['verbose'] = True
        parameters['regex_lists'] = ['/etc/opensemanticsearch/regex/iban.tsv']

        data = {}
        data['content_txt'] = "An IBAN DE75512108001245126199 from Germany and GB33BUKB20201555555555 from GB and not 75512108001245126199"

        parameters, data = enhancer.process(data=data, parameters=parameters)

        self.assertTrue('DE75512108001245126199' in data['iban_ss'])
        self.assertTrue('GB33BUKB20201555555555' in data['iban_ss'])

        self.assertFalse('75512108001245126199' in data['iban_ss'])
       
       
if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_enhance_warc.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest
import os

from etl_file import Connector_File
from etl_delete import Delete
from export_solr import export_solr

class Test_enhance_warc(unittest.TestCase):

    @unittest.expectedFailure # Test fails on deleting in Solr index until next release of pysolr (https://github.com/opensemanticsearch/open-semantic-etl/issues/154)
    def test_warc(self):

        etl_file = Connector_File()
        exporter = export_solr()

        filename = os.path.dirname(os.path.realpath(__file__)) + '/testdata/example.warc'

        # run ETL of example.warc with configured plugins and warc extractor
        parameters, data = etl_file.index_file(filename = filename)

        contained_doc_id = 'http://example.com/<urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>'
        fields = ['id', 'title_txt', 'content_type_ss', 'content_txt']

        data = exporter.get_data(contained_doc_id, fields)

        # delete from search index
        etl_delete = Delete()
        etl_delete.delete(filename)
        etl_delete.delete(contained_doc_id)

        self.assertEqual(data['title_txt'], ['Example Domain'])

        self.assertEqual(data['content_type_ss'], ['text/html; charset=UTF-8'])

        self.assertTrue('This domain is established to be used for illustrative examples in documents.' in data['content_txt'][0])

if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_etl_file.py
================================================
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import unittest
import os

from etl_file import Connector_File
from etl_delete import Delete

class Test_ETL_file(unittest.TestCase):

    def test_pdf_and_ocr_by_tika(self):

        etl_file = Connector_File()
        filename = os.path.dirname(os.path.realpath(__file__)) + '/testdata/test.pdf'

        # run ETL of test.pdf with configured plugins and PDF OCR (result of etl_file.py)
        parameters, data = etl_file.index_file(filename = filename, additional_plugins=['enhance_pdf_ocr'])

        # delete from search index
        etl_delete = Delete()
        etl_delete.delete(filename)

        # check extracted content type
        self.assertTrue(data['content_type_ss'] == 'application/pdf' or sorted(data['content_type_ss']) == ['application/pdf', 'image/jpeg', 'image/png'])

        # check content type group which is mapped to this content type (result of plugin enhance_contenttype_group.py)
        self.assertTrue(data['content_type_group_ss'] == ['Text document'] or sorted(data['content_type_group_ss']) == ['Image', 'Text document'])

        # check extracted title (result of plugin enhance_extract_text_tika_server.py)
        self.assertEqual(data['title_txt'], 'TestPDFtitle')

        # check extracted content of PDF text (result of plugin enhance_extract_text_tika_server.py)
        self.assertTrue('TestPDFContent1 on TestPDFPage1' in data['content_txt'])
        self.assertTrue('TestPDFContent2 on TestPDFPage2' in data['content_txt'])

        # check OCR of embedded images in PDF (result of plugin enhance_pdf_ocr.py)
        self.assertTrue('TestPDFOCRImage1Content1' in data['content_txt'])
        self.assertTrue('TestPDFOCRImage1Content2' in data['content_txt'])
        self.assertTrue('TestPDFOCRImage2Content1' in data['content_txt'])
        self.assertTrue('TestPDFOCRImage2Content2' in data['content_txt'])

        # OCR done by Tika so in field content_txt, not in OCR plugin field ocr_t
        self.assertFalse('ocr_t' in data)

        # OCR text copied to default search field by plugin enhance_multilingual?
        default_search_field_data = ' '.join(data['_text_'])
        self.assertTrue('TestPDFOCRImage1Content1' in default_search_field_data)
        self.assertTrue('TestPDFOCRImage1Content2' in default_search_field_data)
        self.assertTrue('TestPDFOCRImage2Content1' in default_search_field_data)
        self.assertTrue('TestPDFOCRImage2Content2' in default_search_field_data)

        # check if a Open Semantic ETL plugin threw an exception
        self.assertEqual(data['etl_error_plugins_ss'], [])


    def test_ocr_by_plugin_enhance_pdf_ocr(self):

        etl_file = Connector_File()
        filename = os.path.dirname(os.path.realpath(__file__)) + '/testdata/test.pdf'

        etl_file.config['ocr_pdf_tika'] = False

        # run ETL of test.pdf with configured plugins and PDF OCR (result of etl_file.py)
        parameters, data = etl_file.index_file(filename = filename, additional_plugins=['enhance_pdf_ocr'])

        # delete from search index
        etl_delete = Delete()
        etl_delete.delete(filename)

        # check OCR of embedded images in PDF (result of plugin enhance_pdf_ocr.py)
        self.assertTrue('TestPDFOCRImage1Content1' in data['ocr_t'])
        self.assertTrue('TestPDFOCRImage1Content2' in data['ocr_t'])
        self.assertTrue('TestPDFOCRImage2Content1' in data['ocr_t'])
        self.assertTrue('TestPDFOCRImage2Content2' in data['ocr_t'])

        # OCR text copied to default search field?
        default_search_field_data = ' '.join(data['_text_'])
        self.assertTrue('TestPDFOCRImage1Content1' in default_search_field_data)
        self.assertTrue('TestPDFOCRImage1Content2' in default_search_field_data)
        self.assertTrue('TestPDFOCRImage2Content1' in default_search_field_data)
        self.assertTrue('TestPDFOCRImage2Content2' in default_search_field_data)

        # check if a Open Semantic ETL plugin threw an exception
        self.assertEqual(data['etl_error_plugins_ss'], [])

if __name__ == '__main__':
    unittest.main()


================================================
FILE: src/opensemanticetl/test_move_indexed_files.py
================================================
import unittest
from unittest import mock

import json
import itertools

import move_indexed_file


class TestMove(unittest.TestCase):
    def test_move_files(self):
        mock_get_files = mock.Mock(return_value=[
            {'id': 'file:///book1', 'title_t': 'Snow Crash',
             'copies_i': 5, 'cat_ss': ['Science Fiction'],
             'path_basename_s': 'book1', '_version_': 1641756143516647424},
            {'id': 'file:///folder/book2',
             'title_t': 'Other book',
             'copies_i': 3, 'cat_ss': ['Round House Kicks'],
             'path0_s': 'folder',
             'path_basename_s': 'book2', '_version_': 1641756143518744576}])
        mock_post = mock.Mock()

        def mock_prepare(adds, delete_ids):
            self.assertEqual(
                list(adds),
                [
                    {'id': 'file:///snow_crash', 'title_t': 'Snow Crash',
                     'copies_i': 5, 'cat_ss': ['Science Fiction'],
                     'path_basename_s': 'snow_crash'},
                    {'id': 'file:///other_book',
                     'title_t': 'Other book',
                     'copies_i': 3, 'cat_ss': ['Round House Kicks'],
                     'path_basename_s': 'other_book'}])
            self.assertEqual(tuple(delete_ids),
                             ("file:///book1", "file:///folder/book2"))
        with mock.patch("move_indexed_file.get_files", mock_get_files), \
                mock.patch("move_indexed_file.post", mock_post), \
                mock.patch("move_indexed_file.prepare_payload",
                           mock_prepare):
            move_indexed_file.move_files(
                None, {"/book1": "/snow_crash",
                       "/folder/book2": "/other_book"},
                prefix="file://")

    def test_move_dir(self):
        mock_get_files = mock.Mock(return_value=[
            {'id': 'file:///folder/book1', 'title_t': 'Snow Crash',
             'copies_i': 5, 'cat_ss': ['Science Fiction'],
             'path0_s': 'folder',
             'path_basename_s': 'book1', '_version_': 1641756143516647424},
            {'id': 'file:///folder/book2',
             'title_t': 'Other book',
             'copies_i': 3, 'cat_ss': ['Round House Kicks'],
             'path0_s': 'folder',
             'path_basename_s': 'book2', '_version_': 1641756143518744576}])
        mock_post = mock.Mock()

        def mock_prepare(adds, delete_ids):
            self.assertEqual(
                list(adds),
                [
                    {'id': 'file:///dest/book1', 'title_t': 'Snow Crash',
                     'copies_i': 5, 'cat_ss': ['Science Fiction'],
                     'path0_s': 'dest',
                     'path_basename_s': 'book1'},
                    {'id': 'file:///dest/book2',
                     'title_t': 'Other book',
                     'copies_i': 3, 'cat_ss': ['Round House Kicks'],
                     'path0_s': 'dest',
                     'path_basename_s': 'book2'}])
            self.assertEqual(tuple(delete_ids),
                             ("file:///folder/book1", "file:///folder/book2"))
        with mock.patch("move_indexed_file.get_files_in_dir",
                        mock_get_files), \
                mock.patch("move_indexed_file.post", mock_post), \
                mock.patch("move_indexed_file.prepare_payload",
                           mock_prepare):
            move_indexed_file.move_dir(
                None, src="/folder", dest="/dest/",
                prefix="file://")

    def test_get_pages(self):
        step = 3

        original_docs = [{'id': i} for i in range(10)]

        def mock_urlopen():
            docs_iter = iter(original_docs)
            responses = (
                mock_response(
                    {
                        "response": {
                            "numFound": len(original_docs),
                            "docs": list(itertools.islice(docs_iter, step))
                        }
                    }
                )
                for __ in range(0, len(original_docs), step))

            def _mock(*_, **__):
                return next(responses)
            return _mock
        with mock.patch("urllib.request.urlopen",
                        mock_urlopen()):
            docs = sum(move_indexed_file.get_pages("", "", limit=step), [])
        self.assertEqual(docs, original_docs)


def mock_response(data):
    return type("MockResponse", (object,), {
        "read": json.dumps(data).encode})


================================================
FILE: src/opensemanticetl/testdata/README.md
================================================
Automated tests by unittest
===========================

Automated tests are implemented using the Python library unittest:

https://docs.python.org/3/library/unittest.html

The code for the unit tests is located in the directory "src/opensemanticetl" in files with the prefix "test_".

Some files with testdata like test documents (see section "Testdata") are located in the subdirectory "test".


Run all tests
=============

Within the directory "src/opensemanticetl" call

python3 -m unittest

to run all available tests for all modules and plugins.


Run tests for a single plugin
=============================

You can run only the tests for a single plugin you currently work on:

For example to test only the Tika plugin for text extraction ("enhance_extract_text_tika_server.py"), call

python3 -m unittest test_enhance_extract_text_tika_server


CI/CD
=====

The script run_tests.sh is called for automated tests within a Docker container configured by docker-compose.test.yml in the root directory of this Git repository.


Testdata
========

Test documents located in subdirectory "test":


test.pdf
--------

A test PDF with two pages with text content to test content extraction and embedded images to test OCR.


Test_OCR_Image1.png
-------------------

PNG image with content "TestOCRImage1Content1" and "TestOCRImage1Content2" to test OCR.


Test_OCR_Image2.jpg
-------------------

JPEG image with content "TestOCRImage2Content1" and "TestOCRImage2Content2" to test OCR.


================================================
FILE: src/opensemanticetl/testdata/run_integrationtests.sh
================================================
#!/bin/sh

python3 -m unittest discover -s /usr/lib/python3/dist-packages/entity_linking/

python3 -m unittest discover -s /usr/lib/python3/dist-packages/opensemanticetl/


================================================
FILE: src/opensemanticetl/testdata/run_tests.sh
================================================
#!/bin/sh

cd /usr/lib/python3/dist-packages/opensemanticetl

python3 -m unittest \
 test_enhance_extract_email \
 test_enhance_mapping_id \
 test_enhance_path