Repository: MantisAI/nervaluate
Branch: main
Commit: cde2d1b392a4
Files: 25
Total size: 192.7 KB

Directory structure:
nervaluate/

├── .gitchangelog.rc
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       └── CI-checks.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CHANGELOG.rst
├── CITATION.cff
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── examples/
│   ├── example_no_loader.py
│   └── run_example.sh
├── pyproject.toml
├── src/
│   └── nervaluate/
│       ├── __init__.py
│       ├── entities.py
│       ├── evaluator.py
│       ├── loaders.py
│       ├── strategies.py
│       └── utils.py
└── tests/
    ├── __init__.py
    ├── test_entities.py
    ├── test_evaluator.py
    ├── test_loaders.py
    ├── test_strategies.py
    └── test_utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitchangelog.rc
================================================
# -*- coding: utf-8; mode: python -*-
##
## Format
##
##   ACTION: [AUDIENCE:] COMMIT_MSG [!TAG ...]
##
## Description
##
##   ACTION is one of 'chg', 'fix', 'new'
##
##       Is WHAT the change is about.
##
##       'chg' is for refactor, small improvement, cosmetic changes...
##       'fix' is for bug fixes
##       'new' is for new features, big improvement
##
##   AUDIENCE is optional and one of 'dev', 'usr', 'pkg', 'test', 'doc'
##
##       Is WHO is concerned by the change.
##
##       'dev'  is for developers (API changes, refactors...)
##       'usr'  is for final users (UI changes)
##       'pkg'  is for packagers   (packaging changes)
##       'test' is for testers     (test only related changes)
##       'doc'  is for doc guys    (doc only changes)
##
##   COMMIT_MSG is ... well ... the commit message itself.
##
##   TAGs are additional adjectives such as 'refactor', 'minor', 'cosmetic'
##
##       They are preceded with a '!' or a '@' (prefer the former, as the
##       latter is wrongly interpreted in github.) Commonly used tags are:
##
##       'refactor' is obviously for refactoring code only
##       'minor' is for a trivial change (a typo, adding a comment)
##       'cosmetic' is for cosmetic driven change (re-indentation, 80-col...)
##       'wip' is for partial functionality but complete subfunctionality.
##
## Example:
##
##   new: usr: support of bazaar implemented
##   chg: re-indented some lines !cosmetic
##   new: dev: updated code to be compatible with last version of killer lib.
##   fix: pkg: updated year of licence coverage.
##   new: test: added a bunch of test around user usability of feature X.
##   fix: typo in spelling my name in comment. !minor
##
##   Please note that multi-line commit messages are supported, and only the
##   first line will be considered as the "summary" of the commit message. So
##   tags and other rules only apply to the summary. The body of the commit
##   message will be displayed in the changelog without reformatting.


##
## ``ignore_regexps`` is a list of regexps
##
## Any commit having its full commit message matching any regexp listed here
## will be ignored and won't be reported in the changelog.
##
ignore_regexps = [
    r'@minor', r'!minor',
    r'@cosmetic', r'!cosmetic',
    r'@refactor', r'!refactor',
    r'@wip', r'!wip',
    r'^([cC]hg|[fF]ix|[nN]ew)\s*:\s*[p|P]kg:',
    r'^([cC]hg|[fF]ix|[nN]ew)\s*:\s*[d|D]ev:',
    r'^(.{3,3}\s*:)?\s*[fF]irst commit.?\s*$',
    r'^$',  ## ignore commits with empty messages
]


## ``section_regexps`` is a list of 2-tuples associating a string label and a
## list of regexp
##
## Commit messages will be classified in sections thanks to this. Section
## titles are the label, and a commit is classified under this section if any
## of the regexps associated is matching.
##
## Please note that ``section_regexps`` will only classify commits and won't
## make any changes to the contents. So you'll probably want to go check
## ``subject_process`` (or ``body_process``) to do some changes to the subject,
## whenever you are tweaking this variable.
##
section_regexps = [
    ('New', [
        r'^[nN]ew\s*:\s*((dev|use?r|pkg|test|doc)\s*:\s*)?([^\n]*)$',
     ]),
    ('Changes', [
        r'^[cC]hg\s*:\s*((dev|use?r|pkg|test|doc)\s*:\s*)?([^\n]*)$',
     ]),
    ('Fix', [
        r'^[fF]ix\s*:\s*((dev|use?r|pkg|test|doc)\s*:\s*)?([^\n]*)$',
     ]),

    ('Other', None ## Match all lines
     ),

]


## ``body_process`` is a callable
##
## This callable will be given the original body and result will
## be used in the changelog.
##
## Available constructs are:
##
##   - any python callable that takes one txt argument and returns a txt argument.
##
##   - ReSub(pattern, replacement): will apply regexp substitution.
##
##   - Indent(chars="  "): will indent the text with the prefix
##     Please remember that template engines gets also to modify the text and
##     will usually indent themselves the text if needed.
##
##   - Wrap(regexp=r"\n\n"): re-wrap text in separate paragraph to fill 80-Columns
##
##   - noop: do nothing
##
##   - ucfirst: ensure the first letter is uppercase.
##     (usually used in the ``subject_process`` pipeline)
##
##   - final_dot: ensure text finishes with a dot
##     (usually used in the ``subject_process`` pipeline)
##
##   - strip: remove any spaces before or after the content of the string
##
##   - SetIfEmpty(msg="No commit message."): will set the text to
##     whatever given ``msg`` if the current text is empty.
##
## Additionally, you can `pipe` the provided filters, for instance:
#body_process = Wrap(regexp=r'\n(?=\w+\s*:)') | Indent(chars="  ")
#body_process = Wrap(regexp=r'\n(?=\w+\s*:)')
#body_process = noop
body_process = ReSub(r'((^|\n)[A-Z]\w+(-\w+)*: .*(\n\s+.*)*)+$', r'') | strip


## ``subject_process`` is a callable
##
## This callable will be given the original subject and result will
## be used in the changelog.
##
## Available constructs are those listed in ``body_process`` doc.
subject_process = (strip |
    ReSub(r'^([cC]hg|[fF]ix|[nN]ew)\s*:\s*((dev|use?r|pkg|test|doc)\s*:\s*)?([^\n@]*)(@[a-z]+\s+)*$', r'\4') |
    SetIfEmpty("No commit message.") | ucfirst | final_dot)


## ``tag_filter_regexp`` is a regexp
##
## Tags that will be used for the changelog must match this regexp.
##
tag_filter_regexp = r'^[0-9]+\.[0-9]+(\.[0-9]+)?$'


## ``unreleased_version_label`` is a string or a callable that outputs a string
##
## This label will be used as the changelog Title of the last set of changes
## between last valid tag and HEAD if any.
unreleased_version_label = "(unreleased)"


## ``output_engine`` is a callable
##
## This will change the output format of the generated changelog file
##
## Available choices are:
##
##   - rest_py
##
##        Legacy pure python engine, outputs ReSTructured text.
##        This is the default.
##
##   - mustache(<template_name>)
##
##        Template name could be any of the available templates in
##        ``templates/mustache/*.tpl``.
##        Requires python package ``pystache``.
##        Examples:
##           - mustache("markdown")
##           - mustache("restructuredtext")
##
##   - makotemplate(<template_name>)
##
##        Template name could be any of the available templates in
##        ``templates/mako/*.tpl``.
##        Requires python package ``mako``.
##        Examples:
##           - makotemplate("restructuredtext")
##
output_engine = rest_py
#output_engine = mustache("restructuredtext")
#output_engine = mustache("markdown")
#output_engine = makotemplate("restructuredtext")


## ``include_merge`` is a boolean
##
## This option tells git-log whether to include merge commits in the log.
## The default is to include them.
include_merge = True


## ``log_encoding`` is a string identifier
##
## This option tells gitchangelog what encoding is output by ``git log``.
## The default is to be clever about it: it checks ``git config`` for
## ``i18n.logOutputEncoding``, and if not found will default to git's own
## default: ``utf-8``.
#log_encoding = 'utf-8'


## ``publish`` is a callable
##
## Sets what ``gitchangelog`` should do with the output generated by
## the output engine. ``publish`` is a callable taking one argument
## that is an iterator on lines from the output engine.
##
## Some helper callables are provided:
##
## Available choices are:
##
##   - stdout
##
##        Outputs directly to standard output
##        (This is the default)
##
##   - FileInsertAtFirstRegexMatch(file, pattern, idx=lambda m: m.start())
##
##        Creates a callable that will parse given file for the given
##        regex pattern and will insert the output in the file.
##        ``idx`` is a callable that receives the match object and
##        must return an integer index where to insert the
##        output in the file. Default is to return the position of
##        the start of the matched string.
##
##   - FileRegexSubst(file, pattern, replace, flags)
##
##        Apply a replace inplace in the given file. Your regex pattern must
##        take care of everything and might be more complex. Check the README
##        for a complete copy-pastable example.
##
# publish = FileInsertIntoFirstRegexMatch(
#     "CHANGELOG.rst",
#     r'/(?P<rev>[0-9]+\.[0-9]+(\.[0-9]+)?)\s+\([0-9]+-[0-9]{2}-[0-9]{2}\)\n--+\n/',
#     idx=lambda m: m.start(1)
# )
#publish = stdout


## ``revs`` is a list of callable or a list of string
##
## Callables will be called and resolved to strings, allowing dynamic
## computation of these values. The result will be used as revisions for
## gitchangelog (as if directly stated on the command line). This allows
## filtering exactly which commits will be read by gitchangelog.
##
## To get a full documentation on the format of these strings, please
## refer to the ``git rev-list`` arguments. There are many examples.
##
## Using callables is especially useful, for instance, if you
## are using gitchangelog to generate incrementally your changelog.
##
## Some helpers are provided, you can use them::
##
##   - FileFirstRegexMatch(file, pattern): will return a callable that will
##     return the first string match for the given pattern in the given file.
##     If you use named sub-patterns in your regex pattern, it'll output only
##     the string matching the regex pattern named "rev".
##
##   - Caret(rev): will return the rev prefixed by a "^", which is a
##     way to remove the given revision and all its ancestors.
##
## Please note that if you provide a rev-list on the command line, it'll
## replace this value (which will then be ignored).
##
## If empty, then ``gitchangelog`` will act as if it had to generate a full
## changelog.
##
## The default is to use all commits to make the changelog.
#revs = ["^1.0.3", ]
#revs = [
#    Caret(
#        FileFirstRegexMatch(
#            "CHANGELOG.rst",
#            r"(?P<rev>[0-9]+\.[0-9]+(\.[0-9]+)?)\s+\([0-9]+-[0-9]{2}-[0-9]{2}\)\n--+\n")),
#    "HEAD"
#]
revs = []

include_merge = False


================================================
FILE: .github/pull_request_template.md
================================================
### Related Issues

- fixes #issue-number

### Proposed Changes:

 <!--- In case of a bug: Describe what caused the issue and how you solved it -->
 <!--- In case of a feature: Describe what did you add and how it works -->

### How did you test it?

<!-- unit tests, integration tests, manual verification, instructions for manual tests -->

### Notes for the reviewer

<!-- E.g. point out sections where the reviewer should focus -->

### Checklist

- I have read the [contributors guidelines](https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md) and the [code of conduct](https://github.com/deepset-ai/haystack/blob/main/code_of_conduct.txt)
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the [conventional commit types](https://www.conventionalcommits.org/en/v1.0.0/) for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:` and added `!` in case the PR includes breaking changes.
- I documented my code
- I ran [pre-commit hooks](https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#installation) and fixed any issue


================================================
FILE: .github/workflows/CI-checks.yml
================================================
name: Linting, Type Checking, and Testing

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: ["3.11"]

    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    
    - name: Install Hatch
      run: |
        python -m pip install --upgrade pip
        pip install hatch
    
    - name: Running linters
      run: |
        hatch -e dev run lint

    - name: Type checking with mypy
      run: |
        hatch -e dev run typing

    - name: Running tests
      run: |
        hatch -e dev run test

================================================
FILE: .gitignore
================================================
**.coverage
**.ipynb_checkpoints/
**.mypy_cache/
**/.python-version
**__pycache__/
.tox/
.venv/
build/
coverage.xml
dist/
nervaluate.egg-info/
**/.DS_Store
.idea


================================================
FILE: .pre-commit-config.yaml
================================================
repos:

- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.1.0
  hooks:
  - id: check-yaml

- repo: https://github.com/psf/black
  rev: 22.3.0
  hooks:
  - id: black
    args: [-t, py38, -l 120]

- repo: local
  hooks:
  - id: pylint
    name: pylint
    entry: pylint
    language: system
    types: [ python ]
    args: [--rcfile=pylint.cfg]

- repo: local
  hooks:
  - id: flake8
    name: flake8
    entry: flake8
    language: system
    types: [ python ]
    args: [--config=setup.cfg]

- repo: local
  hooks:
    - id: mypy
      name: mypy
      entry: mypy
      language: python
      language_version: python3.8
      types: [python]
      exclude: examples|tests
      require_serial: true  # use require_serial so that script is only called once per commit
      verbose: true  # print the number of files as a sanity-check
      args: [--config, setup.cfg]

================================================
FILE: CHANGELOG.rst
================================================
Changelog
=========


(unreleased)
------------
- Adding tests + updating README.md. [David S. Batista]
- Fix partial and ent_type precision/recall when merging multi-document
  results. [David S. Batista]

  _merge_results() was calling compute_metrics() with no arguments after
  merging counts, so partial_or_type defaulted to False and strict
  formula (COR/ACT, COR/POS) was used for all strategies. That overwrote
  the correct partial/ent_type P/R (COR+0.5*PAR)/ACT and (COR+0.5*PAR)/POS.

  Now pass strategy_name into _merge_results and call
  compute_metrics(partial_or_type=True) for 'partial' and 'ent_type'
  so merged results keep the SemEval partial-match formula.

  Fixes the bug where partial (and ent_type) reported same P/R as strict
  (e.g. README example showed 0.40 instead of 0.70 for partial).


1.2.0 (2026-03-09)
------------------
- 1.2.0 release. [David S. Batista]
- Updating CHANGELOG. [David S. Batista]
- Bumping version to 1.2.0. [David S. Batista]
- Adding more tests. [David S. Batista]
- Explaining new behaviour docstring + README.md. [David S. Batista]
- Fixing typo. [David S. Batista]
- Fixing typo + linting error. [David S. Batista]
- Refactor: better naming. [DinizNicolas]
- Docs: update docstring. [DinizNicolas]
- Docs: docstring update. [DinizNicolas]
- Style: lint. [DinizNicolas]
- Refactor: change ugly ifs. [DinizNicolas]
- Feat: nested entities support. [DinizNicolas]

  Change partial evaluation to resolve nested entities edge cases.
- Test: nested entities support. [DinizNicolas]

  Add tests for nested entities partial evaluation
- Feat: nested entities support. [DinizNicolas]

  Change exact evaluation to resolve nested entities edge cases.

  Same changes as strict evaluation
- Test: nested entities support. [DinizNicolas]

  Add tests for nested entities exact evaluation
- Feat: nested entities support. [DinizNicolas]

  Change entity type evaluation to resolve nested entities edge cases.

  Before : when a sufficient overlap is found between to entities of the same label, it was counted as correct.

  Now : we search for the best match to count a correct entity. The best match being the one with minimum gap between predicted and true entities boundaries.
- Add tests for nested entities entity type evaluation. [diniznicol]
- Feat: nested entities support Change strict evaluation to resolve
  nested entities edge cases. Before: without a perfect match beetwen a
  true/pred pair, if an overlap was found, it was directly counted as
  incorrect. Now : Only when no perfect match is found in every true
  entity, the first overlapping pred entity found is counted as
  incorrect. [DinizNicolas]
- Test: nested entities support test correction. [DinizNicolas]
- Test: nested entities support. [DinizNicolas]

  Add tests for nested entities strict evaluation


1.1.0 (2025-09-06)
------------------
- 1.1.0 release. [David S. Batista]
- 1.1.0 release. [David S. Batista]
- Testing for single character entities. [David S. Batista]
- Fixing linting issues. [David S. Batista]
- Fixing linting issues. [David S. Batista]
- Defining a min ground truth percentage to be considered an overlap.
  [David S. Batista]
- Chore: removing script to compare old and new version outputs. [David
  S. Batista]


1.0.0 (2025-08-18)
------------------
- 1.0.0 release. [David S. Batista]
- Bumping version. [David S. Batista]
- Removing pandas dependency. [David S. Batista]
- Relaxing tests invalid mode and scenario. [David S. Batista]
- Saving to CSV file or return CSV string. [David S. Batista]
- Adds tests for result indices in all strategies. [Jack Boylan]
- Adds indices tests for `ent_type` strategy. [Jack Boylan]
- Linting import statment. [David S. Batista]
- Wip: using hatch in contributing. [David S. Batista]
- Updating CITATION and removing flake. [David S. Batista]
- Renaming evaluation_strategies to strategies and improving README.
  [David S. Batista]
- Removing old files. [David S. Batista]
- Removing old files. [David S. Batista]
- Updating README.MD. [David S. Batista]
- One more use case. [David S. Batista]
- Comparative indices report overall. [David S. Batista]
- Wip: fixing report indices. [David S. Batista]
- Wip. [David S. Batista]
- Wip: fixing report for entities. [David S. Batista]
- Adding function to generate synthetic data. [David S. Batista]
- Wip: fixing report for entities. [David S. Batista]
- Wip: fixing report for entities. [David S. Batista]
- Only showing entities report for entities that actually apper on
  either true or pred data. [David S. Batista]
- Wip: checking summary with aggregated entities and a specific
  scenario. [David S. Batista]
- Wip: checking summary with aggregated entities and a specific
  scenario. [David S. Batista]
- Updating evaluation strategies tests. [David S. Batista]
- Correcting and fixing type strategy. [David S. Batista]
- Correcting and fixing strict strategy. [David S. Batista]
- Correcting and fixing partial strategy. [David S. Batista]
- Adding partial to evaluation strategies. [David S. Batista]
- Fixing docs lenghts tests. [David S. Batista]
- Fixing docs lenghts tests. [David S. Batista]
- Working on comparative example. [David S. Batista]
- Fixes. [David S. Batista]
- Fxing empty entities. [David S. Batista]
- Fixing imports. [David S. Batista]
- Moving reporting to the Evaluator class. [David S. Batista]
- Working on new versions of summary reports. [David S. Batista]
- Cleaning up README.MD. [David S. Batista]
- Adding missed pyproject.toml. [David S. Batista]
- Type checking. [David S. Batista]
- Fixing all tests. [David S. Batista]
- Adding refactored code. [David S. Batista]
- Separating new and old evaluator logic. [David S. Batista]
- Fixing loaders. [David S. Batista]
- Fixing loading test_conll_loader. [David S. Batista]
- Fixing loading test_dict_loader. [David S. Batista]
- Fixing loading test_list_loader. [David S. Batista]
- Adding tests. [David S. Batista]


0.3.1 (2025-06-05)
------------------
- Fixing pandas dependency. [David S. Batista]
- Fixing pandas dependency. [David S. Batista]


0.3.0 (2025-06-05)
------------------

Changes
~~~~~~~
- Update changelog for 0.2.0 release. [Matthew Upson]

Fix
~~~
- Mypy configuration error. [angelo-digian]
- Typo in type annotation. [angelo-digian]
- Switched order of imports. [angelo-digian]

Other
~~~~~
- 0.3.0 release. [David S. Batista]
- Adding deprecation warnings. [David S. Batista]
- Create pull_request_template.md. [David S. Batista]
- Upgrading dev tools versions. [David S. Batista]
- Initial import. [David S. Batista]
- Adding scenario type for summary report. [David S. Batista]
- Update README.md. [David S. Batista]
- Updating README.MD. [David S. Batista]
- Removing unused variable. [David S. Batista]
- Update src/nervaluate/reporting.py. [Copilot, David S. Batista]
- Update src/nervaluate/reporting.py. [Copilot, David S. Batista]
- Removing Makefile. [David S. Batista]
- Drafting CONTRIBUTE.md. [David S. Batista]
- Drafting CONTRIBUTE.md. [David S. Batista]
- Removing flake8. [David S. Batista]
- Removing old config files. [David S. Batista]
- Running on ubuntu, windows and macos. [David S. Batista]
- Reverting to ubuntu only. [David S. Batista]
- Adding new file. [David S. Batista]
- Removing old workflow file. [David S. Batista]
- Adding windows and macos to CI. [David S. Batista]
- Streamlining CI checks. [David S. Batista]
- Disabling old github workflow and triggering new one. [David S.
  Batista]
- Changing github workflow. [David S. Batista]
- Fixing linting and typing issues. [David S. Batista]
- Adding pytest-cov as dependency. [David S. Batista]
- Adding hatch as project manager; linting and typing. [David S.
  Batista]
- Fixing type hints. [David S. Batista]
- Wip. [David S. Batista]
- Adding docstrings. [David S. Batista]
- Adding more tests. [David S. Batista]
- Adding more tests. [David S. Batista]
- Adding docstrings and increasing test coverage. [David S. Batista]
- Removing requirements_dev.txt. [David S. Batista]
- Blackening for py311. [David S. Batista]
- Fixing pyprojec.toml dependencies. [David S. Batista]
- Fixing pyprojec.toml dependencies. [David S. Batista]
- Fixing pyprojec.toml dependencies. [David S. Batista]
- Fixing pyprojec.toml dependencies. [David S. Batista]
- Fixing pyprojec.toml dependencies. [David S. Batista]
- Refactor: move dev dependencies to pyproject.toml and update CI
  workflow. [David S. Batista]
- Adding wrongly removed pre-commit. [David S. Batista]
- Fixing type hints. [David S. Batista]
- Removing unused imports and mutuable default arguments. [David S.
  Batista]
- Update README.md. [Tim Miller]
- Update README.md. [adgianv]
- Update README.md - change the pdf link. [adgianv]
- Added type annotations to functions. [angelo-digian]
- Pandas version downgraded to 2.0.1 because incompatible with python
  version. [angelo-digian]
- Fixed pandas version to 2.2.1. [angelo-digian]
- Add pandas as a dependency in pyproject.toml. [angelo-digian]
- Adding pandas in the requirements file. [angelo-digian]
- Update tests/test_evaluator.py. [David S. Batista]
- Modified results_to_df method and added test. [angelo-digian]
- Expanded evaluator class: added method to return results of the nested
  dictionary as a dataframe. [angelo-digian]


0.2.0 (2024-04-10)
------------------

New
~~~
- Add pre-commit. [Matthew Upson]
- Add CITATION.cff file. [Matthew Upson]
- Upload artefacts to codecov. [Matthew Upson]
- Run tests on windows instance. [Matthew Upson]

Changes
~~~~~~~
- Add codecov config. [Matthew Upson]
- Remove .travis.yml. [Matthew Upson]
- Update tox.ini. [Matthew Upson]
- Update versions to test. [Matthew Upson]
- Add tox tests as github action. [Matthew Upson]

Fix
~~~
- Grant write permission to CICD workflow. [Matthew Upson]
- Run on windows and linux matrix. [Matthew Upson]

Other
~~~~~
- Updates README to reflect new functionality. [Jack Boylan]
- Removes extra 'indices' printed. [Jack Boylan]
- Bump black from 23.3.0 to 24.3.0. [dependabot[bot]]

  Bumps [black](https://github.com/psf/black) from 23.3.0 to 24.3.0.
  - [Release notes](https://github.com/psf/black/releases)
  - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
  - [Commits](https://github.com/psf/black/compare/23.3.0...24.3.0)

  ---
  updated-dependencies:
  - dependency-name: black
    dependency-type: direct:development
  ...
- Fixed Typo in README. [Giovanni Casari]
- Reformats quotes in `test_nervaluate.py` [Jack Boylan]
- Initial import. [David S. Batista]
- Handles case when `predictions` is empty. [Jack Boylan]
- Adds unit tests for evaluation indices output. [Jack Boylan]
- Adds summary print functions for overall indices and per-entity
  indices results. [Jack Boylan]
- Adds `within_instance_index` to evaluation indices outputs. [Jack
  Boylan]
- Ensures compatibility with existing unit tests. [Jack Boylan]
- Adheres to code quality checks. [Jack Boylan]
- Adds more descriptive variable names. [Jack Boylan]
- Adds correct indices to result indices output. [Jack Boylan]
- Moves evaluation indices to separate data structures. [Jack Boylan]
- Adds index lists to output for examples with incorrect, partial,
  spurious, and missed entities. [Jack Boylan]
- Docs: fix typo "spurius" > "spurious" [DanShatford]
- Added test for issue #40. [g.casari]
- Solved issue #40. [g.casari]
- Update README.md. [David S. Batista]
- Cleaning README.MD. [David S. Batista]
- Attending PR comments. [David S. Batista]
- Fixing links on README.MD. [David S. Batista]
- Updating pyproject.toml. [David S. Batista]
- Updating pyproject.toml. [David S. Batista]
- Updating README.MD and bumping version to 0.2.0. [David S. Batista]
- Updating README.MD. [David S. Batista]
- Reverting to Python 3.8. [David S. Batista]
- Adding some badges to the README. [David S. Batista]
- Initial commit. [David S. Batista]
- Wip: adding poetry. [David S. Batista]
- Full working example. [David S. Batista]
- Nit. [David S. Batista]
- Wip: adding summary report and examples. [David S. Batista]
- Wip: adding summary report and examples. [David S. Batista]
- Wip: adding summary report and examples. [David S. Batista]
- Wip: adding summary report and examples. [David S. Batista]
- Wip: adding summary report and examples. [David S. Batista]
- Wip: adding summary report. [David S. Batista]
- Wip: adding summary report. [David S. Batista]
- Removed codecov from requirements.txt. [David S. Batista]
- Removing duplicated code and fixing type hit. [David S. Batista]
- Updated Makefile: install package in editable mode. [David S. Batista]
- Updated name. [David S. Batista]
- Minimum version Python 3.8. [David S. Batista]
- Fixing Makefile and pre-commit. [David S. Batista]
- Adding DS_Store and .idea to gitignore. [David S. Batista]
- Updating Makefile. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- WIP: pre-commit. [David S. Batista]
- Fixing types. [David S. Batista]
- Finished adding type hints, some were skipped, code needs refactoring.
  [David S. Batista]
- WIP: adding type hints. [David S. Batista]
- WIP: adding type hints. [David S. Batista]
- WIP: adding type hints. [David S. Batista]
- WIP: adding type hints. [David S. Batista]
- Adding some execptions, code needs refactoring. [David S. Batista]
- Fixing pyling and flake8 issues. [David S. Batista]
- Replaced setup.py with pyproject.toml. [David S. Batista]
- Reverting utils import. [David S. Batista]
- Fixing types and wrappint at 120 characters. [David S. Batista]
- Update CITATION.cff. [David S. Batista]

  updating orcid
- Fix recall formula readme. [fgh95]
- Update LICENSE. [ivyleavedtoadflax]
- Update LICENSE. [ivyleavedtoadflax]
- Delete .python-version. [ivyleavedtoadflax]


0.1.8 (2020-10-16)
------------------

New
~~~
- Add test for whole span length entities (see #32) [Matthew Upson]
- Summarise blog post in README. [Matthew Upson]

Changes
~~~~~~~
- Bump version in setup.py. [Matthew Upson]
- Update CHANGELOG (#36) [ivyleavedtoadflax]
- Fix tests to match #32. [Matthew Upson]

Fix
~~~
- Correct catch sequence of just one entity. [Matthew Upson]

  Incorporate edits in #28 but includes tests.

Other
~~~~~
- Add code coverage. [ivyleavedtoadflax]
- Crucial fixes for evaluation. [Alex Flückiger]
- Update utils.py. [ivyleavedtoadflax]

  Tiny change to kick off CI
- Fix to catch last entites Small change to catch entities that go up
  until last character when there is no tag. [pim]


0.1.7 (2019-12-07)
------------------

New
~~~
- Add tests. [Matthew Upson]

  * Linting
  * Rename existing tests to disambiguate
- Add loaders to nervaluate. [Matthew Upson]

  * Add list and conll formats

Changes
~~~~~~~
- Update README. [Matthew Upson]

Fix
~~~
- Issue with setup.py. [Matthew Upson]

  * Add docstring to __version__.py


0.1.6 (2019-12-07)
------------------

New
~~~
- Add gitchangelog and Makefile recipe. [Matthew Upson]

Changes
~~~~~~~
- Bump version to 0.1.6. [Matthew Upson]
- Remove examples. [Matthew Upson]

  These are not accessible from the package in any case.
- Add dev requirements. [Matthew Upson]


0.1.5 (2019-12-06)
------------------

Changes
~~~~~~~
- Bump version to 0.1.5. [Matthew Upson]
- Update setup.py. [Matthew Upson]
- Update package url to point at pypi. [Matthew Upson]


0.1.4 (2019-12-06)
------------------

New
~~~
- Add dist to .gitignore. [Matthew Upson]
- Create pypi friendly README/long description. [Matthew Upson]
- Clean entity dicts of extraneous keys. [Matthew Upson]

  * Failing to do this can cause problems in evaluations
  * Add tests

Changes
~~~~~~~
- Bump version to 0.1.4. [Matthew Upson]
- Make setup.py pypi compliant. [Matthew Upson]


0.1.2 (2019-12-04)
------------------

New
~~~
- Add missing prodigy format tests. [Matthew Upson]
- Pass argument when using list. [Matthew Upson]
- Setup module structure. [Matthew Upson]
- Add get_tags() and tests. [Matthew Upson]

  Adds function to extract all the NER tags from a list of sentences.
- Add Evaluator class. [Matthew Upson]

  * Add some logging statements
  * Add input checks on number of documents and tokens per document
  * Allow target labels to be passed as argument to compute_metrics. Note
      that if a label is predicted and it is not in this list, then it
      will be classed as spurious for the aggregated scores, and on each
      entity level result (because it is unclear where the spurious value
      should be applied, it is applied to all)
  * linting
  * Add many new tests
- Don't evaluate precision and recall for each sentence. [Matthew Upson]

  Rather than automatically calculate precision and recall at the sentence
  level, this change adds a new function compute_precision_recall_wrapper
  which can be run after all the metrics whether for 1 document, or 1000,
  have been calculated. This has the benefit that we can reuse the same
  code for calculating precision/recall, and allows us to calculate entity
  level precision/recall if required.
- Calculate entity level score. [Matthew Upson]
- Add compute_actual_possible function. [Matthew Upson]
- Record results for each entity type. [Matthew Upson]
- Add scenario comments matching blog table. [Matthew Upson]
- Test results at individual entity level. [Matthew Upson]
- Add .gitinore file. [Matthew Upson]
- Add requirements.txt. [Matthew Upson]

Changes
~~~~~~~
- Bump version to 0.1.2. [Matthew Upson]
- Bump version number to 0.1.1. [Matthew Upson]
- Reduce logging verbosity. [ivyleavedtoadflax]
- Add example to README.md. [Matthew Upson]
- Create virtualenv recipe. [Matthew Upson]

  * Move example dependencies to requirements_example.txt
  * Add virtualenv recipe to Makefile
  * Update .gitignore
- Remove unused dependencies. [Matthew Upson]

  * Dependencies for the examples should not be included in setup.py, instead
  move them to requirements_examples.txt
- Update example notebook. [Matthew Upson]
- Remove unwanted tags from pred_named_entities. [Matthew Upson]
- Remove superfluous get_tags() function. [Matthew Upson]
- Update notebook. [Matthew Upson]
- Update notebook. [Matthew Upson]
- Update tests. [Matthew Upson]
- Update .gitignore. [Matthew Upson]
- Replace spurius with spurious. [Matthew Upson]
- Update README with requirements and test info. [Matthew Upson]
- Update setup.cfg with source and omit paths. [Matthew Upson]
- Use pytest instead of unittest. [Matthew Upson]

Other
~~~~~
- Revert "Remove tox and use pytest" [Matthew Upson]

  * Better to keep tox for local testing in the Makefile and resolve
    issues running tox on the developers machine.

  This reverts commit 8578795e62ca384adf054c1b85a1c1d7f0d089d5.
- Remove tox and use pytest. [Elizabeth Gallagher]
- Add f1 output to nervaluate and update all tests. [Elizabeth
  Gallagher]
- Update .travis.yml. [ivyleavedtoadflax]
- Update README.md. [Matt Upson]
- Build(deps): bump nltk from 3.4.4 to 3.4.5. [dependabot[bot]]

  Bumps [nltk](https://github.com/nltk/nltk) from 3.4.4 to 3.4.5.
  - [Release notes](https://github.com/nltk/nltk/releases)
  - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
  - [Commits](https://github.com/nltk/nltk/compare/3.4.4...3.4.5)
- Update __version__.py. [Matt Upson]
- PEPed8 things a bit. [David Soares Batista]
- Update README.md. [David S. Batista]
- Update README.md. [David S. Batista]
- Notebook. [David Soares Batista]
- Updated notebook. [David Soares Batista]
- Update README.md. [David S. Batista]
- Update README.md. [David S. Batista]
- Renamed notebook. [David Soares Batista]
- Bug fixing. [David Soares Batista]
- Test. [David Soares Batista]
- Typo in comment. [David Soares Batista]
- Use find_overlap to find all overlap cases. [Matthew Upson]

  Adds the find_overlap function which captures the three possible overlap
  scenarios (Total, Start, and End). This is examplained in graph below.

  Character Offset:   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
  True:               |   |   |   |LOC|LOC|LOC|LOC|LOC|   |   |
  Total Overlap:      |   |   |LOC|LOC|LOC|LOC|LOC|LOC|LOC|   |
  Start Overlap:      |   |   |LOC|LOC|LOC|   |   |   |   |   |
  End Overlap:        |   |   |   |   |   |   |LOC|LOC|LOC|   |
- Removed debug stamt. [David Soares Batista]
- Added partial and exact evaluation and tests. [David Soares Batista]
- Update. [David Soares Batista]
- Updated README. [David Soares Batista]
- - fixed bugs and added tests - added pytest. [David Soares Batista]
- Update ner_evaluation.py. [David S. Batista]
- Redefined evaluation according to discussion here:
  https://github.com/davidsbatista/NER-Evaluation/issues/2. [David
  Soares Batista]
- Fixed a BUG in collect_named_entites() issued by
  rjlotok.dblma@gmail.com. [David Soares Batista]
- Update README.md. [David S. Batista]
- Update README.md. [David S. Batista]
- Major refactoring. [David Soares Batista]
- Create README.md. [David S. Batista]
- Initial import. [David Soares Batista]
- Initial commit. [David S. Batista]




================================================
FILE: CITATION.cff
================================================
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "nervaluate"
date-released: 2026-03-12
url: "https://github.com/mantisnlp/nervaluate"
version: 1.2.1
authors:
- family-names: "Batista"
  given-names: "David"
  orcid: "https://orcid.org/0000-0002-9324-5773"
- family-names: "Upson"
  given-names: "Matthew Antony"
  orcid: "https://orcid.org/0000-0002-1040-8048"





================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to `nervaluate`

Thank you for your interest in contributing to `nervaluate`! This document provides guidelines and instructions for contributing to the project.

## Development Setup

1. Fork the repository
2. Clone your fork:
   ```bash
   git clone https://github.com/your-username/nervaluate.git
   cd nervaluate
   ```
3. Make sure you have hatch installed, then create a virtual environment:
   # ToDo
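
   One possible setup, while this step is still marked as a ToDo, uses the `dev` environment defined in
   `pyproject.toml`; the commands below mirror the CI workflow and may change once this section is finalised:

   ```bash
   pip install hatch
   hatch env create dev       # create the dev environment from pyproject.toml
   hatch -e dev run lint      # black + pylint
   hatch -e dev run typing    # mypy
   hatch -e dev run test      # pytest with coverage
   ```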

## Adding Tests

`nervaluate` uses pytest for testing. Here are the guidelines for adding tests:

1. All new features and bug fixes should include tests
2. Tests should be placed in the `tests/` directory
3. Test files should be named `test_*.py`
4. Test functions should be named `test_*`
5. Use pytest fixtures when appropriate for test setup and teardown
6. Run tests locally before submitting a pull request:
   ```bash
   hatch -e dev run test
   ```
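
For example, a minimal test might look like the sketch below; the test name and assertions are illustrative and rely
only on the `Evaluator` usage shown in the README:

```python
from nervaluate.evaluator import Evaluator


def test_summary_report_mentions_all_scenarios():
    # one sentence with a single PER entity, predicted correctly
    true = [["O", "B-PER", "I-PER", "O"]]
    pred = [["O", "B-PER", "I-PER", "O"]]
    evaluator = Evaluator(true, pred, tags=["PER"], loader="list")
    report = evaluator.summary_report()
    for scenario in ("ent_type", "exact", "partial", "strict"):
        assert scenario in report
```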


## Changelog Management

`nervaluate` uses gitchangelog to maintain the CHANGELOG.rst file. Here's how to use it:

1. Make your changes in a new branch
2. Write your commit messages following these conventions:
   - Use present tense ("Add feature" not "Added feature")
   - Use imperative mood ("Move cursor to..." not "Moves cursor to...")
   - Limit the first line to 72 characters or less
   - Reference issues and pull requests liberally after the first line

3. The commit message format should be:
   ```
   type(scope): subject

   body
   ```

   Where type can be:
   - feat: A new feature
   - fix: A bug fix
   - docs: Documentation changes
   - style: Changes that do not affect the meaning of the code
   - refactor: A code change that neither fixes a bug nor adds a feature
   - perf: A code change that improves performance
   - test: Adding missing tests or correcting existing tests
   - chore: Changes to the build process or auxiliary tools

4. After committing your changes, you can generate the changelog:
   ```bash
   gitchangelog > CHANGELOG.rst
   ```

## Pull Request Process

1. Update the README.md with details of changes if needed
2. Update the CHANGELOG.rst using gitchangelog
3. The PR will be merged once you have the sign-off of at least one other developer
4. Make sure all tests pass and there are no linting errors

## Code Style

- Follow PEP 8 guidelines
- Use type hints

## Questions?

Feel free to open an issue if you have any questions about contributing to `nervaluate`. 

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 David S. Batista and Matthew A. Upson

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
[![python](https://img.shields.io/badge/Python-3.11-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
&nbsp;
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
&nbsp;
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
&nbsp;
![GitHub](https://img.shields.io/github/license/ivyleavedtoadflax/nervaluate)
&nbsp;
![Pull Requests Welcome](https://img.shields.io/badge/pull%20requests-welcome-brightgreen.svg)
&nbsp;
![PyPI](https://img.shields.io/pypi/v/nervaluate)

# nervaluate

`nervaluate` is a module for evaluating Named Entity Recognition (NER) models, as defined in SemEval-2013 Task 9.1.

The evaluation metrics output by nervaluate go beyond a simple token/tag-based schema, and consider different scenarios 
based on whether all the tokens that belong to a named entity were recognised, and also whether the correct 
entity type was assigned.

This full problem is described in detail in the [original blog](http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) 
post by [David Batista](https://github.com/davidsbatista), and this package extends the code in the [original repository](https://github.com/davidsbatista/NER-Evaluation) 
which accompanied the blog post.

The code draws heavily on the papers:

* [SemEval-2013 Task 9 : Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013)](https://www.aclweb.org/anthology/S13-2056)

* [SemEval-2013 Task 9.1 - Evaluation Metrics](https://davidsbatista.net/assets/documents/others/semeval_2013-task-9_1-evaluation-metrics.pdf)

# Usage example

```
pip install nervaluate
```

One possible input format is nested lists of NER labels, where each inner list corresponds to a sentence and each element is a token-level label.
Initialize the `Evaluator` class with the true and predicted labels, and specify the entity types to evaluate.

```python
from nervaluate.evaluator import Evaluator

true = [
    ['O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG'],  # "The John Smith who works at Google Inc"
    ['O', 'B-LOC', 'B-PER', 'I-PER', 'O', 'O', 'B-DATE'],      # "In Paris Marie Curie lived in 1895"
]
  
pred = [
    ['O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG'],
    ['O', 'B-LOC', 'I-LOC', 'B-PER', 'O', 'O', 'B-DATE'],
]
   
evaluator = Evaluator(true, pred, tags=['PER', 'ORG', 'LOC', 'DATE'], loader="list")
```

Print the summary report for the evaluation, which will show the metrics for each entity type and evaluation scenario:

```python

print(evaluator.summary_report())

Scenario: all

              correct   incorrect     partial      missed    spurious   precision      recall    f1-score

ent_type            5           0           0           0           0        1.00        1.00        1.00
   exact            2           3           0           0           0        0.40        0.40        0.40
 partial            2           0           3           0           0        0.70        0.70        0.70
  strict            2           3           0           0           0        0.40        0.40        0.40
```  

or aggregated by entity type under a specific evaluation scenario:

```python
print(evaluator.summary_report(mode='entities'))  
  
Scenario: strict

             correct   incorrect     partial      missed    spurious   precision      recall    f1-score

   DATE            1           0           0           0           0        1.00        1.00        1.00
    LOC            0           1           0           0           0        0.00        0.00        0.00
    ORG            1           0           0           0           0        1.00        1.00        1.00
    PER            0           2           0           0           0        0.00        0.00        0.00
```

# Evaluation Scenarios

## Token level evaluation for NER is too simplistic

When running machine learning models for NER, it is common to report metrics at the individual token level. This may 
not be the best approach, since a named entity can span multiple tokens, so evaluation at the full-entity level is 
desirable.

When comparing the gold standard annotations with the output of a NER system, different scenarios may occur:

__I. Surface string and entity type match__

| Token | Gold  | Prediction |
|-------|-------|------------|
| in    | O     | O          |
| New   | B-LOC | B-LOC      |
| York  | I-LOC | I-LOC      |
| .     | O     | O          |

__II. System hypothesized an incorrect entity__

| Token    | Gold | Prediction |
|----------|------|------------|
| an       | O    | O          |
| Awful    | O    | B-ORG      |
| Headache | O    | I-ORG      |
| in       | O    | O          |

__III. System misses an entity__

| Token | Gold  | Prediction |
|-------|-------|------------|
| in    | O     | O          |
| Palo  | B-LOC | O          |
| Alto  | I-LOC | O          |
| ,     | O     | O          |

Based on these three scenarios we have a simple classification evaluation that can be measured in terms of true 
positives, false positives and false negatives, from which we can then compute precision, recall and 
F1-score for each named-entity type.

However, this simple schema ignores the possibility of partial matches, and of scenarios where the NER system gets 
the named-entity surface string correct but the type wrong. We might also want to evaluate these scenarios at a 
full-entity level.

For example:

__IV. System identifies the surface string but assigns the wrong entity type__

| Token | Gold  | Prediction |
|-------|-------|------------|
| I     | O     | O          |
| live  | O     | O          |
| in    | O     | O          |
| Palo  | B-LOC | B-ORG      |
| Alto  | I-LOC | I-ORG      |
| ,     | O     | O          |

__V. System gets the boundaries of the surface string wrong__

| Token   | Gold  | Prediction |
|---------|-------|------------|
| Unless  | O     | B-PER      |
| Karl    | B-PER | I-PER      |
| Smith   | I-PER | I-PER      |
| resigns | O     | O          |

__VI. System gets the boundaries and entity type wrong__

| Token   | Gold  | Prediction |
|---------|-------|------------|
| Unless  | O     | B-ORG      |
| Karl    | B-PER | I-ORG      |
| Smith   | I-PER | I-ORG      |
| resigns | O     | O          |


## Defining evaluation metrics

How can we incorporate these described scenarios into evaluation metrics? See the [original blog](http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/) 
for a great explanation, a summary is included here.

We can define the following five categories to capture the different types of errors:

| Error type      | Explanation                                                              |
|-----------------|--------------------------------------------------------------------------|
| Correct (COR)   | both are the same                                                        |
| Incorrect (INC) | the output of a system and the golden annotation don’t match             |
| Partial (PAR)   | system and the golden annotation are somewhat “similar” but not the same |
| Missing (MIS)   | a golden annotation is not captured by a system                          |
| Spurious (SPU)  | system produces a response which doesn’t exist in the golden annotation  |

These five categories can be counted under four different evaluation schemas:

| Evaluation schema | Explanation                                                                       |
|-------------------|-----------------------------------------------------------------------------------|
| Strict            | exact boundary surface string match and entity type                               |
| Exact             | exact boundary match over the surface string, regardless of the type              |
| Partial           | partial boundary match over the surface string, regardless of the type            |
| Type              | some overlap between the system tagged entity and the gold annotation is required |

These five error categories and four evaluation schemas interact in the following ways:

| Scenario | Gold entity | Gold string    | Pred entity | Pred string         | Type | Partial | Exact | Strict |
|----------|-------------|----------------|-------------|---------------------|------|---------|-------|--------|
| III      | BRAND       | tikosyn        |             |                     | MIS  | MIS     | MIS   | MIS    |
| II       |             |                | BRAND       | healthy             | SPU  | SPU     | SPU   | SPU    |
| V        | DRUG        | warfarin       | DRUG        | of warfarin         | COR  | PAR     | INC   | INC    |
| IV       | DRUG        | propranolol    | BRAND       | propranolol         | INC  | COR     | COR   | INC    |
| I        | DRUG        | phenytoin      | DRUG        | phenytoin           | COR  | COR     | COR   | COR    |
| VI       | GROUP       | contraceptives | DRUG        | oral contraceptives | INC  | PAR     | INC   | INC    |

Precision, recall and F1-score are then calculated for each evaluation schema. To do so, two more quantities need 
to be calculated:

```
POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP
```

We can then compute precision, recall and F1-score. Roughly speaking, precision is the percentage of named-entities 
found by the NER system that are correct, and recall is the percentage of the named-entities in the golden annotations 
that are retrieved by the NER system.

These are computed in two different ways depending on whether we want an exact match (i.e., strict and exact) or a 
partial match (i.e., partial and type) scenario:

__Exact Match (i.e., strict and exact)__
```
Precision = (COR / ACT) = TP / (TP + FP)
Recall = (COR / POS) = TP / (TP+FN)
```

__Partial Match (i.e., partial and type)__
```
Precision = (COR + 0.5 × PAR) / ACT
Recall = (COR + 0.5 × PAR) / POS
```

__Putting it all together:__

| Measure   | Type | Partial | Exact | Strict |
|-----------|------|---------|-------|--------|
| Correct   | 3    | 3       | 3     | 2      |
| Incorrect | 2    | 0       | 2     | 3      |
| Partial   | 0    | 2       | 0     | 0      |
| Missed    | 1    | 1       | 1     | 1      |
| Spurious  | 1    | 1       | 1     | 1      |
| Precision | 0.5  | 0.66    | 0.5   | 0.33   |
| Recall    | 0.5  | 0.66    | 0.5   | 0.33   |
| F1        | 0.5  | 0.66    | 0.5   | 0.33   |
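
The arithmetic behind this table can be reproduced with a short sketch. The helper below is illustrative and not part
of the `nervaluate` API; the counts are read off the table columns above (0.67 here corresponds to the truncated 0.66
in the table):

```python
def precision_recall_f1(cor, inc, par, mis, spu, partial_or_type=False):
    """Apply the formulas from the section above to a set of counts."""
    actual = cor + inc + par + spu      # ACT = TP + FP
    possible = cor + inc + par + mis    # POS = TP + FN
    numerator = cor + 0.5 * par if partial_or_type else cor
    precision = numerator / actual if actual else 0.0
    recall = numerator / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return round(precision, 2), round(recall, 2), round(f1, 2)


# (COR, INC, PAR, MIS, SPU) read off each column of the table:
print(precision_recall_f1(3, 2, 0, 1, 1, partial_or_type=True))  # Type    -> (0.5, 0.5, 0.5)
print(precision_recall_f1(3, 0, 2, 1, 1, partial_or_type=True))  # Partial -> (0.67, 0.67, 0.67)
print(precision_recall_f1(3, 2, 0, 1, 1))                        # Exact   -> (0.5, 0.5, 0.5)
print(precision_recall_f1(2, 3, 0, 1, 1))                        # Strict  -> (0.33, 0.33, 0.33)
```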


## Notes:

In scenarios IV and VI the entity type of `true` and `pred` does not match; in both cases we only scored against 
the true entity, not the predicted one. One could argue that the predicted entity should also be scored as spurious, 
but according to the definition of `spurious`:

* Spurious (SPU) : system produces a response which does not exist in the golden annotation;

In this case there exists an annotation, but with a different entity type, so we assume it's only incorrect.

For the **Type** (ent_type) strategy, if multiple true entities of the same label overlap a
prediction, the match is resolved by closest boundaries. This can change which
``(instance_index, entity_index)`` appears in ``missed_indices`` compared to list order,
while aggregate counts stay the same.


## Contributing to the `nervaluate` package

### Extending the package to accept more formats

The `Evaluator` accepts the following formats:

* Nested lists containing NER labels
* CoNLL style tab delimited strings
* [prodi.gy](https://prodi.gy) style lists of spans

Additional formats can easily be added by creating a new loader class in `nervaluate/loaders.py`. The loader class 
should inherit from the `DataLoader` base class and implement the `load` method.

The `load` method should return a list of entity lists, where each entity is represented as a dictionary 
with `label`, `start`, and `end` keys.

The new loader can then be added to the `_setup_loaders` method in the `Evaluator` class, and can be selected with the
 `loader` argument when instantiating the `Evaluator` class.
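
A minimal sketch of what such a loader might look like, assuming a hypothetical input format with one span list per
document (the class name and the input field names other than `label`, `start`, and `end` are illustrative, not part
of the package):

```python
from typing import Any, Dict, List


class MySpansLoader:  # in the package this would subclass the DataLoader base class in nervaluate/loaders.py
    """Convert records like {"spans": [{"tag": "PER", "begin": 1, "stop": 2}]} into
    the list-of-entity-lists format described above."""

    def load(self, data: List[Dict[str, Any]]) -> List[List[Dict[str, Any]]]:
        documents = []
        for record in data:
            documents.append(
                [
                    {"label": span["tag"], "start": span["begin"], "end": span["stop"]}
                    for span in record.get("spans", [])
                ]
            )
        return documents
```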

Here is a list of formats we intend to [include](https://github.com/MantisAI/nervaluate/issues/3).

### General Contributing

Improvements, new features and bug fixes are welcome. If you wish to participate in the development of `nervaluate`, 
please read the guidelines in the [CONTRIBUTING.md](CONTRIBUTING.md) file.

---

Give a ⭐️ if this project helped you!


================================================
FILE: examples/example_no_loader.py
================================================
import nltk
import sklearn_crfsuite
from sklearn.metrics import classification_report

from nervaluate import Evaluator, collect_named_entities, summary_report_ent, summary_report_overall


def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2],
    }
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update(
            {
                "-1:word.lower()": word1.lower(),
                "-1:word.istitle()": word1.istitle(),
                "-1:word.isupper()": word1.isupper(),
                "-1:postag": postag1,
                "-1:postag[:2]": postag1[:2],
            }
        )
    else:
        features["BOS"] = True

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update(
            {
                "+1:word.lower()": word1.lower(),
                "+1:word.istitle()": word1.istitle(),
                "+1:word.isupper()": word1.isupper(),
                "+1:postag": postag1,
                "+1:postag[:2]": postag1[:2],
            }
        )
    else:
        features["EOS"] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    return [token for token, postag, label in sent]


def main():
    print("Loading CoNLL 2002 NER Spanish data")
    nltk.corpus.conll2002.fileids()
    train_sents = list(nltk.corpus.conll2002.iob_sents("esp.train"))
    test_sents = list(nltk.corpus.conll2002.iob_sents("esp.testb"))

    x_train = [sent2features(s) for s in train_sents]
    y_train = [sent2labels(s) for s in train_sents]

    x_test = [sent2features(s) for s in test_sents]
    y_test = [sent2labels(s) for s in test_sents]

    print("Train a CRF on the CoNLL 2002 NER Spanish data")
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=10, all_possible_transitions=True)
    try:
        crf.fit(x_train, y_train)
    except AttributeError:
        pass

    y_pred = crf.predict(x_test)
    labels = list(crf.classes_)
    labels.remove("O")  # remove 'O' label from evaluation
    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))  # group B- and I- results
    y_test_flat = [y for msg in y_test for y in msg]
    y_pred_flat = [y for msg in y_pred for y in msg]
    print(classification_report(y_test_flat, y_pred_flat, labels=sorted_labels))

    test_sents_labels = []
    for sentence in test_sents:
        sentence = [token[2] for token in sentence]
        test_sents_labels.append(sentence)

    pred_collected = [collect_named_entities(msg) for msg in y_pred]
    test_collected = [collect_named_entities(msg) for msg in y_test]

    evaluator = Evaluator(test_collected, pred_collected, ["LOC", "MISC", "PER", "ORG"])
    results, results_agg = evaluator.evaluate()

    print("\n\nOverall")
    print(summary_report_overall(results))
    print("\n\n'Strict'")
    print(summary_report_ent(results_agg, scenario="strict"))
    print("\n\n'Ent_Type'")
    print(summary_report_ent(results_agg, scenario="ent_type"))
    print("\n\n'Partial'")
    print(summary_report_ent(results_agg, scenario="partial"))
    print("\n\n'Exact'")
    print(summary_report_ent(results_agg, scenario="exact"))


if __name__ == "__main__":
    main()


================================================
FILE: examples/run_example.sh
================================================
#!/bin/bash

pip install nltk
pip install scikit-learn
pip install sklearn_crfsuite
python -m nltk.downloader conll2002
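# nervaluate itself must already be installed, e.g. with `pip install nervaluate`
# or `pip install -e .` from the repository root.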
python example_no_loader.py


================================================
FILE: pyproject.toml
================================================
[build-system]
requires = ["setuptools", "setuptools-scm"]
build-backend = "setuptools.build_meta"

[project]
name = "nervaluate"
version = "1.2.1"
authors = [
    { name="David S. Batista"},
    { name="Matthew Upson"}
]
description = "NER evaluation considering partial match scoring"
readme = "README.md"
requires-python = ">=3.11"
keywords = ["named-entity-recognition", "ner", "evaluation-metrics", "partial-match-scoring", "nlp"]
license = {text = "MIT License"}
classifiers = [
    "Programming Language :: Python :: 3",
    "Operating System :: OS Independent"
]

[project.optional-dependencies]
dev = [
    "black>=25.1.0",
    "coverage>=7.8.0",
    "gitchangelog",
    "mypy>=1.15.0",
    "pre-commit==3.3.1",
    "pylint>=3.3.7",
    "pytest>=8.3.5",
    "pytest-cov>=6.1.1",
]

[project.urls]
"Homepage" = "https://github.com/MantisAI/nervaluate"
"Bug Tracker" = "https://github.com/MantisAI/nervaluate/issues"

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
addopts = "--cov=nervaluate --cov-report=term-missing"

[tool.coverage.run]
source = ["nervaluate"]
omit = ["*__init__*"]

[tool.coverage.report]
show_missing = true
precision = 2
sort = "Miss"

[tool.black]
line-length = 120
target-version = ["py311"]

[tool.pylint.messages_control]
disable = [
    "C0111",  # missing-docstring
    "C0103",  # invalid-name
    "W0511",  # fixme
    "W0603",  # global-statement
    "W1202",  # logging-format-interpolation
    "W1203",  # logging-fstring-interpolation
    "E1126",  # invalid-sequence-index
    "E1137",  # invalid-slice-index
    "I0011",  # bad-option-value
    "I0020",  # bad-option-value
    "R0801",  # duplicate-code
    "W9020",  # bad-option-value
    "W0621",  # redefined-outer-name
    "W0212",  # protected-access
]

[tool.pylint.'DESIGN']
max-args = 38           # Default is 5
max-attributes = 28     # Default is 7
max-branches = 14       # Default is 12
max-locals = 45         # Default is 15
max-module-lines = 2468 # Default is 1000
max-nested-blocks = 9   # Default is 5
max-statements = 206    # Default is 50
min-public-methods = 1  # Allow classes with just one public method

[tool.pylint.format]
max-line-length = 120

[tool.pylint.basic]
accept-no-param-doc = true
accept-no-raise-doc = true
accept-no-return-doc = true
accept-no-yields-doc = true
default-docstring-type = "numpy"

[tool.pylint.master]
load-plugins = ["pylint.extensions.docparams"]
ignore-paths = ["./examples/.*"]

[tool.mypy]
python_version = "3.11"
ignore_missing_imports = true
disallow_any_unimported = true
disallow_untyped_defs = true
warn_redundant_casts = true
warn_unused_ignores = true
warn_unused_configs = true

[[tool.mypy.overrides]]
module = "examples.*"
follow_imports = "skip"

[tool.hatch.envs.dev]
dependencies = [
    "black==24.3.0",
    "coverage==7.2.5",
    "gitchangelog",
    "mypy==1.3.0",
    "pre-commit==3.3.1",
    "pylint==2.17.4",
    "pytest==7.3.1",
    "pytest-cov==4.1.0",
]

[tool.hatch.envs.dev.scripts]
lint = [
    "black -t py311 -l 120 src tests",
    "pylint src tests"
]
typing = "mypy src"
test = "pytest"
clean = "rm -rf dist src/nervaluate.egg-info .coverage .mypy_cache .pytest_cache"
changelog = "gitchangelog > CHANGELOG.rst"
all = [
    "clean",
    "lint",
    "typing",
    "test"
]
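
# Illustrative note (not part of the original configuration): with hatch installed,
# the scripts above can be run as e.g. `hatch run dev:lint`, `hatch run dev:typing`,
# `hatch run dev:test` or `hatch run dev:all`.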


================================================
FILE: src/nervaluate/__init__.py
================================================
from .evaluator import Evaluator
from .utils import collect_named_entities, conll_to_spans, list_to_spans, split_list


================================================
FILE: src/nervaluate/entities.py
================================================
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    """Represents a named entity with its position and label."""

    label: str
    start: int
    end: int

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Entity):
            return NotImplemented
        return self.label == other.label and self.start == other.start and self.end == other.end

    def __hash__(self) -> int:
        return hash((self.label, self.start, self.end))


@dataclass
class EvaluationResult:
    """Represents the evaluation metrics for a single entity type or overall."""

    correct: int = 0
    incorrect: int = 0
    partial: int = 0
    missed: int = 0
    spurious: int = 0
    precision: float = 0.0
    recall: float = 0.0
    f1: float = 0.0
    actual: int = 0
    possible: int = 0

    def compute_metrics(self, partial_or_type: bool = False) -> None:
        """Compute precision, recall and F1 score."""
        self.actual = self.correct + self.incorrect + self.partial + self.spurious
        self.possible = self.correct + self.incorrect + self.partial + self.missed

        if partial_or_type:
            precision = (self.correct + 0.5 * self.partial) / self.actual if self.actual > 0 else 0
            recall = (self.correct + 0.5 * self.partial) / self.possible if self.possible > 0 else 0
        else:
            precision = self.correct / self.actual if self.actual > 0 else 0
            recall = self.correct / self.possible if self.possible > 0 else 0

        self.precision = precision
        self.recall = recall
        self.f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
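
    # Illustrative worked example (not part of the library): with correct=5,
    # incorrect=2, partial=1, missed=1 and spurious=1, both actual and possible
    # come out as 5 + 2 + 1 + 1 = 9.  The strict formula then gives
    # precision = recall = 5 / 9, while the partial/type formula gives
    # (5 + 0.5 * 1) / 9, matching tests/test_entities.py:
    #
    #     result = EvaluationResult(correct=5, incorrect=2, partial=1, missed=1, spurious=1)
    #     result.compute_metrics(partial_or_type=True)
    #     assert result.precision == 5.5 / 9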


@dataclass
class EvaluationIndices:
    """Represents the indices of entities in different evaluation categories."""

    correct_indices: List[Tuple[int, int]] = None  # type: ignore
    incorrect_indices: List[Tuple[int, int]] = None  # type: ignore
    partial_indices: List[Tuple[int, int]] = None  # type: ignore
    missed_indices: List[Tuple[int, int]] = None  # type: ignore
    spurious_indices: List[Tuple[int, int]] = None  # type: ignore

    def __post_init__(self) -> None:
        if self.correct_indices is None:
            self.correct_indices = []
        if self.incorrect_indices is None:
            self.incorrect_indices = []
        if self.partial_indices is None:
            self.partial_indices = []
        if self.missed_indices is None:
            self.missed_indices = []
        if self.spurious_indices is None:
            self.spurious_indices = []


================================================
FILE: src/nervaluate/evaluator.py
================================================
from typing import List, Dict, Any, Union, Optional
import csv
import io

from .entities import EvaluationResult, EvaluationIndices
from .strategies import (
    EvaluationStrategy,
    StrictEvaluation,
    PartialEvaluation,
    EntityTypeEvaluation,
    ExactEvaluation,
)
from .loaders import DataLoader, ConllLoader, ListLoader, DictLoader
from .entities import Entity


class Evaluator:
    """Main evaluator class for NER evaluation."""

    def __init__(
        self, true: Any, pred: Any, tags: List[str], loader: str = "default", min_overlap_percentage: float = 1.0
    ) -> None:
        """
        Initialize the evaluator.

        Args:
            true: True entities in any supported format
            pred: Predicted entities in any supported format
            tags: List of valid entity tags
            loader: Name of the loader to use
            min_overlap_percentage: Minimum overlap percentage for partial matches (1-100)
        """
        self.tags = tags
        self.min_overlap_percentage = min_overlap_percentage
        self._setup_loaders()
        self._load_data(true, pred, loader)
        self._setup_evaluation_strategies()

    def _setup_loaders(self) -> None:
        """Setup available data loaders."""
        self.loaders: Dict[str, DataLoader] = {"conll": ConllLoader(), "list": ListLoader(), "dict": DictLoader()}

    def _setup_evaluation_strategies(self) -> None:
        """Setup evaluation strategies with overlap threshold."""
        self.strategies: Dict[str, EvaluationStrategy] = {
            "strict": StrictEvaluation(self.min_overlap_percentage),
            "partial": PartialEvaluation(self.min_overlap_percentage),
            "ent_type": EntityTypeEvaluation(self.min_overlap_percentage),
            "exact": ExactEvaluation(self.min_overlap_percentage),
        }

    def _load_data(self, true: Any, pred: Any, loader: str) -> None:
        """Load the true and predicted data."""
        if loader == "default":
            # Try to infer the loader based on input type
            if isinstance(true, str):
                loader = "conll"
            elif isinstance(true, list) and true and isinstance(true[0], list):
                if isinstance(true[0][0], dict):
                    loader = "dict"
                else:
                    loader = "list"
            else:
                raise ValueError("Could not infer loader from input type")

        if loader not in self.loaders:
            raise ValueError(f"Unknown loader: {loader}")

        # For list loader, check document lengths before loading
        if loader == "list":
            if len(true) != len(pred):
                raise ValueError("Number of predicted documents does not equal true")

            # Check that each document has the same length
            for i, (true_doc, pred_doc) in enumerate(zip(true, pred)):
                if len(true_doc) != len(pred_doc):
                    raise ValueError(f"Document {i} has different lengths: true={len(true_doc)}, pred={len(pred_doc)}")

        self.true = self.loaders[loader].load(true)
        self.pred = self.loaders[loader].load(pred)

        if len(self.true) != len(self.pred):
            raise ValueError("Number of predicted documents does not equal true")
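
    # Illustrative sketch (not part of the class) of the three input formats that
    # _load_data above can infer, mirroring the loaders in loaders.py:
    #
    #     # "conll": tab-separated token/tag lines, blank line between documents
    #     true_conll = "Alex\tB-PER\nvisited\tO\n\nACME\tB-ORG\nCorp\tI-ORG"
    #     # "list": one list of BIO tags per document
    #     true_list = [["B-PER", "O"], ["B-ORG", "I-ORG"]]
    #     # "dict": one list of span dictionaries per document
    #     true_dict = [[{"label": "PER", "start": 0, "end": 0}], [{"label": "ORG", "start": 0, "end": 1}]]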

    def evaluate(self) -> Dict[str, Any]:
        """
        Run the evaluation.

        Returns:
            Dictionary containing evaluation results for each strategy and entity type
        """
        results = {}
        # Get unique tags that appear in either true or predicted data
        used_tags = set()  # type: ignore
        for doc in self.true:
            used_tags.update(e.label for e in doc)
        for doc in self.pred:
            used_tags.update(e.label for e in doc)
        # Only keep tags that are both used and in the allowed tags list
        used_tags = used_tags.intersection(set(self.tags))

        entity_results: Dict[str, Dict[str, EvaluationResult]] = {tag: {} for tag in used_tags}
        indices = {}
        entity_indices: Dict[str, Dict[str, EvaluationIndices]] = {tag: {} for tag in used_tags}

        # Evaluate each document
        for doc_idx, (true_doc, pred_doc) in enumerate(zip(self.true, self.pred)):
            # Filter entities by valid tags
            true_doc = [e for e in true_doc if e.label in self.tags]
            pred_doc = [e for e in pred_doc if e.label in self.tags]

            # Evaluate with each strategy
            for strategy_name, strategy in self.strategies.items():
                result, doc_indices = strategy.evaluate(true_doc, pred_doc, self.tags, doc_idx)

                # Update overall results
                if strategy_name not in results:
                    results[strategy_name] = result
                    indices[strategy_name] = doc_indices
                else:
                    self._merge_results(results[strategy_name], result, strategy_name)
                    self._merge_indices(indices[strategy_name], doc_indices)

                # Update entity-specific results
                for tag in used_tags:
                    # Filter entities for this specific tag
                    true_tag_doc = [e for e in true_doc if e.label == tag]
                    pred_tag_doc = [e for e in pred_doc if e.label == tag]

                    # Evaluate only entities of this tag
                    tag_result, tag_indices = strategy.evaluate(true_tag_doc, pred_tag_doc, [tag], doc_idx)

                    if tag not in entity_results:
                        entity_results[tag] = {}
                        entity_indices[tag] = {}
                    if strategy_name not in entity_results[tag]:
                        entity_results[tag][strategy_name] = tag_result
                        entity_indices[tag][strategy_name] = tag_indices
                    else:
                        self._merge_results(entity_results[tag][strategy_name], tag_result, strategy_name)
                        self._merge_indices(entity_indices[tag][strategy_name], tag_indices)

        return {
            "overall": results,
            "entities": entity_results,
            "overall_indices": indices,
            "entity_indices": entity_indices,
        }

    @staticmethod
    def _merge_results(
        target: EvaluationResult, source: EvaluationResult, strategy_name: str
    ) -> None:
        """Merge two evaluation results."""
        target.correct += source.correct
        target.incorrect += source.incorrect
        target.partial += source.partial
        target.missed += source.missed
        target.spurious += source.spurious
        use_partial_formula = strategy_name in ("partial", "ent_type")
        target.compute_metrics(partial_or_type=use_partial_formula)

    @staticmethod
    def _merge_indices(target: EvaluationIndices, source: EvaluationIndices) -> None:
        """Merge two evaluation indices."""
        target.correct_indices.extend(source.correct_indices)
        target.incorrect_indices.extend(source.incorrect_indices)
        target.partial_indices.extend(source.partial_indices)
        target.missed_indices.extend(source.missed_indices)
        target.spurious_indices.extend(source.spurious_indices)

    def results_to_csv(
        self, mode: str = "overall", scenario: str = "strict", file_path: Optional[str] = None
    ) -> Union[str, None]:
        """
        Convert results to CSV format.

        Args:
            mode: Either 'overall' for overall metrics or 'entities' for per-entity metrics
            scenario: The scenario to report on (only used when mode is 'entities')
            file_path: Optional path to save CSV file. If None, returns CSV as string

        Returns:
            CSV content as string if file_path is None, otherwise None (saves to file)
        """
        valid_modes = {"overall", "entities"}
        valid_scenarios = {"strict", "ent_type", "partial", "exact"}

        if mode not in valid_modes:
            raise ValueError(f"Invalid mode: must be one of {valid_modes}")

        if mode == "entities" and scenario not in valid_scenarios:
            raise ValueError(f"Invalid scenario: must be one of {valid_scenarios}")

        results = self.evaluate()

        if mode == "overall":
            # For overall mode, include all scenarios
            csv_data = [
                ["Strategy", "Correct", "Incorrect", "Partial", "Missed", "Spurious", "Precision", "Recall", "F1-Score"]
            ]
            results_data = results["overall"]
            for strategy_name, strategy_result in results_data.items():
                csv_data.append(
                    [
                        strategy_name,
                        strategy_result.correct,
                        strategy_result.incorrect,
                        strategy_result.partial,
                        strategy_result.missed,
                        strategy_result.spurious,
                        strategy_result.precision,
                        strategy_result.recall,
                        strategy_result.f1,
                    ]
                )
        else:
            csv_data = [
                ["Entity", "Correct", "Incorrect", "Partial", "Missed", "Spurious", "Precision", "Recall", "F1-Score"]
            ]
            results_data = results["entities"]
            for entity_type, entity_results in results_data.items():
                if scenario in entity_results:
                    strategy_result = entity_results[scenario]
                    csv_data.append(
                        [
                            entity_type,
                            strategy_result.correct,
                            strategy_result.incorrect,
                            strategy_result.partial,
                            strategy_result.missed,
                            strategy_result.spurious,
                            strategy_result.precision,
                            strategy_result.recall,
                            strategy_result.f1,
                        ]
                    )

        if file_path:
            with open(file_path, "w", newline="", encoding="utf-8") as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(csv_data)
            return None

        output = io.StringIO()
        writer = csv.writer(output)
        writer.writerows(csv_data)
        return output.getvalue()

    def summary_report(self, mode: str = "overall", scenario: str = "strict", digits: int = 2) -> str:
        """
        Generate a summary report of the evaluation results.

        Args:
            mode: Either 'overall' for overall metrics or 'entities' for per-entity metrics.
            scenario: The scenario to report on. Only used when mode is 'entities'.
                      Must be one of:
                        - 'strict': exact boundary match over the surface string and matching entity type;
                        - 'exact': exact boundary match over the surface string, regardless of the type;
                        - 'partial': partial boundary match over the surface string, regardless of the type;
                        - 'ent_type': some overlap between the predicted and true entity is required, with a matching entity type;
            digits: The number of digits to round the results to.

        Returns:
            A string containing the summary report.

        Raises:
            ValueError: If the scenario or mode is invalid.
        """
        valid_scenarios = {"strict", "ent_type", "partial", "exact"}
        valid_modes = {"overall", "entities"}

        if mode not in valid_modes:
            raise ValueError(f"Invalid mode: must be one of {valid_modes}")

        if mode == "entities" and scenario not in valid_scenarios:
            raise ValueError(f"Invalid scenario: must be one of {valid_scenarios}")

        headers = ["correct", "incorrect", "partial", "missed", "spurious", "precision", "recall", "f1-score"]
        rows = [headers]

        results = self.evaluate()
        if mode == "overall":
            # Process overall results - show all scenarios
            results_data = results["overall"]
            for eval_schema in sorted(valid_scenarios):  # Sort to ensure consistent order
                if eval_schema not in results_data:
                    continue
                results_schema = results_data[eval_schema]
                rows.append(
                    [
                        eval_schema,
                        results_schema.correct,
                        results_schema.incorrect,
                        results_schema.partial,
                        results_schema.missed,
                        results_schema.spurious,
                        results_schema.precision,
                        results_schema.recall,
                        results_schema.f1,
                    ]
                )
        else:
            # Process entity-specific results for the specified scenario only
            results_data = results["entities"]
            target_names = sorted(results_data.keys())
            for ent_type in target_names:
                if scenario not in results_data[ent_type]:
                    continue  # Skip if scenario not available for this entity type

                results_ent = results_data[ent_type][scenario]
                rows.append(
                    [
                        ent_type,
                        results_ent.correct,
                        results_ent.incorrect,
                        results_ent.partial,
                        results_ent.missed,
                        results_ent.spurious,
                        results_ent.precision,
                        results_ent.recall,
                        results_ent.f1,
                    ]
                )

        # Format the report
        name_width = max(len(str(row[0])) for row in rows)
        width = max(name_width, digits)
        head_fmt = "{:>{width}s} " + " {:>11}" * len(headers)
        report = f"Scenario: {scenario if mode == 'entities' else 'all'}\n\n" + head_fmt.format(
            "", *headers, width=width
        )
        report += "\n\n"
        row_fmt = "{:>{width}s} " + " {:>11}" * 5 + " {:>11.{digits}f}" * 3 + "\n"

        for row in rows[1:]:
            report += row_fmt.format(*row, width=width, digits=digits)

        return report

    def summary_report_indices(  # pylint: disable=too-many-branches
        self, mode: str = "overall", scenario: str = "strict", colors: bool = False
    ) -> str:
        """
        Generate a summary report of the evaluation indices.

        Args:
            mode: Either 'overall' for overall metrics or 'entities' for per-entity metrics.
            scenario: The scenario to report on. Must be one of: 'strict', 'ent_type', 'partial', 'exact'.
                     Only used when mode is 'entities'. Defaults to 'strict'.
            colors: Whether to use colors in the output. Defaults to False.

        Returns:
            A string containing the summary report of indices.

        Raises:
            ValueError: If the scenario or mode is invalid.
        """
        valid_scenarios = {"strict", "ent_type", "partial", "exact"}
        valid_modes = {"overall", "entities"}

        if mode not in valid_modes:
            raise ValueError(f"Invalid mode: must be one of {valid_modes}")

        if mode == "entities" and scenario not in valid_scenarios:
            raise ValueError(f"Invalid scenario: must be one of {valid_scenarios}")

        # ANSI color codes
        COLORS = {
            "reset": "\033[0m",
            "bold": "\033[1m",
            "red": "\033[91m",
            "green": "\033[92m",
            "yellow": "\033[93m",
            "blue": "\033[94m",
            "magenta": "\033[95m",
            "cyan": "\033[96m",
            "white": "\033[97m",
        }

        def colorize(text: str, color: str) -> str:
            """Helper function to colorize text if colors are enabled."""
            if colors:
                return f"{COLORS[color]}{text}{COLORS['reset']}"
            return text

        def get_prediction_info(pred: Union[Entity, str]) -> str:
            """Helper function to get prediction info based on pred type."""
            if isinstance(pred, Entity):
                return f"Label={pred.label}, Start={pred.start}, End={pred.end}"
            # String (BIO tag)
            return f"Tag={pred}"

        results = self.evaluate()
        report = ""

        # Create headers for the table
        headers = ["Category", "Instance", "Entity", "Details"]
        header_fmt = "{:<20} {:<10} {:<8} {:<25}"
        row_fmt = "{:<20} {:<10} {:<8} {:<10}"

        if mode == "overall":
            # Get the indices from the overall results
            indices_data = results["overall_indices"][scenario]
            report += f"\n{colorize('Indices for error schema', 'bold')} '{colorize(scenario, 'cyan')}':\n\n"
            report += colorize(header_fmt.format(*headers), "bold") + "\n"
            report += colorize("-" * 78, "white") + "\n"

            for category, indices in indices_data.__dict__.items():
                if not category.endswith("_indices"):
                    continue
                category_name = category.replace("_indices", "").replace("_", " ").capitalize()

                # Color mapping for categories
                category_colors = {
                    "Correct": "green",
                    "Incorrect": "red",
                    "Partial": "yellow",
                    "Missed": "magenta",
                    "Spurious": "blue",
                }

                if indices:
                    for instance_index, entity_index in indices:
                        if self.pred != [[]]:
                            pred = self.pred[instance_index][entity_index]
                            prediction_info = get_prediction_info(pred)
                            report += (
                                row_fmt.format(
                                    colorize(category_name, category_colors.get(category_name, "white")),
                                    f"{instance_index}",
                                    f"{entity_index}",
                                    prediction_info,
                                )
                                + "\n"
                            )
                        else:
                            report += (
                                row_fmt.format(
                                    colorize(category_name, category_colors.get(category_name, "white")),
                                    f"{instance_index}",
                                    f"{entity_index}",
                                    "No prediction info",
                                )
                                + "\n"
                            )
                else:
                    report += (
                        row_fmt.format(
                            colorize(category_name, category_colors.get(category_name, "white")), "-", "-", "None"
                        )
                        + "\n"
                    )
        else:
            # Get the indices from the entity-specific results
            for entity_type, entity_results in results["entity_indices"].items():
                report += f"\n{colorize('Entity Type', 'bold')}: {colorize(entity_type, 'cyan')}\n"
                report += f"{colorize('Error Schema', 'bold')}: '{colorize(scenario, 'cyan')}'\n\n"
                report += colorize(header_fmt.format(*headers), "bold") + "\n"
                report += colorize("-" * 78, "white") + "\n"

                error_data = entity_results[scenario]
                for category, indices in error_data.__dict__.items():
                    if not category.endswith("_indices"):
                        continue
                    category_name = category.replace("_indices", "").replace("_", " ").capitalize()

                    # Color mapping for categories
                    category_colors = {
                        "Correct": "green",
                        "Incorrect": "red",
                        "Partial": "yellow",
                        "Missed": "magenta",
                        "Spurious": "blue",
                    }

                    if indices:
                        for instance_index, entity_index in indices:
                            if self.pred != [[]]:
                                pred = self.pred[instance_index][entity_index]
                                prediction_info = get_prediction_info(pred)
                                report += (
                                    row_fmt.format(
                                        colorize(category_name, category_colors.get(category_name, "white")),
                                        f"{instance_index}",
                                        f"{entity_index}",
                                        prediction_info,
                                    )
                                    + "\n"
                                )
                            else:
                                report += (
                                    row_fmt.format(
                                        colorize(category_name, category_colors.get(category_name, "white")),
                                        f"{instance_index}",
                                        f"{entity_index}",
                                        "No prediction info",
                                    )
                                    + "\n"
                                )
                    else:
                        report += (
                            row_fmt.format(
                                colorize(category_name, category_colors.get(category_name, "white")), "-", "-", "None"
                            )
                            + "\n"
                        )

        return report
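
# Minimal usage sketch (illustrative, not part of the module), using the list
# loader and the reporting methods defined above:
#
#     evaluator = Evaluator(
#         true=[["O", "B-PER", "I-PER", "O"]],
#         pred=[["O", "B-PER", "O", "O"]],
#         tags=["PER"],
#         loader="list",
#     )
#     results = evaluator.evaluate()                     # dict with 'overall', 'entities' and index keys
#     print(evaluator.summary_report(mode="overall"))    # all four scenarios
#     print(evaluator.summary_report(mode="entities", scenario="partial"))
#     csv_text = evaluator.results_to_csv(mode="entities", scenario="strict")
#     print(evaluator.summary_report_indices(mode="overall", scenario="strict"))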


================================================
FILE: src/nervaluate/loaders.py
================================================
from abc import ABC, abstractmethod
from typing import List, Dict, Any

from .entities import Entity


class DataLoader(ABC):
    """Abstract base class for data loaders."""

    @abstractmethod
    def load(self, data: Any) -> List[List[Entity]]:
        """Load data into a list of entity lists."""


class ConllLoader(DataLoader):
    """Loader for CoNLL format data."""

    def load(self, data: str) -> List[List[Entity]]:  # pylint: disable=too-many-branches
        """Load CoNLL format data into a list of Entity lists."""
        if not isinstance(data, str):
            raise ValueError("ConllLoader expects string input")

        if not data:
            return []

        result: List[List[Entity]] = []
        # Strip trailing whitespace and newlines to avoid empty documents
        documents = data.rstrip().split("\n\n")

        for doc in documents:
            if not doc.strip():
                result.append([])
                continue

            current_doc = []
            start_offset = None
            end_offset = None
            ent_type = None
            has_entities = False

            for offset, line in enumerate(doc.split("\n")):
                if not line.strip():
                    continue

                parts = line.split("\t")
                if len(parts) < 2:
                    raise ValueError(f"Invalid CoNLL format: line '{line}' does not contain a tab separator")

                token_tag = parts[1]

                if token_tag == "O":
                    if ent_type is not None and start_offset is not None:
                        end_offset = offset - 1
                        if isinstance(start_offset, int) and isinstance(end_offset, int):
                            current_doc.append(Entity(label=ent_type, start=start_offset, end=end_offset))
                        start_offset = None
                        end_offset = None
                        ent_type = None

                elif ent_type is None:
                    if not (token_tag.startswith("B-") or token_tag.startswith("I-")):
                        raise ValueError(f"Invalid tag format: {token_tag}")
                    ent_type = token_tag[2:]  # Remove B- or I- prefix
                    start_offset = offset
                    has_entities = True

                elif ent_type != token_tag[2:] or (ent_type == token_tag[2:] and token_tag[:1] == "B"):
                    end_offset = offset - 1
                    if isinstance(start_offset, int) and isinstance(end_offset, int):
                        current_doc.append(Entity(label=ent_type, start=start_offset, end=end_offset))

                    # start of a new entity
                    if not (token_tag.startswith("B-") or token_tag.startswith("I-")):
                        raise ValueError(f"Invalid tag format: {token_tag}")
                    ent_type = token_tag[2:]
                    start_offset = offset
                    end_offset = None
                    has_entities = True

            # Catches an entity that goes up until the last token
            if ent_type is not None and start_offset is not None and end_offset is None:
                if isinstance(start_offset, int):
                    current_doc.append(Entity(label=ent_type, start=start_offset, end=len(doc.split("\n")) - 1))
                has_entities = True

            result.append(current_doc if has_entities else [])

        return result
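
# Illustrative example (not part of the module): ConllLoader turns a tab-separated
# block into one Entity list per document, e.g.
#
#     ConllLoader().load("Alex\tB-PER\nis\tO\nin\tO\nLos\tB-LOC\nAngeles\tI-LOC")
#     # -> [[Entity(label='PER', start=0, end=0), Entity(label='LOC', start=3, end=4)]]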


class ListLoader(DataLoader):
    """Loader for list format data."""

    def load(self, data: List[List[str]]) -> List[List[Entity]]:  # pylint: disable=too-many-branches
        """Load list format data into a list of entity lists."""
        if not isinstance(data, list):
            raise ValueError("ListLoader expects list input")

        if not data:
            return []

        result = []

        for doc in data:
            if not isinstance(doc, list):
                raise ValueError("Each document must be a list of tags")

            current_doc = []
            start_offset = None
            end_offset = None
            ent_type = None

            for offset, token_tag in enumerate(doc):
                if not isinstance(token_tag, str):
                    raise ValueError(f"Invalid tag type: {type(token_tag)}")

                if token_tag == "O":
                    if ent_type is not None and start_offset is not None:
                        end_offset = offset - 1
                        if isinstance(start_offset, int) and isinstance(end_offset, int):
                            current_doc.append(Entity(label=ent_type, start=start_offset, end=end_offset))
                        start_offset = None
                        end_offset = None
                        ent_type = None

                elif ent_type is None:
                    if not (token_tag.startswith("B-") or token_tag.startswith("I-")):
                        raise ValueError(f"Invalid tag format: {token_tag}")
                    ent_type = token_tag[2:]  # Remove B- or I- prefix
                    start_offset = offset

                elif ent_type != token_tag[2:] or (ent_type == token_tag[2:] and token_tag[:1] == "B"):
                    end_offset = offset - 1
                    if isinstance(start_offset, int) and isinstance(end_offset, int):
                        current_doc.append(Entity(label=ent_type, start=start_offset, end=end_offset))

                    # start of a new entity
                    if not (token_tag.startswith("B-") or token_tag.startswith("I-")):
                        raise ValueError(f"Invalid tag format: {token_tag}")
                    ent_type = token_tag[2:]
                    start_offset = offset
                    end_offset = None

            # Catches an entity that goes up until the last token
            if ent_type is not None and start_offset is not None and end_offset is None:
                if isinstance(start_offset, int):
                    current_doc.append(Entity(label=ent_type, start=start_offset, end=len(doc) - 1))

            result.append(current_doc)

        return result
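
# Illustrative example (not part of the module): ListLoader applies the same BIO
# parsing to plain tag lists, e.g.
#
#     ListLoader().load([["B-PER", "I-PER", "O", "B-LOC"]])
#     # -> [[Entity(label='PER', start=0, end=1), Entity(label='LOC', start=3, end=3)]]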


class DictLoader(DataLoader):
    """Loader for dictionary format data."""

    def load(self, data: List[List[Dict[str, Any]]]) -> List[List[Entity]]:
        """Load dictionary format data into a list of entity lists."""
        if not isinstance(data, list):
            raise ValueError("DictLoader expects list input")

        if not data:
            return []

        result = []

        for doc in data:
            if not isinstance(doc, list):
                raise ValueError("Each document must be a list of entity dictionaries")

            current_doc = []
            for entity in doc:
                if not isinstance(entity, dict):
                    raise ValueError(f"Invalid entity type: {type(entity)}")

                required_keys = {"label", "start", "end"}
                if not all(key in entity for key in required_keys):
                    raise ValueError(f"Entity missing required keys: {required_keys}")

                if not isinstance(entity["label"], str):
                    raise ValueError("Entity label must be a string")

                if not isinstance(entity["start"], int) or not isinstance(entity["end"], int):
                    raise ValueError("Entity start and end must be integers")

                current_doc.append(Entity(label=entity["label"], start=entity["start"], end=entity["end"]))
            result.append(current_doc)

        return result


================================================
FILE: src/nervaluate/strategies.py
================================================
from abc import ABC, abstractmethod
from typing import List, Tuple

from .entities import Entity, EvaluationResult, EvaluationIndices


class EvaluationStrategy(ABC):
    """Abstract base class for evaluation strategies."""

    def __init__(self, min_overlap_percentage: float = 1.0):
        """
        Initialize strategy with minimum overlap threshold.

        Args:
            min_overlap_percentage: Minimum overlap percentage required (1-100)
        """
        if not 1.0 <= min_overlap_percentage <= 100.0:
            raise ValueError("min_overlap_percentage must be between 1.0 and 100.0")
        self.min_overlap_percentage = min_overlap_percentage

    @staticmethod
    def _calculate_overlap_percentage(pred: Entity, true: Entity) -> float:
        """
        Calculate the percentage overlap between predicted and true entities.

        Returns:
            Overlap percentage based on true entity span (0-100)
        """
        # Check if there's any overlap first
        if pred.start > true.end or pred.end < true.start:
            return 0.0

        # Calculate overlap boundaries
        overlap_start = max(pred.start, true.start)
        overlap_end = min(pred.end, true.end)

        # Calculate spans (adding 1 because end is inclusive)
        overlap_span = overlap_end - overlap_start + 1
        true_span = true.end - true.start + 1

        # Calculate percentage based on true entity span
        return (overlap_span / true_span) * 100.0
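
    # Illustrative worked example (not part of the class): for a prediction spanning
    # tokens 2..6 and a true entity spanning tokens 4..9, the overlap is tokens 4..6,
    # i.e. 3 tokens out of a 6-token true span, so the method returns 50.0:
    #
    #     EvaluationStrategy._calculate_overlap_percentage(Entity("X", 2, 6), Entity("X", 4, 9))  # -> 50.0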

    @staticmethod
    def _calculate_boundaries_distance(pred: Entity, true: Entity) -> float:
        """
        Calculate distance between predicted and true entities boundaries.

        Returns:
            Distance between predicted and true boundaries
        """
        # Calculate boundaries gaps
        distance_starts = abs(pred.start - true.start)
        distance_ends = abs(pred.end - true.end)

        return distance_starts + distance_ends

    def _has_sufficient_overlap(self, pred: Entity, true: Entity) -> bool:
        """Check if entities have sufficient overlap based on threshold."""
        overlap_percentage = EvaluationStrategy._calculate_overlap_percentage(pred, true)
        return overlap_percentage >= self.min_overlap_percentage

    @abstractmethod
    def evaluate(
        self, true_entities: List[Entity], pred_entities: List[Entity], tags: List[str], instance_index: int = 0
    ) -> Tuple[EvaluationResult, EvaluationIndices]:
        """Evaluate the predicted entities against the true entities."""


class StrictEvaluation(EvaluationStrategy):
    """
    Strict evaluation strategy - entities must match exactly.

    If there's a predicted entity that perfectly matches a true entity and they have the same label
    we mark it as correct.
    If there's a predicted entity that doesn't perfectly match any true entity, we mark it as spurious.
    If there's a true entity that doesn't perfectly match any predicted entity, we mark it as missed.
    All other cases are marked as incorrect.
    """

    def evaluate(
        self, true_entities: List[Entity], pred_entities: List[Entity], tags: List[str], instance_index: int = 0
    ) -> Tuple[EvaluationResult, EvaluationIndices]:
        """
        Evaluate the predicted entities against the true entities using strict matching.
        """
        result = EvaluationResult()
        indices = EvaluationIndices()
        matched_true = set()

        for pred_idx, pred in enumerate(pred_entities):
            found_match = False
            found_incorrect = False

            for true_idx, true in enumerate(true_entities):
                if true_idx in matched_true:
                    continue

                # Check for perfect match (same boundaries and label)
                if pred.label == true.label and pred.start == true.start and pred.end == true.end:
                    result.correct += 1
                    indices.correct_indices.append((instance_index, pred_idx))
                    matched_true.add(true_idx)
                    found_match = True
                    break
                # Check for sufficient overlap with min threshold
                if self._has_sufficient_overlap(pred, true) and not found_incorrect:
                    incorrect_true_idx = true_idx
                    incorrect_pred_idx = pred_idx
                    found_incorrect = True

            if not found_match:
                if found_incorrect:
                    result.incorrect += 1
                    indices.incorrect_indices.append((instance_index, incorrect_pred_idx))
                    matched_true.add(incorrect_true_idx)
                else:
                    result.spurious += 1
                    indices.spurious_indices.append((instance_index, pred_idx))

        for true_idx, true in enumerate(true_entities):
            if true_idx not in matched_true:
                result.missed += 1
                indices.missed_indices.append((instance_index, true_idx))

        result.compute_metrics()
        return result, indices


class PartialEvaluation(EvaluationStrategy):
    """
    Partial evaluation strategy - allows for partial matches.

    If there's a predicted entity that perfectly matches a true entity, we mark it as correct.
    If there's a predicted entity that doesn't match any true entity and that has some minimum
    overlap with a true entity we mark it as partial.
    If there's a predicted entity that doesn't match any true entity, we mark it as spurious.
    If there's a true entity that doesn't match any predicted entity, we mark it as missed.

    There's never entity type/label checking in this strategy, and there's never an entity marked as incorrect.
    """

    def evaluate(
        self, true_entities: List[Entity], pred_entities: List[Entity], tags: List[str], instance_index: int = 0
    ) -> Tuple[EvaluationResult, EvaluationIndices]:
        result = EvaluationResult()
        indices = EvaluationIndices()
        matched_true = set()

        for pred_idx, pred in enumerate(pred_entities):
            found_match = False
            found_partial = False

            for true_idx, true in enumerate(true_entities):
                if true_idx in matched_true:
                    continue

                # Check for sufficient overlap with min threshold
                if self._has_sufficient_overlap(pred, true):
                    if pred.start == true.start and pred.end == true.end:
                        result.correct += 1
                        indices.correct_indices.append((instance_index, pred_idx))
                        matched_true.add(true_idx)
                        found_match = True
                        break
                    if not found_partial:
                        partial_pred_idx = pred_idx
                        partial_true_idx = true_idx
                        found_partial = True

            if not found_match:
                if found_partial:
                    result.partial += 1
                    indices.partial_indices.append((instance_index, partial_pred_idx))
                    matched_true.add(partial_true_idx)
                else:
                    result.spurious += 1
                    indices.spurious_indices.append((instance_index, pred_idx))

        for true_idx, true in enumerate(true_entities):
            if true_idx not in matched_true:
                result.missed += 1
                indices.missed_indices.append((instance_index, true_idx))

        result.compute_metrics(partial_or_type=True)
        return result, indices


class EntityTypeEvaluation(EvaluationStrategy):
    """
    Entity type evaluation strategy - only checks entity types.

    In this strategy, we check for overlap between the predicted entity and the true entity.

    If there's a predicted entity that perfectly matches or only some minimum overlap with a
    true entity, and the same label, we mark it as correct. If there are multiple entities
    with at least some minimum overlap, we mark as correct the one with boundaries closest to
    a true entity.
    If there's a predicted entity that doesn't match any true entity and that has some minimum
    overlap or perfectly matches but has the wrong label we mark it as incorrect.
    If there's a predicted entity that doesn't match any true entity, we mark it as spurious.
    If there's a true entity that doesn't match any predicted entity, we mark it as missed.

    When multiple true entities of the same label overlap a prediction, the match is chosen by
    closest boundaries (minimum sum of start and end offset differences), so which true entity
    is considered "missed" may differ from list order.
    """


    def evaluate(
        self, true_entities: List[Entity], pred_entities: List[Entity], tags: List[str], instance_index: int = 0
    ) -> Tuple[EvaluationResult, EvaluationIndices]:
        result = EvaluationResult()
        indices = EvaluationIndices()
        matched_true = set()

        for pred_idx, pred in enumerate(pred_entities):
            found_match = False
            found_incorrect = False
            current_match_boundaries_distance = None

            for true_idx, true in enumerate(true_entities):
                if true_idx in matched_true:
                    continue

                # Check for sufficient overlap with min threshold
                if self._has_sufficient_overlap(pred, true):
                    boundaries_distance = self._calculate_boundaries_distance(pred, true)
                    if pred.label == true.label:
                        if (
                            current_match_boundaries_distance is None
                            or boundaries_distance < current_match_boundaries_distance
                        ):
                            correct_true_idx = true_idx
                            correct_pred_idx = pred_idx
                            current_match_boundaries_distance = boundaries_distance
                            found_match = True

                    elif not found_incorrect:
                        incorrect_true_idx = true_idx
                        incorrect_pred_idx = pred_idx
                        found_incorrect = True

            if found_match:
                result.correct += 1
                indices.correct_indices.append((instance_index, correct_pred_idx))
                matched_true.add(correct_true_idx)
            else:
                if found_incorrect:
                    result.incorrect += 1
                    indices.incorrect_indices.append((instance_index, incorrect_pred_idx))
                    matched_true.add(incorrect_true_idx)
                else:
                    result.spurious += 1
                    indices.spurious_indices.append((instance_index, pred_idx))

        for true_idx, true in enumerate(true_entities):
            if true_idx not in matched_true:
                result.missed += 1
                indices.missed_indices.append((instance_index, true_idx))

        result.compute_metrics(partial_or_type=True)
        return result, indices


class ExactEvaluation(EvaluationStrategy):
    """
    Exact evaluation strategy - exact boundary match over the surface string, regardless of the type.

    If there's a predicted entity that perfectly matches a true entity, regardless of the label, we mark it as correct.
    If there's a predicted entity that doesn't match any true entity and that has only some minimum
    overlap with a true entity, we mark it as incorrect.
    If there's a predicted entity that doesn't match any true entity, we mark it as spurious.
    If there's a true entity that doesn't match any predicted entity, we mark it as missed.
    """

    def evaluate(
        self, true_entities: List[Entity], pred_entities: List[Entity], tags: List[str], instance_index: int = 0
    ) -> Tuple[EvaluationResult, EvaluationIndices]:
        """
        Evaluate the predicted entities against the true entities using exact boundary matching.
        Entity type is not considered in the matching.
        """
        result = EvaluationResult()
        indices = EvaluationIndices()
        matched_true = set()

        for pred_idx, pred in enumerate(pred_entities):
            found_match = False
            found_incorrect = False

            for true_idx, true in enumerate(true_entities):
                if true_idx in matched_true:
                    continue

                # Check for exact boundary match (regardless of label)
                if pred.start == true.start and pred.end == true.end:
                    result.correct += 1
                    indices.correct_indices.append((instance_index, pred_idx))
                    matched_true.add(true_idx)
                    found_match = True
                    break
                # Check for sufficient overlap with min threshold
                if self._has_sufficient_overlap(pred, true) and not found_incorrect:
                    incorrect_true_idx = true_idx
                    incorrect_pred_idx = pred_idx
                    found_incorrect = True

            if not found_match:
                if found_incorrect:
                    result.incorrect += 1
                    indices.incorrect_indices.append((instance_index, incorrect_pred_idx))
                    matched_true.add(incorrect_true_idx)
                else:
                    result.spurious += 1
                    indices.spurious_indices.append((instance_index, pred_idx))

        for true_idx, true in enumerate(true_entities):
            if true_idx not in matched_true:
                result.missed += 1
                indices.missed_indices.append((instance_index, true_idx))

        result.compute_metrics()
        return result, indices
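
# Illustrative comparison of the four strategies above (not part of the module),
# for a single true entity Entity("PER", 2, 4):
#
#   - pred Entity("LOC", 2, 4) (right span, wrong label):
#       strict -> incorrect, exact -> correct, partial -> correct, ent_type -> incorrect
#   - pred Entity("PER", 2, 3) (overlapping span, right label):
#       strict -> incorrect, exact -> incorrect, partial -> partial, ent_type -> correct
#
# e.g. StrictEvaluation().evaluate([Entity("PER", 2, 4)], [Entity("PER", 2, 3)], ["PER"])
# returns (result, indices) where result.incorrect == 1.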


================================================
FILE: src/nervaluate/utils.py
================================================
def split_list(token: list[str], split_chars: list[str] | None = None) -> list[list[str]]:
    """
    Split a list into sublists based on a list of split characters.

    If split_chars is None, the list is split on empty strings.

    :param token: The list to split.
    :param split_chars: The characters to split on.

    :returns:
        A list of lists.
    """
    if split_chars is None:
        split_chars = [""]
    out = []
    chunk = []
    for i, item in enumerate(token):
        if item not in split_chars:
            chunk.append(item)
            if i + 1 == len(token):
                out.append(chunk)
        else:
            out.append(chunk)
            chunk = []
    return out
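
# Illustrative example (not part of the module), splitting on the default empty-string marker:
#
#     split_list(["Alex\tB-PER", "", "ACME\tB-ORG", "Corp\tI-ORG"])
#     # -> [['Alex\tB-PER'], ['ACME\tB-ORG', 'Corp\tI-ORG']]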


def conll_to_spans(doc: str) -> list[list[dict]]:
    """
    Convert a CoNLL-formatted string to a list of spans.

    :param doc: The CoNLL-formatted string.

    :returns:
        A list of spans.
    """
    out = []
    doc_parts = split_list(doc.split("\n"), split_chars=None)

    for example in doc_parts:
        labels = []
        for token in example:
            token_parts = token.split("\t")
            label = token_parts[1]
            labels.append(label)
        out.append(labels)

    spans = list_to_spans(out)

    return spans


def list_to_spans(doc: list[list[str]]) -> list[list[dict]]:
    """
    Convert a list of tags to a list of spans.

    :param doc: The list of tags.

    :returns:
        A list of spans.
    """
    spans = [collect_named_entities(tokens) for tokens in doc]
    return spans


def collect_named_entities(tokens: list[str]) -> list[dict]:
    """
    Collects named entities from a list of BIO tags, storing the entity type and the start and end offsets of each entity.

    :param tokens: a list of tags

    :returns:
        A list of entity dictionaries with 'label', 'start' and 'end' keys.
    """

    named_entities = []
    start_offset = None
    end_offset = None
    ent_type = None

    for offset, token_tag in enumerate(tokens):
        if token_tag == "O":
            if ent_type is not None and start_offset is not None:
                end_offset = offset - 1
                named_entities.append({"label": ent_type, "start": start_offset, "end": end_offset})
                start_offset = None
                end_offset = None
                ent_type = None

        elif ent_type is None:
            ent_type = token_tag[2:]
            start_offset = offset

        elif ent_type != token_tag[2:] or (ent_type == token_tag[2:] and token_tag[:1] == "B"):
            end_offset = offset - 1
            named_entities.append({"label": ent_type, "start": start_offset, "end": end_offset})

            # start of a new entity
            ent_type = token_tag[2:]
            start_offset = offset
            end_offset = None

    # Catches an entity that goes up until the last token
    if ent_type is not None and start_offset is not None and end_offset is None:
        named_entities.append({"label": ent_type, "start": start_offset, "end": len(tokens) - 1})

    return named_entities
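
# Illustrative example (not part of the module):
#
#     collect_named_entities(["B-PER", "I-PER", "O", "B-LOC"])
#     # -> [{'label': 'PER', 'start': 0, 'end': 1}, {'label': 'LOC', 'start': 3, 'end': 3}]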


def find_overlap(true_range: range, pred_range: range) -> set:
    """
    Find the overlap between two ranges.

    :param true_range: The true range.
    :param pred_range: The predicted range.

    :returns:
        A set of overlapping values.

    Examples:
        >>> find_overlap(range(1, 3), range(2, 4))
        {2}
        >>> find_overlap(range(1, 3), range(3, 5))
        set()
    """

    true_set = set(true_range)
    pred_set = set(pred_range)
    overlaps = true_set.intersection(pred_set)

    return overlaps


def clean_entities(ent: dict) -> dict:
    """
    Returns just the useful keys if additional keys are present in the entity
    dict.

    This may happen if passing a list of spans directly from prodigy, which
    typically may include 'token_start' and 'token_end'.
    """
    return {"start": ent["start"], "end": ent["end"], "label": ent["label"]}


================================================
FILE: tests/__init__.py
================================================
import sys

sys.path.append("../src/nervaluate")


================================================
FILE: tests/test_entities.py
================================================
from nervaluate.entities import Entity, EvaluationResult


def test_entity_equality():
    """Test Entity equality comparison."""
    entity1 = Entity(label="PER", start=0, end=1)
    entity2 = Entity(label="PER", start=0, end=1)
    entity3 = Entity(label="ORG", start=0, end=1)

    assert entity1 == entity2
    assert entity1 != entity3
    assert entity1 != "not an entity"


def test_entity_hash():
    """Test Entity hashing."""
    entity1 = Entity(label="PER", start=0, end=1)
    entity2 = Entity(label="PER", start=0, end=1)
    entity3 = Entity(label="ORG", start=0, end=1)

    assert hash(entity1) == hash(entity2)
    assert hash(entity1) != hash(entity3)


def test_evaluation_result_compute_metrics():
    """Test computation of evaluation metrics."""
    result = EvaluationResult(correct=5, incorrect=2, partial=1, missed=1, spurious=1)

    # Test strict metrics
    result.compute_metrics(partial_or_type=False)
    assert result.precision == 5 / 9  # 5/(5+2+1+1)
    assert result.recall == 5 / (5 + 2 + 1 + 1)

    # Test partial metrics
    result.compute_metrics(partial_or_type=True)
    assert result.precision == 5.5 / 9  # (5+0.5*1)/(5+2+1+1)
    assert result.recall == (5 + 0.5 * 1) / (5 + 2 + 1 + 1)


def test_evaluation_result_zero_cases():
    """Test evaluation metrics with zero values."""
    result = EvaluationResult()
    result.compute_metrics()
    assert result.precision == 0
    assert result.recall == 0
    assert result.f1 == 0


================================================
FILE: tests/test_evaluator.py
================================================
import csv
import io
import pytest
from nervaluate.evaluator import Evaluator


@pytest.fixture
def sample_data():
    true = [
        ["O", "B-PER", "O", "B-ORG", "I-ORG", "B-LOC"],
        ["O", "B-PER", "O", "B-ORG"],
    ]

    pred = [
        ["O", "B-PER", "O", "B-ORG", "O", "B-PER"],
        ["O", "B-PER", "O", "B-LOC"],
    ]

    return true, pred


def test_evaluator_initialization(sample_data):
    """Test evaluator initialization."""
    true, pred = sample_data
    evaluator = Evaluator(true, pred, ["PER", "ORG", "LOC"], loader="list")

    assert len(evaluator.true) == 2
    assert len(evaluator.pred) == 2
    assert evaluator.tags == ["PER", "ORG", "LOC"]


def test_evaluator_evaluation(sample_data):
    """Test evaluation process."""
    true, pred = sample_data
    evaluator = Evaluator(true, pred, ["PER", "ORG", "LOC"], loader="list")
    results = evaluator.evaluate()

    # Check that we have results for all strategies
    assert "overall" in results
    assert "entities" in results
    assert "strict" in results["overall"]
    assert "partial" in results["overall"]
    assert "ent_type" in results["overall"]

    # Check that we have results for each entity type
    for entity in ["PER", "ORG", "LOC"]:
        assert entity in results["entities"]
        assert "strict" in results["entities"][entity]
        assert "partial" in results["entities"][entity]
        assert "ent_type" in results["entities"][entity]


def test_evaluator_with_invalid_tags(sample_data):
    """Test evaluator with invalid tags."""
    true, pred = sample_data
    evaluator = Evaluator(true, pred, ["INVALID"], loader="list")
    results = evaluator.evaluate()

    for strategy in ["strict", "partial", "ent_type"]:
        assert results["overall"][strategy].correct == 0
        assert results["overall"][strategy].incorrect == 0
        assert results["overall"][strategy].partial == 0
        assert results["overall"][strategy].missed == 0
        assert results["overall"][strategy].spurious == 0


def test_partial_and_ent_type_metrics_use_partial_formula_after_merge():
    """
    Test that partial and ent_type strategies use (COR + 0.5*PAR)/ACT for precision/recall
    after merging multi-document results, not the strict formula COR/ACT.

    Uses the README usage example: 2 documents, partial has correct=2, partial=3.
    SemEval partial formula gives P=R=(2+0.5*3)/5=0.7; strict would give 0.4.
    This test would have caught the bug where _merge_results called compute_metrics()
    without partial_or_type=True, overwriting partial/ent_type metrics with strict values.
    """
    # README usage example (2 documents so _merge_results is exercised)
    true = [
        ["O", "B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG"],
        ["O", "B-LOC", "B-PER", "I-PER", "O", "O", "B-DATE"],
    ]
    pred = [
        ["O", "O", "B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"],
        ["O", "B-LOC", "I-LOC", "B-PER", "O", "O", "B-DATE"],
    ]
    evaluator = Evaluator(true, pred, tags=["PER", "ORG", "LOC", "DATE"], loader="list")
    results = evaluator.evaluate()

    strict_res = results["overall"]["strict"]
    partial_res = results["overall"]["partial"]
    ent_type_res = results["overall"]["ent_type"]

    # Partial has correct=2, partial=3, no incorrect/missed/spurious -> ACT=POS=5
    assert partial_res.correct == 2
    assert partial_res.partial == 3
    assert partial_res.incorrect == 0
    assert partial_res.missed == 0
    assert partial_res.spurious == 0
    assert partial_res.actual == 5
    assert partial_res.possible == 5

    # SemEval partial formula: (COR + 0.5*PAR) / ACT and / POS
    expected_partial_precision = (partial_res.correct + 0.5 * partial_res.partial) / partial_res.actual
    expected_partial_recall = (partial_res.correct + 0.5 * partial_res.partial) / partial_res.possible
    assert expected_partial_precision == pytest.approx(0.7)
    assert expected_partial_recall == pytest.approx(0.7)

    # Partial strategy must report these values (not strict 0.4)
    assert partial_res.precision == pytest.approx(expected_partial_precision)
    assert partial_res.recall == pytest.approx(expected_partial_recall)
    assert partial_res.precision != strict_res.precision
    assert partial_res.recall != strict_res.recall

    # ent_type for this example has no partial/incorrect, so P/R=1.0; ensure it used partial formula path
    assert ent_type_res.precision == pytest.approx(1.0)
    assert ent_type_res.recall == pytest.approx(1.0)


def test_evaluator_different_document_lengths():
    """Test that Evaluator raises ValueError when documents have different lengths."""
    true = [
        ["O", "B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG"],  # 8 tokens
        ["O", "B-LOC", "B-PER", "I-PER", "O", "O", "B-DATE"],  # 7 tokens
    ]
    pred = [
        ["O", "B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG"],  # 8 tokens
        ["O", "B-LOC", "I-LOC", "O", "B-PER", "I-PER", "O", "B-DATE", "I-DATE", "O"],  # 10 tokens
    ]
    tags = ["PER", "ORG", "LOC", "DATE"]

    # Test that ValueError is raised
    with pytest.raises(ValueError, match="Document 1 has different lengths: true=7, pred=10"):
        evaluator = Evaluator(true=true, pred=pred, tags=tags, loader="list")
        evaluator.evaluate()


def test_results_to_csv(sample_data, tmp_path):

    true, pred = sample_data
    evaluator = Evaluator(true, pred, ["PER", "ORG", "LOC"], loader="list")

    overall_csv_str = evaluator.results_to_csv(mode="overall")
    assert isinstance(overall_csv_str, str)

    csv_reader = csv.reader(io.StringIO(overall_csv_str))
    overall_csv = list(csv_reader)

    assert len(overall_csv) > 1  # should have header + at least one row
    assert overall_csv[0] == [
        "Strategy",
        "Correct",
        "Incorrect",
        "Partial",
        "Missed",
        "Spurious",
        "Precision",
        "Recall",
        "F1-Score",
    ]

    # check that all strategies are present
    strategies = {row[0] for row in overall_csv[1:]}
    assert strategies == {"strict", "partial", "ent_type", "exact"}

    # test entities mode - return as string
    entities_csv_str = evaluator.results_to_csv(mode="entities", scenario="strict")
    assert isinstance(entities_csv_str, str)

    # parse CSV string to check content
    csv_reader = csv.reader(io.StringIO(entities_csv_str))
    entities_csv = list(csv_reader)

    assert len(entities_csv) > 1  # should have header + at least one row
    assert entities_csv[0] == [
        "Entity",
        "Correct",
        "Incorrect",
        "Partial",
        "Missed",
        "Spurious",
        "Precision",
        "Recall",
        "F1-Score",
    ]

    # check that all entity types are present
    entity_types = {row[0] for row in entities_csv[1:]}
    assert entity_types == {"PER", "ORG", "LOC"}

    # test file saving - overall mode
    overall_file = tmp_path / "overall_results.csv"
    result = evaluator.results_to_csv(mode="overall", file_path=str(overall_file))
    assert result is None  # Should return None when saving to file
    assert overall_file.exists()

    # verify file content
    with open(overall_file, "r", encoding="utf-8") as f:
        saved_csv = list(csv.reader(f))
    assert len(saved_csv) > 1
    assert saved_csv[0][0] == "Strategy"

    # test file saving - entities mode
    entities_file = tmp_path / "entities_results.csv"
    result = evaluator.results_to_csv(mode="entities", scenario="partial", file_path=str(entities_file))
    assert result is None  # Should return None when saving to file
    assert entities_file.exists()

    # verify file content
    with open(entities_file, "r", encoding="utf-8") as f:
        saved_csv = list(csv.reader(f))
    assert len(saved_csv) > 1
    assert saved_csv[0][0] == "Entity"

    # test invalid mode
    with pytest.raises(ValueError, match="Invalid mode: must be one of"):
        evaluator.results_to_csv(mode="invalid")

    # test invalid scenario for entities mode
    with pytest.raises(ValueError, match="Invalid scenario: must be one of"):
        evaluator.results_to_csv(mode="entities", scenario="invalid")


def test_evaluator_with_min_overlap_percentage():
    """Test Evaluator class with minimum overlap percentage parameter."""

    # Test data: true entity spans positions 0-9 (10 tokens)
    true_entities = [[{"label": "PER", "start": 0, "end": 9}]]  # 10-token entity

    # Predicted entities with different overlap percentages
    pred_entities = [[{"label": "PER", "start": 0, "end": 2}]]  # 30% overlap

    # Test with default 1% threshold - should be partial match
    evaluator_default = Evaluator(true=true_entities, pred=pred_entities, tags=["PER"], loader="dict")
    results_default = evaluator_default.evaluate()
    partial_default = results_default["overall"]["partial"]
    assert partial_default.partial == 1
    assert partial_default.spurious == 0

    # Test with 50% threshold - should be spurious
    evaluator_50 = Evaluator(
        true=true_entities, pred=pred_entities, tags=["PER"], loader="dict", min_overlap_percentage=50.0
    )
    results_50 = evaluator_50.evaluate()
    partial_50 = results_50["overall"]["partial"]
    assert partial_50.partial == 0
    assert partial_50.spurious == 1
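
    # Arithmetic behind the two cases: the true PER spans tokens 0-9 (10 tokens)
    # and the prediction spans tokens 0-2 (3 tokens), i.e. a 3/10 = 30% overlap,
    # which counts as partial at the default 1% threshold but becomes spurious
    # once min_overlap_percentage is raised to 50%.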


def test_evaluator_min_overlap_validation():
    """Test that Evaluator validates minimum overlap percentage."""
    true_entities = [[{"label": "PER", "start": 0, "end": 5}]]
    pred_entities = [[{"label": "PER", "start": 0, "end": 5}]]

    # Valid values should work
    Evaluator(true_entities, pred_entities, ["PER"], "dict", min_overlap_percentage=1.0)
    Evaluator(true_entities, pred_entities, ["PER"], "dict", min_overlap_percentage=50.0)
    Evaluator(true_entities, pred_entities, ["PER"], "dict", min_overlap_percentage=100.0)

    # Invalid values should raise ValueError during strategy initialization
    with pytest.raises(ValueError, match="min_overlap_percentage must be between 1.0 and 100.0"):
        Evaluator(true_entities, pred_entities, ["PER"], "dict", min_overlap_percentage=0.5)

    with pytest.raises(ValueError, match="min_overlap_percentage must be between 1.0 and 100.0"):
        Evaluator(true_entities, pred_entities, ["PER"], "dict", min_overlap_percentage=101.0)


def test_evaluator_min_overlap_affects_all_strategies():
    """Test that minimum overlap percentage affects all evaluation strategies."""
    true_entities = [[{"label": "PER", "start": 0, "end": 9}]]  # 10 tokens

    pred_entities = [[{"label": "PER", "start": 0, "end": 2}]]  # 30% overlap

    evaluator = Evaluator(
        true=true_entities, pred=pred_entities, tags=["PER"], loader="dict", min_overlap_percentage=50.0
    )

    results = evaluator.evaluate()

    # All strategies should respect the 50% threshold
    # 30% overlap < 50% threshold, so should be spurious for all strategies

    # Partial strategy
    partial_result = results["overall"]["partial"]
    assert partial_result.spurious == 1
    assert partial_result.correct == 0
    assert partial_result.partial == 0

    # Strict strategy
    strict_result = results["overall"]["strict"]
    assert strict_result.spurious == 1
    assert strict_result.correct == 0
    assert strict_result.incorrect == 0

    # Entity type strategy
    ent_type_result = results["overall"]["ent_type"]
    assert ent_type_result.spurious == 1
    assert ent_type_result.correct == 0
    assert ent_type_result.incorrect == 0

    # Exact strategy
    exact_result = results["overall"]["exact"]
    assert exact_result.spurious == 1
    assert exact_result.correct == 0
    assert exact_result.incorrect == 0


def test_evaluator_min_overlap_with_different_thresholds():
    """Test Evaluator with different overlap thresholds."""
    true_entities = [[{"label": "PER", "start": 0, "end": 9}]]  # 10 tokens

    # Test cases with different predicted entities
    test_cases = [
        # (pred_entities, threshold, expected_result_type)
        ([{"label": "PER", "start": 0, "end": 4}], 50.0, "partial"),  # 50% overlap = 50%
        ([{"label": "PER", "start": 0, "end": 4}], 51.0, "spurious"),  # 50% overlap < 51%
        ([{"label": "PER", "start": 0, "end": 6}], 75.0, "spurious"),  # 70% overlap < 75%
        ([{"label": "PER", "start": 0, "end": 7}], 75.0, "partial"),  # 80% overlap > 75%
        ([{"label": "PER", "start": 0, "end": 9}], 100.0, "correct"),  # 100% overlap = exact match
    ]

    for pred_data, threshold, expected_type in test_cases:
        pred_entities = [pred_data]

        evaluator = Evaluator(
            true=true_entities, pred=pred_entities, tags=["PER"], loader="dict", min_overlap_percentage=threshold
        )

        results = evaluator.evaluate()
        partial_results = results["overall"]["partial"]

        if expected_type == "correct":
            assert partial_results.correct == 1, f"Failed for {pred_data} with threshold {threshold}%"
            assert partial_results.partial == 0
            assert partial_results.spurious == 0
        elif expected_type == "partial":
            assert partial_results.partial == 1, f"Failed for {pred_data} with threshold {threshold}%"
            assert partial_results.correct == 0
            assert partial_results.spurious == 0
        elif expected_type == "spurious":
            assert partial_results.spurious == 1, f"Failed for {pred_data} with threshold {threshold}%"
            assert partial_results.correct == 0
            assert partial_results.partial == 0


def test_evaluator_min_overlap_with_multiple_entities():
    """Test Evaluator with multiple entities and minimum overlap threshold."""
    true_entities = [
        [
            {"label": "PER", "start": 0, "end": 4},  # 5 tokens
            {"label": "ORG", "start": 10, "end": 14},  # 5 tokens
            {"label": "LOC", "start": 20, "end": 24},  # 5 tokens
        ]
    ]

    pred_entities = [
        [
            {"label": "PER", "start": 0, "end": 1},  # 40% overlap (2/5 tokens)
            {"label": "ORG", "start": 10, "end": 12},  # 60% overlap (3/5 tokens)
            {"label": "LOC", "start": 20, "end": 24},  # 100% overlap (exact match)
            {"label": "MISC", "start": 30, "end": 32},  # No overlap (spurious)
        ]
    ]

    # Test with 50% threshold
    evaluator = Evaluator(
        true=true_entities,
        pred=pred_entities,
        tags=["PER", "ORG", "LOC", "MISC"],
        loader="dict",
        min_overlap_percentage=50.0,
    )

    results = evaluator.evaluate()
    partial_results = results["overall"]["partial"]

    assert partial_results.correct == 1  # LOC exact match
    assert partial_results.partial == 1  # ORG 60% overlap > 50%
    assert partial_results.spurious == 2  # PER 40% < 50% and MISC no overlap
    assert partial_results.missed == 1  # PER entity not sufficiently matched


def test_evaluator_min_overlap_backward_compatibility():
    """Test that the new feature maintains backward compatibility."""
    true_entities = [[{"label": "PER", "start": 0, "end": 9}]]

    pred_entities = [[{"label": "PER", "start": 9, "end": 9}]]  # 10% overlap (1 token out of 10)

    # Without specifying min_overlap_percentage (should default to 1.0)
    evaluator_default = Evaluator(true=true_entities, pred=pred_entities, tags=["PER"], loader="dict")

    # With explicitly setting to 1.0
    evaluator_explicit = Evaluator(
        true=true_entities, pred=pred_entities, tags=["PER"], loader="dict", min_overlap_percentage=1.0
    )

    results_default = evaluator_default.evaluate()
    results_explicit = evaluator_explicit.evaluate()

    # Results should be identical
    for strategy in ["strict", "partial", "ent_type", "exact"]:
        default_result = results_default["overall"][strategy]
        explicit_result = results_explicit["overall"][strategy]

        assert default_result.correct == explicit_result.correct
        assert default_result.partial == explicit_result.partial
        assert default_result.spurious == explicit_result.spurious
        assert default_result.missed == explicit_result.missed


================================================
FILE: tests/test_loaders.py
================================================
import pytest

from nervaluate.loaders import ConllLoader, ListLoader, DictLoader


def test_conll_loader():
    """Test CoNLL format loader."""
    true_conll = (
        "word\tO\nword\tO\nword\tO\nword\tO\nword\tO\nword\tO\n\n"
        "word\tO\nword\tO\nword\tB-ORG\nword\tI-ORG\nword\tO\nword\tO\n\n"
        "word\tO\nword\tO\nword\tB-MISC\nword\tI-MISC\nword\tO\nword\tO\n\n"
        "word\tB-MISC\nword\tI-MISC\nword\tI-MISC\nword\tI-MISC\nword\tI-MISC\nword\tI-MISC\n"
    )

    pred_conll = (
        "word\tO\nword\tO\nword\tB-PER\nword\tI-PER\nword\tO\nword\tO\n\n"
        "word\tO\nword\tO\nword\tB-ORG\nword\tI-ORG\nword\tO\nword\tO\n\n"
        "word\tO\nword\tO\nword\tB-MISC\nword\tI-MISC\nword\tO\nword\tO\n\n"
        "word\tB-MISC\nword\tI-MISC\nword\tI-MISC\nword\tI-MISC\nword\tI-MISC\nword\tI-MISC\n"
    )
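
    # Format note: each line is "token<TAB>BIO-tag" and a blank line separates
    # documents, so each string above encodes four six-token documents.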

    loader = ConllLoader()
    true_entities = loader.load(true_conll)
    pred_entities = loader.load(pred_conll)

    # Test true entities
    assert len(true_entities) == 4  # Four documents
    assert len(true_entities[0]) == 0  # First document has no entities (all O tags)
    assert len(true_entities[1]) == 1  # Second document has 1 entity (ORG)
    assert len(true_entities[2]) == 1  # Third document has 1 entity (MISC)
    assert len(true_entities[3]) == 1  # Fourth document has 1 entity (MISC)

    # Check first entity in second document
    assert true_entities[1][0].label == "ORG"
    assert true_entities[1][0].start == 2
    assert true_entities[1][0].end == 3

    # Test pred entities
    assert len(pred_entities) == 4  # Four documents
    assert len(pred_entities[0]) == 1  # First document has 1 entity (PER)
    assert len(pred_entities[1]) == 1  # Second document has 1 entity (ORG)
    assert len(pred_entities[2]) == 1  # Third document has 1 entity (MISC)
    assert len(pred_entities[3]) == 1  # Fourth document has 1 entity (MISC)

    # Check first entity in first document
    assert pred_entities[0][0].label == "PER"
    assert pred_entities[0][0].start == 2
    assert pred_entities[0][0].end == 3

    # Test empty document handling
    empty_doc = "word\tO\nword\tO\nword\tO\n\n"
    empty_entities = loader.load(empty_doc)
    assert len(empty_entities) == 1  # One document
    assert len(empty_entities[0]) == 0  # Empty list for document with only O tags


def test_list_loader():
    """Test list format loader."""
    true_list = [
        ["O", "O", "O", "O", "O", "O"],
        ["O", "O", "B-ORG", "I-ORG", "O", "O"],
        ["O", "O", "B-MISC", "I-MISC", "O", "O"],
        ["B-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC"],
    ]

    pred_list = [
        ["O", "O", "B-PER", "I-PER", "O", "O"],
        ["O", "O", "B-ORG", "I-ORG", "O", "O"],
        ["O", "O", "B-MISC", "I-MISC", "O", "O"],
        ["B-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC", "I-MISC"],
    ]

    loader = ListLoader()
    true_entities = loader.load(true_list)
    pred_entities = loader.load(pred_list)

    # Test true entities
    assert len(true_entities) == 4  # Four documents
    assert len(true_entities[0]) == 0  # First document has no entities (all O tags)
    assert len(true_entities[1]) == 1  # Second document has 1 entity (ORG)
    assert len(true_entities[2]) == 1  # Third document has 1 entity (MISC)
    assert len(true_entities[3]) == 1  # Fourth document has 1 entity (MISC)

    # Check no entities in the first document
    assert len(true_entities[0]) == 0

    # Check first entity in second document
    assert true_entities[1][0].label == "ORG"
    assert true_entities[1][0].start == 2
    assert true_entities[1][0].end == 3

    # Check only entity in the last document
    assert true_entities[3][0].label == "MISC"
    assert true_entities[3][0].start == 0
    assert true_entities[3][0].end == 5

    # Test pred entities
    assert len(pred_entities) == 4  # Four documents
    assert len(pred_entities[0]) == 1  # First document has 1 entity (PER)
    assert len(pred_entities[1]) == 1  # Second document has 1 entity (ORG)
    assert len(pred_entities[2]) == 1  # Third document has 1 entity (MISC)
    assert len(pred_entities[3]) == 1  # Fourth document has 1 entity (MISC)

    # Check first entity in first document
    assert pred_entities[0][0].label == "PER"
    assert pred_entities[0][0].start == 2
    assert pred_entities[0][0].end == 3

    # Test empty document handling
    empty_doc = [["O", "O", "O"]]
    empty_entities = loader.load(empty_doc)
    assert len(empty_entities) == 1  # One document
    assert len(empty_entities[0]) == 0  # Empty list for document with only O tags


def test_dict_loader():
    """Test dictionary format loader."""
    true_prod = [
        [],
        [{"label": "ORG", "start": 2, "end": 3}],
        [{"label": "MISC", "start": 2, "end": 3}],
        [{"label": "MISC", "start": 0, "end": 5}],
    ]

    pred_prod = [
        [{"label": "PER", "start": 2, "end": 3}],
        [{"label": "ORG", "start": 2, "end": 3}],
        [{"label": "MISC", "start": 2, "end": 3}],
        [{"label": "MISC", "start": 0, "end": 5}],
    ]

    loader = DictLoader()
    true_entities = loader.load(true_prod)
    pred_entities = loader.load(pred_prod)

    # Test true entities
    assert len(true_entities) == 4  # Four documents
    assert len(true_entities[0]) == 0  # First document has no entities
    assert len(true_entities[1]) == 1  # Second document has 1 entity (ORG)
    assert len(true_entities[2]) == 1  # Third document has 1 entity (MISC)
    assert len(true_entities[3]) == 1  # Fourth document has 1 entity (MISC)

    # Check first entity in second document
    assert true_entities[1][0].label == "ORG"
    assert true_entities[1][0].start == 2
    assert true_entities[1][0].end == 3

    # Check only entity in the last document
    assert true_entities[3][0].label == "MISC"
    assert true_entities[3][0].start == 0
    assert true_entities[3][0].end == 5

    # Test pred entities
    assert len(pred_entities) == 4  # Four documents
    assert len(pred_entities[0]) == 1  # First document has 1 entity (PER)
    assert len(pred_entities[1]) == 1  # Second document has 1 entity (ORG)
    assert len(pred_entities[2]) == 1  # Third document has 1 entity (MISC)
    assert len(pred_entities[3]) == 1  # Fourth document has 1 entity (MISC)

    # Check first entity in first document
    assert pred_entities[0][0].label == "PER"
    assert pred_entities[0][0].start == 2
    assert pred_entities[0][0].end == 3

    # Test empty document handling
    empty_doc = [[]]
    empty_entities = loader.load(empty_doc)
    assert len(empty_entities) == 1  # One document
    assert len(empty_entities[0]) == 0  # Empty list for empty document


def test_loader_with_empty_input():
    """Test loaders with empty input."""
    # Test ConllLoader with empty string
    conll_loader = ConllLoader()
    entities = conll_loader.load("")
    assert len(entities) == 0

    # Test ListLoader with empty list
    list_loader = ListLoader()
    entities = list_loader.load([])
    assert len(entities) == 0

    # Test DictLoader with empty list
    dict_loader = DictLoader()
    entities = dict_loader.load([])
    assert len(entities) == 0


def test_loader_with_invalid_data():
    """Test loaders with invalid data."""
    with pytest.raises(Exception):
        ConllLoader().load("invalid\tdata")

    with pytest.raises(Exception):
        ListLoader().load([["invalid"]])

    with pytest.raises(Exception):
        DictLoader().load([[{"invalid": "data"}]])


================================================
FILE: tests/test_strategies.py
================================================
from copy import deepcopy
import pytest
from nervaluate.entities import Entity
from nervaluate.strategies import EntityTypeEvaluation, ExactEvaluation, PartialEvaluation, StrictEvaluation


def create_entities_from_bio(bio_tags):
    """Helper function to create entities from BIO tags."""
    entities = []
    current_entity = None

    for i, tag in enumerate(bio_tags):
        if tag == "O":
            continue

        if tag.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = Entity(tag[2:], i, i + 1)
        elif tag.startswith("I-"):
            if current_entity:
                current_entity.end = i + 1
            else:
                # Handle case where I- tag appears without B-
                current_entity = Entity(tag[2:], i, i + 1)

    if current_entity:
        entities.append(current_entity)

    return entities
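
# Example of what the helper produces (derived from the code above): for
# ["O", "B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG"] it returns
# [Entity("PER", 1, 3), Entity("ORG", 6, 8)], i.e. token spans whose end
# index is exclusive (i + 1).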


@pytest.fixture
def base_sequence():
    """Base sequence: 'The John Smith who works at Google Inc'"""
    return ["O", "B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG"]


@pytest.fixture
def base_sequence_nested():
    """
    Base sequence: 'The Treaty of Westphalia negotiations concluded in 1648.'

    first_level_entity: Treaty of Westphalia negotiations
    second_level_entity: Treaty of Westphalia
    third_level_entity: Westphalia
    other_entity: 1648
    """
    first_level_entity = Entity("EVENT", 4, 37)
    second_level_entity = Entity("EVENT", 4, 24)
    third_level_entity = Entity("LOCATION", 14, 24)
    other_entity = Entity("DATE", 51, 55)
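
    # The offsets appear to be character positions with an exclusive end:
    # "Treaty of Westphalia" occupies characters 4-23 of the sentence above,
    # hence Entity("EVENT", 4, 24), and "1648" occupies 51-54, hence
    # Entity("DATE", 51, 55).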

    return [first_level_entity, second_level_entity, third_level_entity, other_entity]


class TestStrictEvaluation:
    """Test cases for strict evaluation strategy."""

    def test_perfect_match(self, base_sequence):
        """Test case: Perfect match of all entities."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(base_sequence)

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested(self, base_sequence_nested):
        """Test case: Perfect match of all entities with nested entities."""
        evaluator = StrictEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
        """Test case: Perfect match of all entities in reverse order, with nested entities."""
        evaluator = StrictEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[::-1]
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_missed_entity(self, base_sequence):
        """Test case: One entity is missed in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "O"])

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 1)]
        assert result_indices.spurious_indices == []

    def test_missed_entity_nested(self, base_sequence_nested):
        """Test case: First level entity is missed in prediction."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[1:]

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 0)]
        assert result_indices.spurious_indices == []

    def test_wrong_label(self, base_sequence):
        """Test case: Entity with wrong label."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC"])

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong label."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].label = "DATE"

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary(self, base_sequence):
        """Test case: Entity with wrong boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"])

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].end = 30

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity_nested(self, base_sequence_nested):
        """Test case: Extra (spurious) entity in prediction with nested entities (Scenario II)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested) + [Entity("MISC", 60, 65)]

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE", "MISC"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 4)]

    def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary and wrong label (Scenario VI)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1] = Entity("DATE", 4, 30)

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_shifted_boundary(self, base_sequence):
        """Test case: Entity with shifted boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC"])

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity(self, base_sequence):
        """Test case: Extra entity in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "B-PER", "O", "B-LOC", "I-LOC"])

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 2)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 1)]


class TestEntityTypeEvaluation:
    """Test cases for entity type evaluation strategy."""

    def test_perfect_match(self, base_sequence):
        """Test case: Perfect match of all entities."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(base_sequence)

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested(self, base_sequence_nested):
        """Test case: Perfect match of all entities with nested entities."""
        evaluator = EntityTypeEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
        """Test case: Perfect match of all entities in reverse order, with nested entities."""
        evaluator = EntityTypeEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[::-1]
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_missed_entity(self, base_sequence):
        """Test case: One entity is missed in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "O"])

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 1)]
        assert result_indices.spurious_indices == []

    def test_missed_entity_nested(self, base_sequence_nested):
        """Test case: First level entity is missed in prediction."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[1:]

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 0)]
        assert result_indices.spurious_indices == []

    def test_wrong_label(self, base_sequence):
        """Test case: Entity with wrong label."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC"])

        evaluator = EntityTypeEvaluation()
        result, _ = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0

    def test_wrong_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong label."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].label = "DATE"

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary(self, base_sequence):
        """Test case: Entity with wrong boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"])

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].end = 30

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity_nested(self, base_sequence_nested):
        """Test case: Extra (spurious) entity in prediction with nested entities (Scenario II)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested) + [Entity("MISC", 60, 65)]

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE", "MISC"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 4)]

    def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary and wrong label (Scenario VI)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1] = Entity("DATE", 4, 30)

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_shifted_boundary(self, base_sequence):
        """Test case: Entity with shifted boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC"])

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity(self, base_sequence):
        """Test case: Extra entity in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "B-PER", "O", "B-LOC", "I-LOC"])

        evaluator = EntityTypeEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 2)]
        assert result_indices.spurious_indices == [(0, 1)]
        assert result_indices.missed_indices == []
        assert result_indices.partial_indices == []


class TestExactEvaluation:
    """Test cases for exact evaluation strategy."""

    def test_perfect_match(self, base_sequence):
        """Test case: Perfect match of all entities."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(base_sequence)

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested(self, base_sequence_nested):
        """Test case: Perfect match of all entities with nested entities."""
        evaluator = ExactEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
        """Test case: Perfect match of all entities in reverse order, with nested entities."""
        evaluator = ExactEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[::-1]
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_missed_entity(self, base_sequence):
        """Test case: One entity is missed in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "O"])

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 1)]
        assert result_indices.spurious_indices == []

    def test_missed_entity_nested(self, base_sequence_nested):
        """Test case: First level entity is missed in prediction."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[1:]

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 0)]
        assert result_indices.spurious_indices == []

    def test_wrong_label(self, base_sequence):
        """Test case: Entity with wrong label."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC"])

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong label."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].label = "DATE"

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary(self, base_sequence):
        """Test case: Entity with wrong boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"])

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].end = 30

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity_nested(self, base_sequence_nested):
        """Test case: Extra (spurious) entity in prediction with nested entities (Scenario II)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested) + [Entity("MISC", 60, 65)]

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE", "MISC"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 4)]

    def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary and wrong label (Scenario VI)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1] = Entity("DATE", 4, 30)

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_shifted_boundary(self, base_sequence):
        """Test case: Entity with shifted boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC"])

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == [(0, 1)]
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity(self, base_sequence):
        """Test case: Extra entity in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "B-PER", "O", "B-LOC", "I-LOC"])

        evaluator = ExactEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0), (0, 2)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 1)]


class TestPartialEvaluation:
    """Test cases for partial evaluation strategy."""

    def test_perfect_match(self, base_sequence):
        """Test case: Perfect match of all entities."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(base_sequence)

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested(self, base_sequence_nested):
        """Test case: Perfect match of all entities with nested entities."""
        evaluator = PartialEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
        """Test case: Perfect match of all entities in reverse order, with nested entities."""
        evaluator = PartialEvaluation()
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[::-1]
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_missed_entity(self, base_sequence):
        """Test case: One entity is missed in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "O"])

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 1)]
        assert result_indices.spurious_indices == []

    def test_missed_entity_nested(self, base_sequence_nested):
        """Test case: First level entity is missed in prediction."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)[1:]

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == [(0, 0)]
        assert result_indices.spurious_indices == []

    def test_wrong_label(self, base_sequence):
        """Test case: Entity with wrong label."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC"])

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong label."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].label = "DATE"

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary(self, base_sequence):
        """Test case: Entity with wrong boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"])

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 1
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == [(0, 1)]
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_wrong_boundary_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1].end = 30

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 0
        assert result.partial == 1
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == [(0, 1)]
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity_nested(self, base_sequence_nested):
        """Test case: Extra (spurious) entity in prediction with nested entities (Scenario II)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested) + [Entity("MISC", 60, 65)]

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE", "MISC"])

        assert result.correct == 4
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0), (0, 1), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 4)]

    def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
        """Test case: Nested entity with wrong boundary and wrong label (Scenario VI)."""
        true = base_sequence_nested
        pred = deepcopy(base_sequence_nested)
        pred[1] = Entity("DATE", 4, 30)

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["EVENT", "LOCATION", "DATE"])

        assert result.correct == 3
        assert result.incorrect == 0
        assert result.partial == 1
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 2), (0, 3)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == [(0, 1)]
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_shifted_boundary(self, base_sequence):
        """Test case: Entity with shifted boundary."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC"])

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 1
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == [(0, 1)]
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == []

    def test_extra_entity(self, base_sequence):
        """Test case: Extra entity in prediction."""
        true = create_entities_from_bio(base_sequence)
        pred = create_entities_from_bio(["O", "B-PER", "I-PER", "O", "B-PER", "O", "B-LOC", "I-LOC"])

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG", "LOC"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 1
        assert result_indices.correct_indices == [(0, 0), (0, 2)]
        assert result_indices.incorrect_indices == []
        assert result_indices.partial_indices == []
        assert result_indices.missed_indices == []
        assert result_indices.spurious_indices == [(0, 1)]


class TestSingleCharacterEntities:
    """Test cases for single-character entities to ensure proper range handling."""

    def test_single_token_entities_strict(self):
        """Test case: Single token entities using strict evaluation."""
        # Create entities representing single characters/tokens
        # Entity at position 1 with start=1, end=2 (standard representation)
        true = [Entity("PER", 1, 2), Entity("ORG", 4, 5)]
        pred = [Entity("PER", 1, 2), Entity("ORG", 4, 5)]

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]

    def test_single_token_entities_same_start_end(self):
        """Test case: Single token entities where start==end (edge case)."""
        # Edge case: entities where start and end refer to the same token index
        true = [Entity("PER", 1, 1), Entity("ORG", 4, 4)]
        pred = [Entity("PER", 1, 1), Entity("ORG", 4, 4)]

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]

    def test_single_token_entities_partial_evaluation(self):
        """Test case: Single token entities with partial evaluation."""
        true = [Entity("PER", 1, 1), Entity("ORG", 4, 4)]
        pred = [Entity("PER", 1, 1), Entity("ORG", 4, 4)]

        evaluator = PartialEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]

    def test_single_token_entities_overlap_detection(self):
        """Test case: Single token entities with overlapping positions."""
        # Test overlap detection for single character entities
        true = [Entity("PER", 1, 1)]  # Single token at position 1
        pred = [Entity("ORG", 1, 1)]  # Different label, same position

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        # Should be marked as incorrect due to label mismatch but position overlap
        assert result.correct == 0
        assert result.incorrect == 1
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.incorrect_indices == [(0, 0)]

    def test_single_token_adjacent_entities(self):
        """Test case: Adjacent single token entities."""
        # Test entities at adjacent positions
        true = [Entity("PER", 1, 1), Entity("ORG", 2, 2)]
        pred = [Entity("PER", 1, 1), Entity("ORG", 2, 2)]

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 2
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 0
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0), (0, 1)]

    def test_single_token_missed_entity(self):
        """Test case: Single token entity that is missed."""
        true = [Entity("PER", 1, 1), Entity("ORG", 4, 4)]
        pred = [Entity("PER", 1, 1)]  # Missing the ORG entity

        evaluator = StrictEvaluation()
        result, result_indices = evaluator.evaluate(true, pred, ["PER", "ORG"])

        assert result.correct == 1
        assert result.incorrect == 0
        assert result.partial == 0
        assert result.missed == 1
        assert result.spurious == 0
        assert result_indices.correct_indices == [(0, 0)]
        assert result_indices.missed_indices == [(0, 1)]


def test_minimum_overlap_percentage_validation():
    """Test that minimum overlap percentage validation works correctly."""

    # Valid values should work
    PartialEvaluation(min_overlap_percentage=1.0)
    PartialEvaluation(min_overlap_percentage=50.0)
    PartialEvaluation(min_overlap_percentage=100.0)

    # Invalid values should raise ValueError
    with pytest.raises(ValueError, match="min_overlap_percentage must be between 1.0 and 100.0"):
        PartialEvaluation(min_overlap_percentage=0.5)

    with pytest.raises(ValueError, match="min_overlap_percentage must be between 1.0 and 100.0"):
        PartialEvaluation(min_overlap_percentage=101.0)

    with pytest.raises(ValueError, match="min_overlap_percentage must be between 1.0 and 100.0"):
        PartialEvaluation(min_overlap_percentage=-5.0)


def test_overlap_percentage_calculation():
    """Test the overlap percentage calculation method."""
    strategy = PartialEvaluation(min_overlap_percentage=50.0)

    true_entity = Entity(label="PER", start=0, end=9)  # 10 tokens (0-9 inclusive)

    test_cases = [
        # (pred_entity, expected_percentage)
        (Entity(label="PER", start=0, end=9), 100.0),  # Complete overlap
        (Entity(label="PER", start=0, end=4), 50.0),  # Half overlap from start
        (Entity(label="PER", start=5, end=9), 50.0),  # Half overlap from end
        (Entity(label="PER", start=0, end=0), 10.0),  # Single token overlap at start
        (Entity(label="PER", start=9, end=9), 10.0),  # Single token overlap at end
        (Entity(label="PER", start=10, end=15), 0.0),  # No overlap (adjacent)
        (Entity(label="PER", start=-5, end=2), 30.0),  # Partial overlap from left (3 tokens: 0,1,2)
        (Entity(label="PER", start=7, end=12), 30.0),  # Partial overlap from right (3 tokens: 7,8,9)
        (Entity(label="PER", start=2, end=7), 60.0),  # Middle overlap (6 tokens: 2,3,4,5,6,7)
    ]

    for pred_entity, expected_percentage in test_cases:
        calculated = strategy._calculate_overlap_percentage(pred_entity, true_entity)
        assert (
            abs(calculated - expected_percentage) < 0.1
        ), f"Expected {expected_percentage}%, got {calculated}% for pred={pred_entity} vs true={true_entity}"


def test_has_sufficient_overlap():
    """Test the has_sufficient_overlap method with different thresholds."""

    true_entity = Entity(label="PER", start=0, end=9)  # 10 tokens

    # Test with 50% threshold
    strategy_50 = PartialEvaluation(min_overlap_percentage=50.0)

    # Should pass: 50% or more overlap
    assert strategy_50._has_sufficient_overlap(Entity(label="PER", start=0, end=4), true_entity)  # 50%
    assert strategy_50._has_sufficient_overlap(Entity(label="PER", start=0, end=6), true_entity)  # 70%
    assert strategy_50._has_sufficient_overlap(Entity(label="PER", start=0, end=9), true_entity)  # 100%

    # Should fail: less than 50% overlap
    assert not strategy_50._has_sufficient_overlap(Entity(label="PER", start=0, end=2), true_entity)  # 30%
    assert not strategy_50._has_sufficient_overlap(Entity(label="PER", start=0, end=0), true_entity)  # 10%
    assert not strategy_50._has_sufficient_overlap(Entity(label="PER", start=10, end=15), true_entity)  # 0%

    # Test with 75% threshold
    strategy_75 = PartialEvaluation(min_overlap_percentage=75.0)

    # Should pass: 75% or more overlap
    assert strategy_75._has_sufficient_overlap(Entity(label="PER", start=0, end=7), true_entity)  # 80%
    assert strategy_75._has_sufficient_overlap(Entity(label="PER", start=0, end=9), true_entity)  # 100%

    # Should fail: less than 75% overlap
    assert not strategy_75._has_sufficient_overlap(Entity(label="PER", start=0, end=6), true_entity)  # 70%
    assert not strategy_75._has_sufficient_overlap(Entity(label="PER", start=0, end=4), true_entity)  # 50%


def test_partial_evaluation_with_min_overlap():
    """Test PartialEvaluation strategy with different minimum overlap thresholds."""

    true_entities = [Entity(label="PER", start=0, end=9)]  # 10 tokens

    test_cases = [
        # (pred_entity, min_overlap_threshold, expected_correct, expected_partial, expected_spurious)
        (Entity(label="PER", start=0, end=4), 50.0, 0, 1, 0),  # 50% overlap -> partial
        (Entity(label="PER", start=0, end=2), 50.0, 0, 0, 1),  # 30% overlap < 50% -> spurious
        (Entity(label="PER", start=0, end=9), 50.0, 1, 0, 0),  # 100% overlap exact match -> correct
        (Entity(label="PER", start=0, end=6), 75.0, 0, 0, 1),  # 70% overlap < 75% -> spurious
        (Entity(label="PER", start=0, end=7), 75.0, 0, 1, 0),  # 80% overlap > 75% -> partial
    ]

    for pred_entity, threshold, expected_correct, expected_partial, expected_spurious in test_cases:
        pred_entities = [pred_entity]
        strategy = PartialEvaluation(min_overlap_percentage=threshold)
        result, _ = strategy.evaluate(true_entities, pred_entities, ["PER"], 0)

        assert (
            result.correct == expected_correct
        ), f"Expected {expected_correct} correct, got {result.correct} for {pred_entity} with threshold {threshold}%"
        assert (
            result.partial == expected_partial
        ), f"Expected {expected_partial} partial, got {result.partial} for {pred_entity} with threshold {threshold}%"
        assert (
            result.spurious == expected_spurious
        ), f"Expected {expected_spurious} spurious, got {result.spurious} for {pred_entity} with threshold {threshold}%"


def test_strict_evaluation_with_min_overlap():
    """Test StrictEvaluation strategy with minimum overlap threshold."""

    true_entities = [Entity(label="PER", start=0, end=9)]

    # Test case where pred has insufficient overlap -> should be spurious
    pred_entities = [Entity(label="PER", start=0, end=2)]  # 30% overlap
    strategy = StrictEvaluation(min_overlap_percentage=50.0)
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER"], 0)

    assert result.correct == 0
    assert result.incorrect == 0
    assert result.spurious == 1  # Insufficient overlap -> spurious
    assert result.missed == 1  # True entity not matched

    # Test case where pred has sufficient overlap but wrong label -> should be incorrect
    pred_entities = [Entity(label="ORG", start=0, end=6)]  # 70% overlap, wrong label
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER", "ORG"], 0)

    assert result.correct == 0
    assert result.incorrect == 1  # Sufficient overlap but wrong label
    assert result.spurious == 0
    assert result.missed == 0


def test_entity_type_evaluation_with_min_overlap():
    """Test EntityTypeEvaluation strategy with minimum overlap threshold."""

    true_entities = [Entity(label="PER", start=0, end=9)]

    # Test case: sufficient overlap with correct label -> correct
    pred_entities = [Entity(label="PER", start=0, end=6)]  # 70% overlap, correct label
    strategy = EntityTypeEvaluation(min_overlap_percentage=50.0)
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER"], 0)

    assert result.correct == 1
    assert result.incorrect == 0
    assert result.spurious == 0
    assert result.missed == 0

    # Test case: sufficient overlap with wrong label -> incorrect
    pred_entities = [Entity(label="ORG", start=0, end=6)]  # 70% overlap, wrong label
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER", "ORG"], 0)

    assert result.correct == 0
    assert result.incorrect == 1
    assert result.spurious == 0
    assert result.missed == 0

    # Test case: insufficient overlap -> spurious
    pred_entities = [Entity(label="PER", start=0, end=2)]  # 30% overlap < 50%
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER"], 0)

    assert result.correct == 0
    assert result.incorrect == 0
    assert result.spurious == 1
    assert result.missed == 1


def test_exact_evaluation_with_min_overlap():
    """Test ExactEvaluation strategy with minimum overlap threshold."""

    true_entities = [Entity(label="PER", start=0, end=9)]

    # Test case: exact boundaries (different label) -> correct
    pred_entities = [Entity(label="ORG", start=0, end=9)]  # Exact match, different label
    strategy = ExactEvaluation(min_overlap_percentage=50.0)
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER", "ORG"], 0)

    assert result.correct == 1
    assert result.incorrect == 0
    assert result.spurious == 0
    assert result.missed == 0

    # Test case: sufficient overlap but not exact -> incorrect
    pred_entities = [Entity(label="ORG", start=0, end=6)]  # 70% overlap, not exact
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER", "ORG"], 0)

    assert result.correct == 0
    assert result.incorrect == 1
    assert result.spurious == 0
    assert result.missed == 0

    # Test case: insufficient overlap -> spurious
    pred_entities = [Entity(label="ORG", start=0, end=2)]  # 30% overlap < 50%
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER", "ORG"], 0)

    assert result.correct == 0
    assert result.incorrect == 0
    assert result.spurious == 1
    assert result.missed == 1
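
# --- Illustrative summary (not part of the repository) -----------------------
# Read together, the *_with_min_overlap tests above suggest how each strategy
# classifies a prediction that overlaps a true entity (an assumed reading of
# the expected counts, not a statement of the implementation):
#   StrictEvaluation     - correct only when boundaries and label both match;
#                          sufficient overlap with the wrong label -> incorrect.
#   ExactEvaluation      - correct when boundaries match exactly (label ignored);
#                          sufficient but inexact overlap -> incorrect.
#   EntityTypeEvaluation - correct when the label matches and the overlap is
#                          sufficient; sufficient overlap, wrong label -> incorrect.
#   PartialEvaluation    - exact boundaries -> correct; sufficient overlap -> partial.
# In all four strategies, overlap below min_overlap_percentage leaves the
# prediction spurious and the unmatched true entity missed.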


def test_edge_cases_overlap_calculation():
    """Test edge cases for overlap calculation."""

    strategy = PartialEvaluation(min_overlap_percentage=100.0)

    # Test single-token entities
    true_single = Entity(label="ORG", start=5, end=5)  # Single token
    pred_single = Entity(label="ORG", start=5, end=5)  # Exact match

    overlap = strategy._calculate_overlap_percentage(pred_single, true_single)
    assert overlap == 100.0, "Single token exact match should be 100%"

    # Test adjacent but non-overlapping entities
    pred_adjacent = Entity(label="ORG", start=6, end=6)  # Adjacent token
    overlap = strategy._calculate_overlap_percentage(pred_adjacent, true_single)
    assert overlap == 0.0, "Adjacent non-overlapping should be 0%"

    # Test overlapping single-token entities
    pred_overlap = Entity(label="ORG", start=4, end=6)  # Overlaps with true_single at position 5
    overlap = strategy._calculate_overlap_percentage(pred_overlap, true_single)
    assert overlap == 100.0, "Single token overlap should be 100% of true entity"


def test_multiple_entities_with_min_overlap():
    """Test evaluation with multiple entities and minimum overlap."""

    true_entities = [Entity(label="PER", start=0, end=4), Entity(label="ORG", start=10, end=14)]  # 5 tokens  # 5 tokens

    pred_entities = [
        Entity(label="PER", start=0, end=1),  # 40% overlap with first entity
        Entity(label="ORG", start=10, end=12),  # 60% overlap with second entity
        Entity(label="LOC", start=20, end=22),  # No overlap (spurious)
    ]

    # With 50% threshold
    strategy = PartialEvaluation(min_overlap_percentage=50.0)
    result, _ = strategy.evaluate(true_entities, pred_entities, ["PER", "ORG", "LOC"], 0)

    assert result.correct == 0
    assert result.partial == 1  # Only the ORG entity has sufficient overlap (60% > 50%)
    assert result.spurious == 2  # PER entity (40% < 50%) and LOC entity (no overlap)
    assert result.missed == 1  # First true entity (PER) not sufficiently matched


================================================
FILE: tests/test_utils.py
================================================
from nervaluate import (
    collect_named_entities,
    conll_to_spans,
    list_to_spans,
    split_list,
)


def test_list_to_spans():
    before = [
        ["O", "B-LOC", "I-LOC", "B-LOC", "I-LOC", "O"],
        ["O", "B-GPE", "I-GPE", "B-GPE", "I-GPE", "O"],
    ]

    expected = [
        [
            {"label": "LOC", "start": 1, "end": 2},
            {"label": "LOC", "start": 3, "end": 4},
        ],
        [
            {"label": "GPE", "start": 1, "end": 2},
            {"label": "GPE", "start": 3, "end": 4},
        ],
    ]

    result = list_to_spans(before)

    assert result == expected


def test_list_to_spans_1():
    before = [
        ["O", "O", "O", "O", "O", "O"],
        ["O", "O", "B-ORG", "I-ORG", "O", "O"],
        ["O", "O", "B-MISC", "I-MISC", "O", "O"],
    ]

    expected = [
        [],
        [{"label": "ORG", "start": 2, "end": 3}],
        [{"label": "MISC", "start": 2, "end": 3}],
    ]

    actual = list_to_spans(before)

    assert actual == expected


def test_conll_to_spans():
    before = (
        ",\tO\n"
        "Davos\tB-PER\n"
        "2018\tO\n"
        ":\tO\n"
        "Soros\tB-PER\n"
        "accuses\tO\n"
        "Trump\tB-PER\n"
        "of\tO\n"
        "wanting\tO\n"
        "\n"
        "foo\tO\n"
    )

    after = [
        [
            {"label": "PER", "start": 1, "end": 1},
            {"label": "PER", "start": 4, "end": 4},
            {"label": "PER", "start": 6, "end": 6},
        ],
        [],
    ]

    out = conll_to_spans(before)

    assert after == out


def test_conll_to_spans_1():
    before = (
        "word\tO\nword\tO\nword\tO\nword\tO\nword\tO\nword\tO\n\n"
        "word\tO\nword\tO\nword\tB-ORG\nword\tI-ORG\nword\tO\nword\tO\n\n"
        "word\tO\nword\tO\nword\tB-MISC\nword\tI-MISC\nword\tO\nword\tO\n"
    )

    expected = [
        [],
        [{"label": "ORG", "start": 2, "end": 3}],
        [{"label": "MISC", "start": 2, "end": 3}],
    ]

    actual = conll_to_spans(before)

    assert actual == expected


def test_split_list():
    before = ["aa", "bb", "cc", "", "dd", "ee", "ff"]
    expected = [["aa", "bb", "cc"], ["dd", "ee", "ff"]]
    out = split_list(before)

    assert expected == out


def test_collect_named_entities_same_type_in_sequence():
    tags = ["O", "B-LOC", "I-LOC", "B-LOC", "I-LOC", "O"]
    result = collect_named_entities(tags)
    expected = [
        {"label": "LOC", "start": 1, "end": 2},
        {"label": "LOC", "start": 3, "end": 4},
    ]
    assert result == expected


def test_collect_named_entities_sequence_has_only_one_entity():
    tags = ["B-LOC", "I-LOC"]
    result = collect_named_entities(tags)
    expected = [{"label": "LOC", "start": 0, "end": 1}]
    assert result == expected

================================================
SYMBOL INDEX (157 symbols across 11 files)
================================================

FILE: examples/example_no_loader.py
  function word2features (line 8) | def word2features(sent, i):
  function sent2features (line 56) | def sent2features(sent):
  function sent2labels (line 60) | def sent2labels(sent):
  function sent2tokens (line 64) | def sent2tokens(sent):
  function main (line 68) | def main():

FILE: src/nervaluate/entities.py
  class Entity (line 6) | class Entity:
    method __eq__ (line 13) | def __eq__(self, other: object) -> bool:
    method __hash__ (line 18) | def __hash__(self) -> int:
  class EvaluationResult (line 23) | class EvaluationResult:
    method compute_metrics (line 37) | def compute_metrics(self, partial_or_type: bool = False) -> None:
  class EvaluationIndices (line 55) | class EvaluationIndices:
    method __post_init__ (line 64) | def __post_init__(self) -> None:

FILE: src/nervaluate/evaluator.py
  class Evaluator (line 17) | class Evaluator:
    method __init__ (line 20) | def __init__(
    method _setup_loaders (line 39) | def _setup_loaders(self) -> None:
    method _setup_evaluation_strategies (line 43) | def _setup_evaluation_strategies(self) -> None:
    method _load_data (line 52) | def _load_data(self, true: Any, pred: Any, loader: str) -> None:
    method evaluate (line 85) | def evaluate(self) -> Dict[str, Any]:
    method _merge_results (line 151) | def _merge_results(
    method _merge_indices (line 164) | def _merge_indices(target: EvaluationIndices, source: EvaluationIndice...
    method results_to_csv (line 172) | def results_to_csv(
    method summary_report (line 250) | def summary_report(self, mode: str = "overall", scenario: str = "stric...
    method summary_report_indices (line 341) | def summary_report_indices(  # pylint: disable=too-many-branches

FILE: src/nervaluate/loaders.py
  class DataLoader (line 7) | class DataLoader(ABC):
    method load (line 11) | def load(self, data: Any) -> List[List[Entity]]:
  class ConllLoader (line 15) | class ConllLoader(DataLoader):
    method load (line 18) | def load(self, data: str) -> List[List[Entity]]:  # pylint: disable=to...
  class ListLoader (line 91) | class ListLoader(DataLoader):
    method load (line 94) | def load(self, data: List[List[str]]) -> List[List[Entity]]:  # pylint...
  class DictLoader (line 154) | class DictLoader(DataLoader):
    method load (line 157) | def load(self, data: List[List[Dict[str, Any]]]) -> List[List[Entity]]:

FILE: src/nervaluate/strategies.py
  class EvaluationStrategy (line 7) | class EvaluationStrategy(ABC):
    method __init__ (line 10) | def __init__(self, min_overlap_percentage: float = 1.0):
    method _calculate_overlap_percentage (line 22) | def _calculate_overlap_percentage(pred: Entity, true: Entity) -> float:
    method _calculate_boundaries_distance (line 45) | def _calculate_boundaries_distance(pred: Entity, true: Entity) -> float:
    method _has_sufficient_overlap (line 58) | def _has_sufficient_overlap(self, pred: Entity, true: Entity) -> bool:
    method evaluate (line 64) | def evaluate(
  class StrictEvaluation (line 70) | class StrictEvaluation(EvaluationStrategy):
    method evaluate (line 81) | def evaluate(
  class PartialEvaluation (line 130) | class PartialEvaluation(EvaluationStrategy):
    method evaluate (line 143) | def evaluate(
  class EntityTypeEvaluation (line 189) | class EntityTypeEvaluation(EvaluationStrategy):
    method evaluate (line 210) | def evaluate(
  class ExactEvaluation (line 266) | class ExactEvaluation(EvaluationStrategy):
    method evaluate (line 277) | def evaluate(

FILE: src/nervaluate/utils.py
  function split_list (line 1) | def split_list(token: list[str], split_chars: list[str] | None = None) -...
  function conll_to_spans (line 28) | def conll_to_spans(doc: str) -> list[list[dict]]:
  function list_to_spans (line 53) | def list_to_spans(doc: list[list[str]]) -> list[list[dict]]:
  function collect_named_entities (line 66) | def collect_named_entities(tokens: list[str]) -> list[dict]:
  function find_overlap (line 110) | def find_overlap(true_range: range, pred_range: range) -> set:
  function clean_entities (line 134) | def clean_entities(ent: dict) -> dict:

FILE: tests/test_entities.py
  function test_entity_equality (line 4) | def test_entity_equality():
  function test_entity_hash (line 15) | def test_entity_hash():
  function test_evaluation_result_compute_metrics (line 25) | def test_evaluation_result_compute_metrics():
  function test_evaluation_result_zero_cases (line 40) | def test_evaluation_result_zero_cases():

FILE: tests/test_evaluator.py
  function sample_data (line 8) | def sample_data():
  function test_evaluator_initialization (line 22) | def test_evaluator_initialization(sample_data):
  function test_evaluator_evaluation (line 32) | def test_evaluator_evaluation(sample_data):
  function test_evaluator_with_invalid_tags (line 53) | def test_evaluator_with_invalid_tags(sample_data):
  function test_partial_and_ent_type_metrics_use_partial_formula_after_merge (line 67) | def test_partial_and_ent_type_metrics_use_partial_formula_after_merge():
  function test_evaluator_different_document_lengths (line 119) | def test_evaluator_different_document_lengths():
  function test_results_to_csv (line 137) | def test_results_to_csv(sample_data, tmp_path):
  function test_evaluator_with_min_overlap_percentage (line 223) | def test_evaluator_with_min_overlap_percentage():
  function test_evaluator_min_overlap_validation (line 249) | def test_evaluator_min_overlap_validation():
  function test_evaluator_min_overlap_affects_all_strategies (line 267) | def test_evaluator_min_overlap_affects_all_strategies():
  function test_evaluator_min_overlap_with_different_thresholds (line 307) | def test_evaluator_min_overlap_with_different_thresholds():
  function test_evaluator_min_overlap_with_multiple_entities (line 345) | def test_evaluator_min_overlap_with_multiple_entities():
  function test_evaluator_min_overlap_backward_compatibility (line 382) | def test_evaluator_min_overlap_backward_compatibility():

FILE: tests/test_loaders.py
  function test_conll_loader (line 6) | def test_conll_loader():
  function test_list_loader (line 57) | def test_list_loader():
  function test_dict_loader (line 116) | def test_dict_loader():
  function test_loader_with_empty_input (line 172) | def test_loader_with_empty_input():
  function test_loader_with_invalid_data (line 190) | def test_loader_with_invalid_data():

FILE: tests/test_strategies.py
  function create_entities_from_bio (line 7) | def create_entities_from_bio(bio_tags):
  function base_sequence (line 34) | def base_sequence():
  function base_sequence_nested (line 40) | def base_sequence_nested():
  class TestStrictEvaluation (line 57) | class TestStrictEvaluation:
    method test_perfect_match (line 60) | def test_perfect_match(self, base_sequence):
    method test_perfect_match_nested (line 79) | def test_perfect_match_nested(self, base_sequence_nested):
    method test_perfect_match_nested_reverse_order (line 97) | def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
    method test_missed_entity (line 115) | def test_missed_entity(self, base_sequence):
    method test_missed_entity_nested (line 134) | def test_missed_entity_nested(self, base_sequence_nested):
    method test_wrong_label (line 153) | def test_wrong_label(self, base_sequence):
    method test_wrong_label_nested (line 172) | def test_wrong_label_nested(self, base_sequence_nested):
    method test_wrong_boundary (line 192) | def test_wrong_boundary(self, base_sequence):
    method test_wrong_boundary_nested (line 211) | def test_wrong_boundary_nested(self, base_sequence_nested):
    method test_extra_entity_nested (line 231) | def test_extra_entity_nested(self, base_sequence_nested):
    method test_wrong_boundary_and_label_nested (line 250) | def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
    method test_shifted_boundary (line 270) | def test_shifted_boundary(self, base_sequence):
    method test_extra_entity (line 289) | def test_extra_entity(self, base_sequence):
  class TestEntityTypeEvaluation (line 309) | class TestEntityTypeEvaluation:
    method test_perfect_match (line 312) | def test_perfect_match(self, base_sequence):
    method test_perfect_match_nested (line 331) | def test_perfect_match_nested(self, base_sequence_nested):
    method test_perfect_match_nested_reverse_order (line 349) | def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
    method test_missed_entity (line 367) | def test_missed_entity(self, base_sequence):
    method test_missed_entity_nested (line 386) | def test_missed_entity_nested(self, base_sequence_nested):
    method test_wrong_label (line 405) | def test_wrong_label(self, base_sequence):
    method test_wrong_label_nested (line 419) | def test_wrong_label_nested(self, base_sequence_nested):
    method test_wrong_boundary (line 439) | def test_wrong_boundary(self, base_sequence):
    method test_wrong_boundary_nested (line 458) | def test_wrong_boundary_nested(self, base_sequence_nested):
    method test_extra_entity_nested (line 478) | def test_extra_entity_nested(self, base_sequence_nested):
    method test_wrong_boundary_and_label_nested (line 497) | def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
    method test_shifted_boundary (line 517) | def test_shifted_boundary(self, base_sequence):
    method test_extra_entity (line 536) | def test_extra_entity(self, base_sequence):
  class TestExactEvaluation (line 556) | class TestExactEvaluation:
    method test_perfect_match (line 559) | def test_perfect_match(self, base_sequence):
    method test_perfect_match_nested (line 578) | def test_perfect_match_nested(self, base_sequence_nested):
    method test_perfect_match_nested_reverse_order (line 596) | def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
    method test_missed_entity (line 614) | def test_missed_entity(self, base_sequence):
    method test_missed_entity_nested (line 633) | def test_missed_entity_nested(self, base_sequence_nested):
    method test_wrong_label (line 652) | def test_wrong_label(self, base_sequence):
    method test_wrong_label_nested (line 671) | def test_wrong_label_nested(self, base_sequence_nested):
    method test_wrong_boundary (line 691) | def test_wrong_boundary(self, base_sequence):
    method test_wrong_boundary_nested (line 710) | def test_wrong_boundary_nested(self, base_sequence_nested):
    method test_extra_entity_nested (line 730) | def test_extra_entity_nested(self, base_sequence_nested):
    method test_wrong_boundary_and_label_nested (line 749) | def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
    method test_shifted_boundary (line 769) | def test_shifted_boundary(self, base_sequence):
    method test_extra_entity (line 788) | def test_extra_entity(self, base_sequence):
  class TestPartialEvaluation (line 808) | class TestPartialEvaluation:
    method test_perfect_match (line 811) | def test_perfect_match(self, base_sequence):
    method test_perfect_match_nested (line 830) | def test_perfect_match_nested(self, base_sequence_nested):
    method test_perfect_match_nested_reverse_order (line 848) | def test_perfect_match_nested_reverse_order(self, base_sequence_nested):
    method test_missed_entity (line 866) | def test_missed_entity(self, base_sequence):
    method test_missed_entity_nested (line 885) | def test_missed_entity_nested(self, base_sequence_nested):
    method test_wrong_label (line 904) | def test_wrong_label(self, base_sequence):
    method test_wrong_label_nested (line 923) | def test_wrong_label_nested(self, base_sequence_nested):
    method test_wrong_boundary (line 943) | def test_wrong_boundary(self, base_sequence):
    method test_wrong_boundary_nested (line 962) | def test_wrong_boundary_nested(self, base_sequence_nested):
    method test_extra_entity_nested (line 982) | def test_extra_entity_nested(self, base_sequence_nested):
    method test_wrong_boundary_and_label_nested (line 1001) | def test_wrong_boundary_and_label_nested(self, base_sequence_nested):
    method test_shifted_boundary (line 1021) | def test_shifted_boundary(self, base_sequence):
    method test_extra_entity (line 1040) | def test_extra_entity(self, base_sequence):
  class TestSingleCharacterEntities (line 1060) | class TestSingleCharacterEntities:
    method test_single_token_entities_strict (line 1063) | def test_single_token_entities_strict(self):
    method test_single_token_entities_same_start_end (line 1080) | def test_single_token_entities_same_start_end(self):
    method test_single_token_entities_partial_evaluation (line 1097) | def test_single_token_entities_partial_evaluation(self):
    method test_single_token_entities_overlap_detection (line 1112) | def test_single_token_entities_overlap_detection(self):
    method test_single_token_adjacent_entities (line 1129) | def test_single_token_adjacent_entities(self):
    method test_single_token_missed_entity (line 1145) | def test_single_token_missed_entity(self):
  function test_minimum_overlap_percentage_validation (line 1162) | def test_minimum_overlap_percentage_validation():
  function test_overlap_percentage_calculation (line 1181) | def test_overlap_percentage_calculation():
  function test_has_sufficient_overlap (line 1207) | def test_has_sufficient_overlap():
  function test_partial_evaluation_with_min_overlap (line 1237) | def test_partial_evaluation_with_min_overlap():
  function test_strict_evaluation_with_min_overlap (line 1267) | def test_strict_evaluation_with_min_overlap():
  function test_entity_type_evaluation_with_min_overlap (line 1292) | def test_entity_type_evaluation_with_min_overlap():
  function test_exact_evaluation_with_min_overlap (line 1326) | def test_exact_evaluation_with_min_overlap():
  function test_edge_cases_overlap_calculation (line 1360) | def test_edge_cases_overlap_calculation():
  function test_multiple_entities_with_min_overlap (line 1383) | def test_multiple_entities_with_min_overlap():

FILE: tests/test_utils.py
  function test_list_to_spans (line 9) | def test_list_to_spans():
  function test_list_to_spans_1 (line 31) | def test_list_to_spans_1():
  function test_conll_to_spans (line 49) | def test_conll_to_spans():
  function test_conll_to_spans_1 (line 78) | def test_conll_to_spans_1():
  function test_split_list (line 96) | def test_split_list():
  function test_collect_named_entities_same_type_in_sequence (line 104) | def test_collect_named_entities_same_type_in_sequence():
  function test_collect_named_entities_sequence_has_only_one_entity (line 114) | def test_collect_named_entities_sequence_has_only_one_entity():
  function test_collect_named_entities_entity_goes_until_last_token (line 121) | def test_collect_named_entities_entity_goes_until_last_token():
  function test_collect_named_entities_no_entity (line 131) | def test_collect_named_entities_no_entity():

About this extraction

This document contains the full source code of the MantisAI/nervaluate GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 25 files (192.7 KB), approximately 48.3k tokens, and a symbol index with 157 extracted functions, classes, methods, constants, and types. Use it with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
