Full Code of MobileTeleSystems/Ambrosia for AI

Repository: MobileTeleSystems/Ambrosia
Branch: main
Commit: 0dd22ace9593
Files: 154
Total size: 17.8 MB

Directory structure:
gitextract_85e2_24t/

├── .editorconfig
├── .github/
│   └── workflows/
│       ├── publish.yaml
│       └── test.yaml
├── .gitignore
├── .pylintrc
├── .readthedocs.yaml
├── CHANGELOG.rst
├── CONTRIBUTING.rst
├── LICENSE
├── Makefile
├── README.rst
├── SECURITY.rst
├── ambrosia/
│   ├── VERSION
│   ├── __init__.py
│   ├── designer/
│   │   ├── __init__.py
│   │   ├── designer.py
│   │   └── handlers.py
│   ├── preprocessing/
│   │   ├── __init__.py
│   │   ├── aggregate.py
│   │   ├── cuped.py
│   │   ├── ml_var_reducer.py
│   │   ├── preprocessor.py
│   │   ├── robust.py
│   │   └── transformers.py
│   ├── spark_tools/
│   │   ├── __init__.py
│   │   ├── empiric.py
│   │   ├── split_tools.py
│   │   ├── stat_criteria.py
│   │   ├── stratification.py
│   │   └── theory.py
│   ├── splitter/
│   │   ├── __init__.py
│   │   ├── handlers.py
│   │   └── splitter.py
│   ├── tester/
│   │   ├── __init__.py
│   │   ├── binary_result_evaluation.py
│   │   ├── handlers.py
│   │   └── tester.py
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── _lib/
│   │   │   ├── __init__.py
│   │   │   ├── _bin_ci_aide.py
│   │   │   ├── _bootstrap_tools.py
│   │   │   ├── _selection_aide.py
│   │   │   └── _tools_aide.py
│   │   ├── ab_abstract_component.py
│   │   ├── back_tools.py
│   │   ├── bin_intervals.py
│   │   ├── configs.py
│   │   ├── decorators.py
│   │   ├── empirical_tools.py
│   │   ├── import_tools.py
│   │   ├── knn.py
│   │   ├── log.py
│   │   ├── pvalue_tools.py
│   │   ├── split_tools.py
│   │   ├── stat_criteria.py
│   │   ├── stratification.py
│   │   ├── theoretical_tools.py
│   │   ├── tools.py
│   │   └── type_checks.py
│   ├── types.py
│   └── version.py
├── context7.json
├── docs/
│   ├── Makefile
│   ├── make.bat
│   ├── requirements.txt
│   └── source/
│       ├── _static/
│       │   └── css/
│       │       └── style.css
│       ├── ab_cases/
│       │   └── kion_ab.rst
│       ├── ab_cases.rst
│       ├── ambrosia_elements/
│       │   ├── advanced_transformations.rst
│       │   ├── aggregation.rst
│       │   ├── designer.rst
│       │   ├── preprocessing.rst
│       │   ├── processor.rst
│       │   ├── robust.rst
│       │   ├── simple_transformation.rst
│       │   ├── splitter.rst
│       │   └── tester.rst
│       ├── ambrosia_nutshell.rst
│       ├── authors.rst
│       ├── changelog.rst
│       ├── conf.py
│       ├── contributing.rst
│       ├── develop.rst
│       ├── index.rst
│       ├── installation.rst
│       ├── nb_pandas_examples.rst
│       ├── nb_spark_examples.rst
│       ├── pandas_examples/
│       │   ├── 00_preprocessing.nblink
│       │   ├── 01_vr_transformations.nblink
│       │   ├── 02_preprocessor.nblink
│       │   ├── 03_pandas_designer.nblink
│       │   ├── 04_binary_design.nblink
│       │   ├── 05_pandas_splitter.nblink
│       │   ├── 06_pandas_tester.nblink
│       │   ├── 10_synthetic_experiment_full_pipeline_short.nblink
│       │   └── 11_cuped_example.nblink
│       ├── security.rst
│       ├── spark_examples/
│       │   ├── 07_spark_designer.nblink
│       │   ├── 08_spark_splitter.nblink
│       │   └── 09_spark_tester.nblink
│       └── usage.rst
├── examples/
│   ├── 00_preprocessing.ipynb
│   ├── 01_vr_transformations.ipynb
│   ├── 02_preprocessor.ipynb
│   ├── 03_pandas_designer.ipynb
│   ├── 04_binary_design.ipynb
│   ├── 05_pandas_splitter.ipynb
│   ├── 06_pandas_tester.ipynb
│   ├── 07_spark_designer.ipynb
│   ├── 08_spark_splitter.ipynb
│   ├── 09_spark_tester.ipynb
│   ├── 10_synthetic_experiment_full_pipeline_short.ipynb
│   ├── 11_cuped_example.ipynb
│   ├── 12_ratio_metrics_and_custom_functions.ipynb
│   ├── _examples_configs/
│   │   ├── aggregator.json
│   │   ├── boxcox_tranformer.json
│   │   ├── cuped_config.json
│   │   ├── designer_config.yaml
│   │   ├── kion_cuped_params.json
│   │   ├── multicuped_coef.json
│   │   ├── multicuped_config.json
│   │   ├── params_cuped.json
│   │   ├── preprocessor.json
│   │   ├── robust.json
│   │   └── splitter_config.yaml
│   └── test_installation.ipynb
├── poetry.toml
├── pyproject.toml
├── setup.cfg
└── tests/
    ├── __init__.py
    ├── configs/
    │   └── designer_config.yaml
    ├── conftest.py
    ├── test_aggregate.py
    ├── test_cuped.py
    ├── test_data/
    │   ├── kion_data.csv
    │   ├── ltv_retention.csv
    │   ├── nonlin_var_table.csv
    │   ├── pipeline_test.csv
    │   ├── result_ltv_ret_conv.csv
    │   ├── robust_moments.csv
    │   ├── splitter_dataframe.csv
    │   ├── stratification_data.csv
    │   ├── var_table.csv
    │   ├── watch_result.csv
    │   ├── watch_result_agg.csv
    │   └── week_metrics.csv
    ├── test_designer.py
    ├── test_ml_variance_reducer.py
    ├── test_preprocessor.py
    ├── test_robust.py
    ├── test_splitter.py
    ├── test_stratification.py
    ├── test_tester.py
    └── test_transformers.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .editorconfig
================================================
root = true

[*]
charset=utf-8
end_of_line=lf
insert_final_newline=true
indent_style=space
indent_size=4
max_line_length=120
trim_trailing_whitespace = true

[{*.yml,*.yaml,*.json,*.xml}]
indent_size = 2

[Makefile]
indent_style = tab

[*.rst]
trim_trailing_whitespace = false

================================================
FILE: .github/workflows/publish.yaml
================================================
name: Publish

on:
  release:
    types:
      - created

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Dump GitHub context
        env:
          GITHUB_CONTEXT: ${{ toJson(github) }}
        run: echo "$GITHUB_CONTEXT"

      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install poetry
        run: pip install "poetry>=1.5.0"

      - name: Install Dependencies
        run: poetry install --without dev

      - name: Build
        run: poetry build

      - name: Publish
        run: poetry publish -u __token__ -p ${{ secrets.PYPI_TOKEN_V2 }}


================================================
FILE: .github/workflows/test.yaml
================================================
name: Test

on:
  push:
    branches:
      - "**"
  pull_request:
    branches:
      - main
      - dev

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install poetry
        run: pip install "poetry>=1.5.0"

      - name: Install dependencies
        run: make install

      - name: Static analysis
        run: make lint

  test:
    name: test (${{ matrix.python-version }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install poetry
        run: pip install "poetry>=1.5.0"

      - name: Install dependencies
        run: make install

      - name: Run tests
        run: make test

      - name: Upload coverage
        uses: codecov/codecov-action@v4


================================================
FILE: .gitignore
================================================
.idea/
.vscode/

# Sphynx docs
docs/_build/
docs/build
docs/*.tar.gz

# Data
data/

# Virtualenv
mars_env/
.venv/

# Jupyter checkpoints
.ipynb_checkpoints

# Python dev artefacts
__pycache__/
*.py[cod]
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
.python-version
.DS_Store

# VS code artifacts
settings.json

.mypy_cache/
.pytest_cache/

# Claude Code
CLAUDE.md

# Tests artifacts
reports/
coverage.xml
.coverage
.coverage.*

# Catboost info
catboost_info/

================================================
FILE: .pylintrc
================================================
[MASTER]

# A comma-separated list of package or module names from where C extensions may
# be loaded. Extensions are loading into the active Python interpreter and may
# run arbitrary code.
extension-pkg-whitelist=nmslib

# Specify a score threshold to be exceeded before program exits with error.
fail-under=10

# Add files or directories to the blacklist. They should be base names, not
# paths.
ignore=tests

# Add files or directories matching the regex patterns to the blacklist. The
# regex matches against base names, not paths.
ignore-patterns=

# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=

# Use multiple processes to speed up Pylint. Specifying 0 will auto-detect the
# number of processors available to use.
jobs=1

# Control the amount of potential inferred values when inferring a single
# object. This can help the performance when dealing with large functions or
# complex, nested conditions.
limit-inference-results=100

# List of plugins (as comma separated values of python module names) to load,
# usually to register additional checkers.
load-plugins=

# Pickle collected data for later comparisons.
persistent=yes

# When enabled, pylint would attempt to guess common misconfiguration and emit
# user-friendly hints instead of false-positive error messages.
suggestion-mode=yes

# Allow loading of arbitrary C extensions. Extensions are imported into the
# active Python interpreter and may run arbitrary code.
unsafe-load-any-extension=no


[MESSAGES CONTROL]

# Only show warnings with the listed confidence levels. Leave empty to show
# all. Valid levels: HIGH, INFERENCE, INFERENCE_FAILURE, UNDEFINED.
confidence=

# Disable the message, report, category or checker with the given id(s). You
# can either give multiple identifiers separated by comma (,) or put this
# option multiple times (only on the command line, not in the configuration
# file where it should appear only once). You can also use "--disable=all" to
# disable everything first and then reenable specific checks. For example, if
# you want to run only the similarities checker, you can use "--disable=all
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use "--disable=all --enable=classes
# --disable=W".
disable=missing-class-docstring,
        missing-module-docstring,
        missing-function-docstring,
        duplicate-code,
        import-outside-toplevel,
        too-few-public-methods,
        logging-fstring-interpolation,
        unspecified-encoding,
        no-else-return,
        arguments-differ,
        protected-access

# Enable the message, report, category or checker with the given id(s). You can
# either give multiple identifier separated by comma (,) or put this option
# multiple time (only on the command line, not in the configuration file where
# it should appear only once). See also the "--disable" option for examples.
enable=c-extension-no-member


[REPORTS]

# Python expression which should return a score less than or equal to 10. You
# have access to the variables 'error', 'warning', 'refactor', and 'convention'
# which contain the number of messages in each category, as well as 'statement'
# which is the total number of statements analyzed. This score is used by the
# global evaluation report (RP0004).
evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)

# Template used to display messages. This is a python new-style format string
# used to format the message information. See doc for all details.
#msg-template=

# Set the output format. Available formats are text, parseable, colorized, json
# and msvs (visual studio). You can also give a reporter class, e.g.
# mypackage.mymodule.MyReporterClass.
output-format=text

# Tells whether to display a full report or only the messages.
reports=no

# Activate the evaluation score.
score=yes


[REFACTORING]

# Maximum number of nested blocks for function / method body
max-nested-blocks=5

# Complete name of functions that never returns. When checking for
# inconsistent-return-statements if a never returning function is called then
# it will be considered as an explicit return statement and no message will be
# printed.
never-returning-functions=sys.exit


[VARIABLES]

# List of additional names supposed to be defined in builtins. Remember that
# you should avoid defining new builtins when possible.
additional-builtins=

# Tells whether unused global variables should be treated as a violation.
allow-global-unused-variables=yes

# List of strings which can identify a callback function by name. A callback
# name must start or end with one of those strings.
callbacks=cb_,
          _cb

# A regular expression matching the name of dummy variables (i.e. expected to
# not be used).
dummy-variables-rgx=_+$|(_[a-zA-Z0-9_]*[a-zA-Z0-9]+?$)|dummy|^ignored_|^unused_

# Argument names that match this expression will be ignored. Default to name
# with leading underscore.
ignored-argument-names=_.*|^ignored_|^unused_

# Tells whether we should check for unused import in __init__ files.
init-import=no

# List of qualified module names which can have objects that can redefine
# builtins.
redefining-builtins-modules=six.moves,past.builtins,future.builtins,builtins,io


[TYPECHECK]

# List of decorators that produce context managers, such as
# contextlib.contextmanager. Add to this list to register other decorators that
# produce valid context managers.
contextmanager-decorators=contextlib.contextmanager

# List of members which are set dynamically and missed by pylint inference
# system, and so shouldn't trigger E1101 when accessed. Python regular
# expressions are accepted.
generated-members=

# Tells whether missing members accessed in mixin class should be ignored. A
# mixin class is detected if its name ends with "mixin" (case insensitive).
ignore-mixin-members=yes

# Tells whether to warn about missing members when the owner of the attribute
# is inferred to be None.
ignore-none=yes

# This flag controls whether pylint should warn about no-member and similar
# checks whenever an opaque object is returned when inferring. The inference
# can return multiple potential results while evaluating a Python object, but
# some branches might not be evaluated, which results in partial inference. In
# that case, it might be useful to still emit no-member and other checks for
# the rest of the inferred objects.
ignore-on-opaque-inference=yes

# List of class names for which member attributes should not be checked (useful
# for classes with dynamically set attributes). This supports the use of
# qualified names.
ignored-classes=optparse.Values,thread._local,_thread._local

# List of module names for which member attributes should not be checked
# (useful for modules/projects where namespaces are manipulated during runtime
# and thus existing member attributes cannot be deduced by static analysis). It
# supports qualified module names, as well as Unix pattern matching.
ignored-modules=

# Show a hint with possible names when a member name was not found. The aspect
# of finding the hint is based on edit distance.
missing-member-hint=yes

# The minimum edit distance a name should have in order to be considered a
# similar match for a missing member name.
missing-member-hint-distance=1

# The total number of similar names that should be taken in consideration when
# showing a hint for a missing member.
missing-member-max-choices=1

# List of decorators that change the signature of a decorated function.
signature-mutators=


[STRING]

# This flag controls whether inconsistent-quotes generates a warning when the
# character used as a quote delimiter is used inconsistently within a module.
check-quote-consistency=no

# This flag controls whether the implicit-str-concat should generate a warning
# on implicit string concatenation in sequences defined over several lines.
check-str-concat-over-line-jumps=no


[SPELLING]

# Limits count of emitted suggestions for spelling mistakes.
max-spelling-suggestions=4

# Spelling dictionary name. Available dictionaries: none. To make it work,
# install the python-enchant package.
spelling-dict=

# List of comma separated words that should not be checked.
spelling-ignore-words=

# A path to a file that contains the private dictionary; one word per line.
spelling-private-dict-file=

# Tells whether to store unknown words to the private dictionary (see the
# --spelling-private-dict-file option) instead of raising a message.
spelling-store-unknown-words=no


[SIMILARITIES]

# Ignore comments when computing similarities.
ignore-comments=yes

# Ignore docstrings when computing similarities.
ignore-docstrings=yes

# Ignore imports when computing similarities.
ignore-imports=no

# Minimum lines number of a similarity.
min-similarity-lines=4


[MISCELLANEOUS]

# List of note tags to take in consideration, separated by a comma.
notes=FIXME,
      XXX,
      TODO

# Regular expression of note tags to take in consideration.
#notes-rgx=


[LOGGING]

# The type of string formatting that logging methods do. `old` means using %
# formatting, `new` is for `{}` formatting.
logging-format-style=new

# Logging modules to check that the string format arguments are in logging
# function parameter format.
logging-modules=logging


[FORMAT]

# Expected format of line ending, e.g. empty (any line ending), LF or CRLF.
expected-line-ending-format=

# Regexp for a line that is allowed to be longer than the limit.
ignore-long-lines=^\s*(# )?<?https?://\S+>?$

# Number of spaces of indent required inside a hanging or continued line.
indent-after-paren=4

# String used as indentation unit. This is usually "    " (4 spaces) or "\t" (1
# tab).
indent-string='    '

# Maximum number of characters on a single line.
max-line-length=120

# Maximum number of lines in a module.
max-module-lines=1000

# Allow the body of a class to be on the same line as the declaration if body
# contains single statement.
single-line-class-stmt=no

# Allow the body of an if to be on the same line as the test if there is no
# else.
single-line-if-stmt=no


[BASIC]

# Naming style matching correct argument names.
argument-naming-style=snake_case

# Regular expression matching correct argument names. Overrides argument-
# naming-style.
#argument-rgx=

# Naming style matching correct attribute names.
attr-naming-style=snake_case

# Regular expression matching correct attribute names. Overrides attr-naming-
# style.
#attr-rgx=

# Bad variable names which should always be refused, separated by a comma.
bad-names=foo,
          bar,
          baz,
          toto,
          tutu,
          tata

# Bad variable names regexes, separated by a comma. If names match any regex,
# they will always be refused
bad-names-rgxs=

# Naming style matching correct class attribute names.
class-attribute-naming-style=any

# Regular expression matching correct class attribute names. Overrides class-
# attribute-naming-style.
#class-attribute-rgx=

# Naming style matching correct class names.
class-naming-style=PascalCase

# Regular expression matching correct class names. Overrides class-naming-
# style.
#class-rgx=

# Naming style matching correct constant names.
const-naming-style=UPPER_CASE

# Regular expression matching correct constant names. Overrides const-naming-
# style.
#const-rgx=

# Minimum line length for functions/classes that require docstrings, shorter
# ones are exempt.
docstring-min-length=-1

# Naming style matching correct function names.
function-naming-style=snake_case

# Regular expression matching correct function names. Overrides function-
# naming-style.
#function-rgx=

# Good variable names which should always be accepted, separated by a comma.
good-names=i,
           j,
           k,
           f,
           ex,
           e,
           db,
           Run,
           _,
           df,
           pd,
           np,
           it,
           id,
           ip,
           dt,
           by,
           N,

# Good variable names regexes, separated by a comma. If names match any regex,
# they will always be accepted
good-names-rgxs=

# Include a hint for the correct naming format with invalid-name.
include-naming-hint=no

# Naming style matching correct inline iteration names.
inlinevar-naming-style=any

# Regular expression matching correct inline iteration names. Overrides
# inlinevar-naming-style.
#inlinevar-rgx=

# Naming style matching correct method names.
method-naming-style=snake_case

# Regular expression matching correct method names. Overrides method-naming-
# style.
#method-rgx=

# Naming style matching correct module names.
module-naming-style=snake_case

# Regular expression matching correct module names. Overrides module-naming-
# style.
#module-rgx=

# Colon-delimited sets of names that determine each other's naming style when
# the name regexes allow several styles.
name-group=

# Regular expression which should only match function or class names that do
# not require a docstring.
no-docstring-rgx=^_

# List of decorators that produce properties, such as abc.abstractproperty. Add
# to this list to register other decorators that produce valid properties.
# These decorators are taken in consideration only for invalid-name.
property-classes=abc.abstractproperty

# Naming style matching correct variable names.
variable-naming-style=snake_case

# Regular expression matching correct variable names. Overrides variable-
# naming-style.
#variable-rgx=


[IMPORTS]

# List of modules that can be imported at any level, not just the top level
# one.
allow-any-import-level=

# Allow wildcard imports from modules that define __all__.
allow-wildcard-with-all=no

# Analyse import fallback blocks. This can be used to support both Python 2 and
# 3 compatible code, which means that the block might have code that exists
# only in one or another interpreter, leading to false positives when analysed.
analyse-fallback-blocks=no

# Deprecated modules which should not be used, separated by a comma.
deprecated-modules=optparse,tkinter.tix

# Create a graph of external dependencies in the given file (report RP0402 must
# not be disabled).
ext-import-graph=

# Create a graph of every (i.e. internal and external) dependencies in the
# given file (report RP0402 must not be disabled).
import-graph=

# Create a graph of internal dependencies in the given file (report RP0402 must
# not be disabled).
int-import-graph=

# Force import order to recognize a module as part of the standard
# compatibility libraries.
known-standard-library=

# Force import order to recognize a module as part of a third party library.
known-third-party=enchant

# Couples of modules and preferred modules, separated by a comma.
preferred-modules=


[DESIGN]

# Maximum number of arguments for function / method.
max-args=20

# Maximum number of attributes for a class (see R0902).
max-attributes=12

# Maximum number of boolean expressions in an if statement (see R0916).
max-bool-expr=5

# Maximum number of branch for function / method body.
max-branches=12

# Maximum number of locals for function / method body.
max-locals=25

# Maximum number of parents for a class (see R0901).
max-parents=7

# Maximum number of public methods for a class (see R0904).
max-public-methods=20

# Maximum number of return / yield for function / method body.
max-returns=6

# Maximum number of statements in function / method body.
max-statements=50

# Minimum number of public methods for a class (see R0903).
min-public-methods=0


[CLASSES]

# List of method names used to declare (i.e. assign) instance attributes.
defining-attr-methods=__init__,
                      __new__,
                      setUp,
                      __post_init__

# List of member names, which should be excluded from the protected access
# warning.
exclude-protected=_asdict,
                  _fields,
                  _replace,
                  _source,
                  _make

# List of valid names for the first argument in a class method.
valid-classmethod-first-arg=cls

# List of valid names for the first argument in a metaclass class method.
valid-metaclass-classmethod-first-arg=cls


[EXCEPTIONS]

# Exceptions that will emit a warning when being caught. Defaults to
# "BaseException, Exception".
overgeneral-exceptions=BaseException,
                       Exception

================================================
FILE: .readthedocs.yaml
================================================
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.11"
  jobs:
    post_install:
      - pip install --no-cache-dir poetry poetry-plugin-export
      - poetry export -f requirements.txt -o requirements.txt --without-hashes --without dev
      - pip install --no-cache-dir -r requirements.txt

sphinx:
  builder: html
  configuration: docs/source/conf.py
  fail_on_warning: false

python:
  install:
    - requirements: docs/requirements.txt

formats:
  - pdf


================================================
FILE: CHANGELOG.rst
================================================
Release Notes
=============

Version 0.5.1 (26.03.2026)
---------------------------

**New Features:**

* Custom metric functions in ``Tester``: new ``metric_funcs`` parameter allows
  passing arbitrary callables instead of column names. Works with ``theory``
  and ``empiric`` methods. Functions passed to ``run()`` override those set
  in the constructor.

* ``LinearizationTransformer`` for ratio metrics (e.g. revenue/orders).
  Linearizes a metric via ``linearized_i = numerator_i - ratio * denominator_i``,
  where ``ratio`` is estimated on reference data during ``fit()``
  (a sketch of this arithmetic follows this list).

* ``Preprocessor.linearize()`` integrates linearization into the existing
  chain architecture with full serialization and replay support.
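
A minimal sketch of the linearization arithmetic described above, using plain
pandas (the column and group names here are hypothetical, and the actual
``LinearizationTransformer`` interface may differ):

.. code:: python

    import pandas as pd

    df = pd.DataFrame({
        "revenue": [10.0, 40.0, 0.0, 25.0],  # numerator
        "orders": [1, 4, 0, 2],              # denominator
        "group": ["A", "A", "B", "B"],
    })

    # fit(): estimate the ratio on the reference (control) data
    reference = df[df["group"] == "A"]
    ratio = reference["revenue"].sum() / reference["orders"].sum()

    # transform(): linearized_i = numerator_i - ratio * denominator_i
    df["revenue_lin"] = df["revenue"] - ratio * df["orders"]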

**Bug Fixes:**

* Pinned ``setuptools>=65.0.0, <82.0.0`` to fix ``pkg_resources`` removal
  in setuptools 82 that broke ``pip install ambrosia`` due to hyperopt dependency.

**Internal:**

* Updated publish workflow to use ``PYPI_TOKEN_V2``

* Added CLAUDE.md to ``.gitignore``


Version 0.5.0 (06.01.2025)
---------------------------

**Breaking Changes:**

* Minimum Python version raised to 3.9 (dropped support for 3.7, 3.8)

* Minimum PySpark version raised to 3.4 (dropped support for 3.2, 3.3)

**New Features:**

* Added support for Python 3.11, 3.12, 3.13

**Bug Fixes:**

* Added hnswlib as fallback for nmslib on macOS ARM (fixes segfault in metric split)

**Dependencies:**

* Updated numpy to >=1.24.0, <3.0.0

* Updated pandas to >=1.5.0, <3.0.0

* Updated scipy to >=1.10.0

* Updated scikit-learn to >=1.3.0

* Updated nmslib to >=2.1.0

* Added hnswlib >=0.7.0 as alternative KNN backend

* Updated catboost to >=1.2.0

* Updated other dependencies for Python 3.12/3.13 compatibility

**Internal:**

* Replaced deprecated ``pkg_resources`` with ``importlib.metadata``

* Updated CI/CD to test Python 3.9-3.13

* Updated GitHub Actions to v4/v5


Version 0.4.1 (21.04.2023)
---------------------------

Hotfix for pyspark import in spark criteria.

Version 0.4.0 (21.04.2023)
---------------------------

* Documentation and usage examples have been substantially reworked and updated. 

* The ``Designer`` class and design methods functionality is updated.

  * Empirical design now supports the choice of hypothesis alternative and a group ratio parameter

  * The look of resulting tables with calculated parameters is unified across all design methods

  * Changed the multiprocessing strategy for the bootstrap criterion

* The ``Tester`` class functionality is updated.

  * Spark data support for the ``Tester`` class is added. An independent t-test is now available

  * Bootstrap criterion can now return deterministic output using a ``random_seed`` parameter
    (a sketch follows this release entry)

  * Paired bootstrap criterion is now available

  * MHC is now optional and takes into account the number of passed metrics

  * ``first_errors`` parameter renamed to ``first_type_errors``

* The ``pyspark`` package is now optional and can be installed using ``pip`` extras.

* Fixed a set of bugs.
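
A sketch of the deterministic bootstrap mentioned above; the ``method`` and
``criterion`` values follow the README usage section, while treating
``random_seed`` as a keyword to ``run()`` is an assumption for illustration:

.. code:: python

    from ambrosia.tester import Tester

    tester = Tester(dataframe=df, column_groups='group')
    # a fixed seed should make the bootstrap criterion output reproducible
    result = tester.run(metrics='retention', method='empiric',
                        criterion='bootstrap', random_seed=42)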


Version 0.3.0 (15.02.2023)
---------------------------

* The ``Designer`` class and design methods functionality is updated. 

  * Theoretical design now supports the choice of hypothesis alternative and a group ratio parameter

  * These calculations now use Statsmodels solvers

  * Experimental parameters for binary data can now also be theoretically designed using both 
    the asin variance-stabilizing transformation and the normal approximation

* All preprocessor classes, except for the ``Preprocessor``, have changed their API and updated their functionality

  * Preprocessing classes now use ``fit`` and ``transform`` methods to obtain transformation parameters
    and apply transformations to pandas tables (a usage sketch follows this release entry)

  * Fitted classes can now be saved to and loaded from json files

  * Table column names used when fitting class instances are now strictly fixed in instance attributes

* The ``Preprocessor`` class is updated.

  * Added new transformation methods

  * The executed transformation pipeline can now be saved and loaded from a json file. 
    This can be used to store and load the entire experimental data processing pipeline

  * The data handling methods of the class have changed some parameters to match the changes in the classes used

* The ``IQRPreprocessor`` class is now available in ``ambrosia.preprocessing``.

  * It can be used to remove outliers based on quartile and interquartile range estimates

* The ``RobustPreprocessor`` class is updated.

  * It now supports different types of tails for removal: ``both``, ``right`` or ``left``

  * For each processed column, a separate alpha portion of the distribution can be passed.

* The ``BoxCoxTransformer`` class is now available in ``ambrosia.preprocessing``

  * It can be used for data distribution normalization.

* The ``LogTransformer`` class is now available in ``ambrosia.preprocessing``

  * It can be used to transform data for variance reduction.

* The ``MLVarianceReducer`` class is updated.

  * Now it can store and load the selected ML model from a single specified path
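
A hedged sketch of the ``fit``/``transform`` flow referenced above. The class
name comes from this changelog, while the constructor arguments, the ``fit``
signature, and the save/load method names are assumptions for illustration:

.. code:: python

    from ambrosia.preprocessing import RobustPreprocessor

    robust = RobustPreprocessor()             # exact constructor args may differ
    robust.fit(df, column_names=["metric"])   # hypothetical parameter name
    df_clean = robust.transform(df)

    # Per the notes above, fitted instances can be serialized to json;
    # the actual method names may differ from these hypothetical ones.
    robust.store_params("robust_params.json")
    restored = RobustPreprocessor()
    restored.load_params("robust_params.json")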

Version 0.2.0 (22.11.2022)
---------------------------

Library name changed back to ``ambrosia``. The naming conflict in PyPI has been resolved.
0.1.x versions are still available in PyPI under the ``ambrozia`` name.

Version 0.1.2 (16.11.2022)
---------------------------

Hotfix for the Ttest stat criterion absolute effect calculation.
The URL to the main image was removed from the docs.

Version 0.1.1 (04.10.2022)
---------------------------

Hotfix for library naming.
The library was temporarily renamed to ``ambrozia`` in the PyPI repository due to a hidden naming conflict.

Version 0.1.0 (03.10.2022)
---------------------------

First release of ``Ambrosia`` package:

    * Added ``Designer`` class for experiment parameters design
    * Added ``Splitter`` class for A/B groups split
    * Added ``Tester`` class for experiment effect measurement 
    * Added various classes for experiment data preprocessing
    * Added A/B testing tools with wide functionality  


================================================
FILE: CONTRIBUTING.rst
================================================
Contributing Guide 
===================

`Ambrosia` is an open source project, and there are many ways to contribute: from writing tutorials or blog posts,
improving the documentation, and submitting bug reports and feature requests, to writing code which can be incorporated
into `Ambrosia` itself.

Bug reports
-----------

If you think you have found a bug in `Ambrosia`, first make sure that you are testing against
the latest version of the package - your issue may already have been fixed. If not,
search our issues list on GitHub in case a similar issue has already been opened.

It is very helpful if you can prepare a reproduction of the bug. 
In other words, provide a small test case which we can run to confirm your bug. 
It makes it easier to find the problem and to fix it. 

Provide as much information as you can. The easier it is for us to recreate your problem,
the faster it is likely to be fixed.

Feature requests
----------------

If you find yourself wishing for a feature that doesn't exist in `Ambrosia`, you can open an issue 
on our `issues list   <https://github.com/MobileTeleSystems/Ambrosia/issues>`_ on GitHub 
which describes the feature you would like to see, why you need it, and how it should work.


Contributing code and documentation changes
-------------------------------------------

If you have a bugfix or new feature that you would like to contribute to `Ambrosia`, 
please find or open an issue about it first. Talk about what you would like to do. 
It may be that somebody is already working on it, 
or that there are particular issues that you should know about before implementing the change.

There are many approaches to fixing a problem and it is important to find the best approach 
before writing too much code.

Branching
---------

Those users with Contributor permissions can directly clone the repository and work on a branch within it.

Those without Contributor permissions will need to fork the main repository to work on their changes.
Simply navigate to our GitHub page and click the “Fork” button at the top.
Once you have forked the repository, you can clone your new repository and start making edits.

When using git, it is best to isolate each topic or feature into a “topic branch”. 
Branches are a great way to group commits related to one feature together, 
or to isolate different efforts when you might be working on multiple topics at the same time.

While it takes some experience to get the right feel about how to break up commits, 
a topic branch should be limited in scope to a single issue. If you are working on multiple issues, 
please create multiple branches and submit them for review separately.

Pull Request Guidelines
-----------------------

Create a pull request for preliminary review or merging into the project when you are ready.

If you need to make any adjustments to your pull request, just push the updates to your branch. 
Your pull request will automatically track the changes on your development branch and update.

You may merge the pull request once you have the sign-off of two other developers;
if you do not have permission to do that, you may ask the second reviewer to merge it for you.
We expect a minimum of one approval from someone else on the core team.

================================================
FILE: LICENSE
================================================
Copyright 2022 MTS (Mobile Telesystems).  All rights reserved.

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2022 MTS (Mobile Telesystems).

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: Makefile
================================================
VENV=.venv

ifeq (${OS},Windows_NT)
	BIN=${VENV}/Scripts
else
	BIN=${VENV}/bin
endif

export PATH := $(BIN):$(PATH)

FLAKE=flake8
PYLINT=pylint
ISORT=isort
BLACK=black
PYTEST=pytest
COVERAGE=coverage

SOURCES=ambrosia
TESTS=tests
REPORTS=reports


# Installation

reports:
	@mkdir ${REPORTS}

.venv:
	@echo "Creating virtualenv...\t\t"
	poetry install --no-root
	poetry install --all-extras
	@echo "[Installed]"

install: .venv reports


# Linters

.isort:
	@echo "Running isort checks..."
	@${ISORT} --check ${SOURCES} ${TESTS}
	@echo "[Isort checks finished]"

.black:
	@echo "Running black checks..."
	@${BLACK} --check --diff ${SOURCES} ${TESTS}
	@echo "[Black checks finished]"

.pylint: reports
	@echo "Running pylint checks..."
	@${PYLINT} ${SOURCES} ${TESTS} --exit-zero 
	@${PYLINT} ${SOURCES} ${TESTS} --exit-zero > ${REPORTS}/pylint.txt
	@echo "[Pylint checks finished]"

.flake8:
	@echo "Running flake8 checks...\t"
	@${FLAKE} ${SOURCES} ${TESTS} --exit-zero 
	@echo "[Flake8 checks finished]"


# Fixers & formatters

.isort_fix:
	@echo "Fixing isort..."
	@${ISORT} ${SOURCES} ${TESTS}
	@echo "[Isort fixed]"

.black_fix:
	@echo "Formatting with black..."
	@${BLACK} -q  ${SOURCES} ${TESTS}
	@echo "[Black fixed]"


# Tests

.pytest:
	@echo "Running pytest checks...\t"
	@PYTHONPATH=. ${PYTEST} --cov=${SOURCES} --cov-report=xml:${REPORTS}/coverage.xml

coverage: .venv reports
	@echo "Running coverage..."
	${COVERAGE} run --source ${SOURCES} --module pytest
	${COVERAGE} report
	${COVERAGE} html -d ${REPORTS}/coverage_html
	${COVERAGE} xml -o ${REPORTS}/coverage.xml -i


# Generalization

.autoformat: .isort_fix .black_fix
autoformat: .venv .autoformat

.lint: .isort .black .pylint .flake8
lint: .venv .lint

.test: .pytest 
test: .venv .test


# Cleaning

clean:
	@rm -rf build dist .eggs *.egg-info
	@rm -rf ${VENV}
	@rm -rf ${REPORTS}
	@find . -type d -name '.mypy_cache' -exec rm -rf {} +
	@find . -type d -name '*pytest_cache*' -exec rm -rf {} +

reinstall: clean install

================================================
FILE: README.rst
================================================
.. shields start

Ambrosia
========

|PyPI| |PyPI License| |ReadTheDocs| |Tests| |Coverage| |Black| |Python Versions| |Telegram Channel|

.. |PyPI| image:: https://img.shields.io/pypi/v/ambrosia?v=0.5.1
    :target: https://pypi.org/project/ambrosia
.. |PyPI License| image:: https://img.shields.io/pypi/l/ambrosia.svg
    :target: https://github.com/MobileTeleSystems/Ambrosia/blob/main/LICENSE
.. |ReadTheDocs| image:: https://img.shields.io/readthedocs/ambrosia.svg
    :target: https://ambrosia.readthedocs.io
.. |Tests| image:: https://img.shields.io/github/actions/workflow/status/MobileTeleSystems/Ambrosia/test.yaml?branch=main
    :target: https://github.com/MobileTeleSystems/Ambrosia/actions/workflows/test.yaml?query=branch%3Amain+
.. |Coverage| image:: https://codecov.io/gh/MobileTeleSystems/Ambrosia/branch/main/graph/badge.svg
    :target: https://codecov.io/gh/MobileTeleSystems/Ambrosia
.. |Black| image:: https://img.shields.io/badge/code%20style-black-000000.svg
    :target: https://github.com/psf/black
.. |Python Versions| image:: https://img.shields.io/pypi/pyversions/ambrosia.svg?v=0.5.1
    :target: https://pypi.org/project/ambrosia  
.. |Telegram Channel| image:: https://img.shields.io/badge/telegram-Ambrosia-blueviolet.svg?logo=telegram
    :target: https://t.me/+Tkt43TNUUSAxNWNi

.. shields end

.. image:: https://raw.githubusercontent.com/MobileTeleSystems/Ambrosia/main/docs/source/_static/ambrosia.png
   :height: 320 px
   :width: 320 px
   :align: center

.. title

*Ambrosia* is a Python library for A/B test design, group split, and effect measurement.
It provides a rich set of methods for conducting a full A/B testing pipeline.

The project is intended for use in research and production environments
based on data in pandas and Spark formats.

.. functional

Key functionality
-----------------

* Pilots design 🛫
* Multi-group split 🎳
* Matching of new control group to the existing pilot 🎏
* Experiments result evaluation as p-value, point estimate of effect and confidence interval 🎞
* Data preprocessing ✂️
* Experiments acceleration 🎢

.. documentation

Documentation
-------------

For more details, see the `Documentation <https://ambrosia.readthedocs.io/>`_ 
and `Tutorials <https://github.com/MobileTeleSystems/Ambrosia/tree/main/examples>`_.

.. install

Installation
------------

**Requirements:** Python 3.9+

You can always get the newest *Ambrosia* release using ``pip``.
A stable version is released on every tag to the ``main`` branch.

.. code:: bash
    
    pip install ambrosia 

Starting from version ``0.4.0``, the ability to process PySpark data is optional and can be enabled 
using ``pip`` extras during the installation.

.. code:: bash
    
    pip install ambrosia[spark]

.. usage

Usage
-----

The main functionality of *Ambrosia* is contained in several core classes and methods,
which are autonomous for each stage of an experiment and have a very intuitive interface.

|

Below is a brief example of using the set of three core classes to conduct a simple experiment.

**Designer**

.. code:: python

    from ambrosia.designer import Designer
    designer = Designer(dataframe=df, effects=1.2, metrics='portfel_clc') # 20% effect, and loaded data frame df
    designer.run('size') 

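Because size, effect, and power form an interdependent triplet, the same call
can design the remaining parameter once the others are fixed. A hedged sketch,
assuming ``run()`` accepts ``'effect'`` and ``'power'`` as design targets
symmetrically to ``'size'``:

.. code:: python

    designer = Designer(dataframe=df, sizes=500, metrics='portfel_clc')
    designer.run('effect')  # minimal detectable effect for groups of 500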

**Splitter**

.. code:: python

    from ambrosia.splitter import Splitter
    splitter = Splitter(dataframe=df, id_column='id') # loaded data frame df with column with id - 'id'
    splitter.run(groups_size=500, method='simple') 


**Tester**

.. code:: python

    from ambrosia.tester import Tester
    tester = Tester(dataframe=df, column_groups='group') # loaded data frame df with groups info 'group'
    tester.run(metrics='retention', method='theory', criterion='ttest')

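Since version ``0.5.1``, ``Tester`` also accepts custom metric callables via a
``metric_funcs`` parameter, both in the constructor and in ``run()`` (see the
changelog). A minimal sketch only; the exact shape expected for ``metric_funcs``
is an assumption here and may differ:

.. code:: python

    # hypothetical shape: named callables applied to metric data instead of column names
    tester.run(metric_funcs={'mean_retention': lambda values: values.mean()},
               method='theory')
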
.. develop

Development
-----------

To install all requirements run

.. code:: bash

    make install

You must have ``python3`` and ``poetry`` installed.

For autoformatting run

.. code:: bash

    make autoformat

For linters check run

.. code:: bash

    make lint

For tests run

.. code:: bash

    make test

For coverage run

.. code:: bash

    make coverage

To remove virtual environment run

.. code:: bash

    make clean

.. contributors

Authors
-------

**Developers and evangelists**:

* `Bayramkulov Aslan <https://github.com/aslanbm>`_
* `Khakimov Artem <https://github.com/xandaau>`_
* `Vasin Artem <https://github.com/VictorFromChoback>`_


================================================
FILE: SECURITY.rst
================================================
Security Policy
===============

Supported Python versions
-------------------------

3.9 or above

Product development security recommendations
--------------------------------------------

1. Update dependencies to the latest stable versions
2. Build SBOM for the project
3. Perform SAST (Static Application Security Testing) where possible

Product development security requirements
-----------------------------------------

1. No binaries in repository
2. No passwords, keys, access tokens in source code
3. No "Critical" and/or "High" vulnerabilities in contributed source code

Vulnerability reports
---------------------

Please use the email `<ambajramk1@mts.ru>`__ to report security issues or anything that could have security consequences.

Please avoid any public disclosure (including registering issues) at least until the issue is fixed. Thank you in advance for your understanding.


================================================
FILE: ambrosia/VERSION
================================================
0.5.1


================================================
FILE: ambrosia/__init__.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Ambrosia
===============================

Ambrosia is a Python library for A/B test design, group split, and effect
measurement. It provides a rich set of methods for conducting a full
A/B test pipeline. In particular, the design stage can be performed
using data from both pandas and Spark dataframes with either a
theoretical or an empirical approach. Split methods support different
strategies and multigroup split. The final effect measurement stage can
be conducted via testing tools that allow measuring relative
and absolute effects and constructing corresponding confidence intervals
for continuous and binary variables. Testing tools, as well as design,
support a significant number of statistical criteria, such as the t-test,
non-parametric tests, and bootstrap. For additional A/B test support, the
package provides features and tools for data preprocessing and
experiment acceleration.

See "https://ambrosia.readthedocs.io" for complete documentation.

Subpackages
------------
    preprocessing - Experiment data preprocessing
    designer - Experiments design
    splitter - Groups split
    tester - Effects measurement
    tools - Core methods
    spark_tools - Spark methods
"""

from ambrosia.version import __version__


================================================
FILE: ambrosia/designer/__init__.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Subpackage for experiment and pilots design.
"""
from .designer import (
    Designer,
    design,
    design_binary,
    design_binary_effect,
    design_binary_power,
    design_binary_size,
    load_from_config,
)

__all__ = [
    "Designer",
    "design",
    "load_from_config",
    "design_binary_size",
    "design_binary_effect",
    "design_binary_power",
    "design_binary",
]


================================================
FILE: ambrosia/designer/designer.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Experiment design methods.

Module contains `Designer` core class and `design` method which are
intended to conduct the experiment design for A/B/.. tests via different
methods.

Experiment design of the individual metric is based on its historical data
and could be done for any parameter from the self-dependent triplet:
group size, effect size and experiment power.

Currently, experiment design problem could be solved using data provided
in form of both pandas and Spark(with some restrictions) dataframes.
"""
from __future__ import annotations

from typing import List, Optional

import numpy as np
import pandas as pd
import yaml

import ambrosia.tools.bin_intervals as bin_pkg
import ambrosia.tools.theoretical_tools as theory_pkg
from ambrosia import types
from ambrosia.tools.ab_abstract_component import ABMetaClass, ABToolAbstract, SimpleDesigner

from .handlers import EmpiricHandler, TheoryHandler, calc_prob_control_class

SIZE: str = "size"
EFFECT: str = "effect"
POWER: str = "power"
BINARY_DESIGN_METHODS: List[str] = ["theory", "binary"]


class Designer(yaml.YAMLObject, ABToolAbstract, metaclass=ABMetaClass):
    """
    Unit for experiments and pilots parameters design.

    Enables designing missing experiment parameters using historical data.
    The main mutually dependent designable parameters for a single metric are:

        - Effect (Minimum Detectable Effect):
            old_mean_metric_value * effect_value = new_mean_metric_value
        - Sample size:
            Number of research objects in a sample
            (for example, the number of users and their retention).
        - Errors (type I error, type II error):
            Type I error (alpha):
                Probability of detecting an effect
                for equally distributed samples.
            Type II error (beta):
                Probability of not finding an effect
                for differently distributed samples.

    Parameters
    ----------
    dataframe : PassedDataType, optional
        DataFrame with metrics historical values.
    sizes : SampleSizeType, optional
        Number of research objects in group samples during
        the experiment.
    effects : EffectType, optional
        Effects values that are expected during the experiment.
    first_type_errors : StatErrorType, default: ``0.05``
        I type error bounds
        P (detect difference for equal) < alpha.
    second_type_errors : StatErrorType, default: ``0.2``
        II type error bounds
        P (suppose equality for different groups) < beta.
    metrics : MetricNamesType, optional
        Column names of metrics in dataframe to be designed.
    method : str, optional
        Method used for experiment design.
        Can be ``"theory"``, ``"empiric"`` or ``"binary"``.


    Attributes
    ----------
    dataframe : PassedDataType
        DataFrame with metrics historical values.
    sizes : SampleSizeType
        Number of research objects in group samples.
    effects : EffectType
        Effects values in the experiment.
    first_type_errors : StatErrorType, default: ``0.05``
        I type errors.
    second_type_errors : StatErrorType, default: ``0.2``
        II type errors.
    metrics : MetricNamesType
        Column names of metrics in dataframe to be designed.
    method : str
        Method used for experiment design.

    Examples
    --------
    We have retention labels for users of a mobile app for the previous
    month. Suppose old_retention = ``0.3``, that is, 30% of users returned
    to the app in a month after installation.

    Let us fix the following parameters:
        I type error (alpha) = ``0.05``
        (we may mistakenly consider 5% of equal samples to be different).

        II type error (beta) = ``0.2``
        (we may mistakenly consider 20% of different samples to be equal).

    We add onboarding to our app, want to estimate its effect by A/B
    testing, and wish to increase the retention value to 31%, so our
    effect parameter takes the value ``1.0(3)`` (31/30). Now we want to
    find how many users we need in both groups to detect such an effect.

    We can use ``Designer`` class in the following way:

    >>> designer = Designer(dataframe=df, metric='retention', effect=1.033)
    >>> designer.run("size")

    Note, that default values for errors are:
        ``first_type_error`` = ``0.05``

        ``second_type_error`` = ``0.2``

    Then we get a dataframe that contains the sufficient number of users
    for our experiment.

    Notes
    -----

    Constructors:

    >>> designer = Designer()
    >>> # You can pass an Iterable or single object for some parameters
    >>> designer = Designer(
    >>>     dataframe=df,
    >>>     sizes=[100, 200],
    >>>     metrics='LTV',
    >>>     effects=1.05
    >>> )
    >>> designer = Designer(sizes=1000, metrics=['retention', 'LTV'])
    >>> # You can use path to .csv table for pandas
    >>> designer = Designer('./data/table.csv')

    Setters:

    >>> designer.set_first_errors([0.05, 0.01])
    >>> designer.set_dataframe(df)

    Run:

    >>> # One can pass arguments and they will have higher priority
    >>> designer.run('size', effects=1.1)
    >>> designer.run('effect', sizes=[500, 1000], metrics='retention')
    >>> # You can set the method (see below)
    >>> designer.run('effect', sizes=[500, 1000], metrics='retention', method='binary')

    Load from yaml config:

    >>> config = '''
            !designer # <--- this is the yaml tag (!important)
                effects:
                    - 0.9
                    - 1.05
                sizes:
                    - 1000
        '''
    >>> designer = yaml.load(config, Loader=yaml.Loader)
    >>> # Or use the implemented function
    >>> designer = load_from_config(config)

    Use standalone function instead of a class:

    >>> design('size', dataframe=df, effects=1.05, metrics='retention')
    """

    # YAML tag for loading from configs
    yaml_tag = "!designer"

    def set_first_errors(self, first_type_errors: types.StatErrorType) -> None:
        if isinstance(first_type_errors, float):
            self.__alpha = [first_type_errors]
        else:
            self.__alpha = first_type_errors

    def set_second_errors(self, second_type_errors: types.StatErrorType) -> None:
        if isinstance(second_type_errors, float):
            self.__beta = [second_type_errors]
        else:
            self.__beta = second_type_errors

    def set_sizes(self, sizes: types.SampleSizeType) -> None:
        if isinstance(sizes, int):
            self.__size = [sizes]
        else:
            self.__size = sizes

    def set_effects(self, effects: types.EffectType) -> None:
        if isinstance(effects, (float, int)):
            self.__effect = [effects]
        else:
            self.__effect = effects

    def set_dataframe(self, dataframe: types.PassedDataType) -> None:
        if isinstance(dataframe, str):
            if dataframe.endswith(".csv"):
                self.__df = pd.read_csv(dataframe)
            else:
                raise ValueError("File name must ends with .csv")
        else:
            self.__df = dataframe

    def set_method(self, method: str) -> None:
        self.__method = method

    def set_metrics(self, metrics: str) -> None:
        if isinstance(metrics, types.MetricNameType):
            self.__metrics = [metrics]
        else:
            self.__metrics = metrics

    def __init__(
        self,
        dataframe: Optional[types.PassedDataType] = None,
        sizes: Optional[types.SampleSizeType] = None,
        effects: Optional[types.EffectType] = None,
        first_type_errors: types.StatErrorType = 0.05,
        second_type_errors: types.StatErrorType = 0.2,
        metrics: Optional[types.MetricNamesType] = None,
        method: str = "theory",
    ):
        """
        Designer class constructor to initialize the object.
        """
        self.set_first_errors(first_type_errors)
        self.set_second_errors(second_type_errors)
        self.set_sizes(sizes)
        self.set_effects(effects)
        self.set_metrics(metrics)
        self.set_dataframe(dataframe)
        self.set_method(method)

    def __getstate__(self):
        """
        Get the state of the object to serialize.
        """
        return dict(
            effects=self.__effect,
            sizes=self.__size,
            first_type_errors=self.__alpha,
            second_type_errors=self.__beta,
            metrics=self.__metrics,
            method=self.__method,
        )

    @classmethod
    def from_yaml(cls, loader: yaml.Loader, node: yaml.Node):
        kwargs = loader.construct_mapping(node)
        return cls(**kwargs)

    @staticmethod
    def __dataframe_handler(handler: SimpleDesigner, parameter: str, **kwargs) -> pd.DataFrame:
        """
        Handles different dataframe types.
        Currently pandas and Spark are available.
        """
        if parameter == SIZE:
            return handler.size_design(**kwargs)
        elif parameter == EFFECT:
            return handler.effect_design(**kwargs)
        elif parameter == POWER:
            return handler.power_design(**kwargs)
        else:
            raise ValueError(f"Only {SIZE}, {EFFECT} and {POWER} parameters of the experiment could be designed.")

    @staticmethod
    def __theory_design(label: str, args: types._UsageArgumentsType, **kwargs) -> types.DesignerResult:
        """
        Designing an experiment, using a theoretical approach.
        """
        result: types.DesignerResult = {}
        for metric_name in args["metric"]:
            kwargs["dataframe"] = args["df"]
            kwargs["column"] = metric_name
            kwargs["first_errors"] = np.array(args["alpha"])
            if label == SIZE:
                kwargs["effects"] = args[EFFECT]
                kwargs["second_errors"] = np.array(args["beta"])
            elif label == EFFECT:
                kwargs["sample_sizes"] = args[SIZE]
                kwargs["second_errors"] = np.array(args["beta"])
            elif label == POWER:
                kwargs["sample_sizes"] = args[SIZE]
                kwargs["effects"] = args[EFFECT]
            result[metric_name] = Designer.__dataframe_handler(TheoryHandler(), label, **kwargs)
        if len(args["metric"]) == 1:
            return result[args["metric"][0]]
        else:
            return result

    @staticmethod
    def __empiric_design(label: str, args: types._UsageArgumentsType, **kwargs) -> types.DesignerResult:
        """
        Designing an experiment, using an empirical approach.
        """
        kwargs["dataframe"] = args["df"]
        kwargs["alphas"] = np.array(args["alpha"])
        kwargs["metrics"] = args["metric"]
        if label == SIZE:
            kwargs["effects"] = args[EFFECT]
            kwargs["betas"] = np.array(args["beta"])
        elif label == EFFECT:
            kwargs["group_sizes"] = args[SIZE]
            kwargs["betas"] = np.array(args["beta"])
        elif label == POWER:
            groups_ratio: float = kwargs.pop("groups_ratio") if "groups_ratio" in kwargs else 1.0
            kwargs["sample_sizes_a"] = args[SIZE]
            kwargs["sample_sizes_b"] = [int(groups_ratio * size) for size in args[SIZE]]
            kwargs["effects"] = args[EFFECT]
        return Designer.__dataframe_handler(EmpiricHandler(), label, **kwargs)

    @staticmethod
    def __binary_design(label: str, args: types._UsageArgumentsType, **kwargs) -> types.DesignerResult:
        """
        Designing an experiment, using the approach for binary metrics.
        """
        result: types.DesignerResult = {}
        kwargs["first_errors"] = np.array(args["alpha"])
        for metric_name in args["metric"]:
            kwargs["p_a"] = calc_prob_control_class(args["df"], metric_name)
            if label == SIZE:
                kwargs["delta_relative_values"] = args[EFFECT]
                kwargs["second_errors"] = args["beta"]
                result[metric_name] = bin_pkg.get_table_sample_size_on_effect(**kwargs)
            elif label == EFFECT:
                kwargs["second_errors"] = args["beta"]
                kwargs["sample_sizes"] = args[SIZE]
                result[metric_name] = bin_pkg.get_table_effect_on_sample_size(**kwargs)
            elif label == POWER:
                kwargs["delta_relative_values"] = args[EFFECT]
                kwargs["sample_sizes"] = args[SIZE]
                result[metric_name] = bin_pkg.get_table_power_on_size_and_delta(**kwargs)
        if len(args["metric"]) == 1:
            return result[args["metric"][0]]
        else:
            return result

    @staticmethod
    def __pre_design(label: str, args: types._UsageArgumentsType, **kwargs) -> types.DesignerResult:
        """
        Helper function for run() method logic.
        """
        admissible_methods: List[str] = ["theory", "empiric", "binary"]
        if args["method"] == "theory":
            return Designer.__theory_design(label, args, **kwargs)
        elif args["method"] == "empiric":
            return Designer.__empiric_design(label, args, **kwargs)
        elif args["method"] == "binary":
            return Designer.__binary_design(label, args, **kwargs)
        else:
            raise ValueError(f'Choose method from {", ".join(admissible_methods)}, got {args["method"]}')

    def run(
        self,
        to_design: str,
        method: Optional[str] = None,
        sizes: Optional[types.SampleSizeType] = None,
        effects: Optional[types.EffectType] = None,
        first_type_errors: Optional[types.StatErrorType] = None,
        second_type_errors: Optional[types.StatErrorType] = None,
        dataframe: Optional[types.PassedDataType] = None,
        metrics: Optional[types.MetricNamesType] = None,
        **kwargs,
    ) -> types.DesignerResult:
        """
        Perform an experiment design for chosen parameter and metrics
        using historical data.

        Parameters
        ----------
        to_design : str
           Parameter that will be designed using historical data.
           Can take the values of ``"size"``, ``"effect"`` or ``"power"``.
        method : str, optional
            Method used for experiment design.
            Can be ``"theory"``, ``"empiric"`` or ``"binary"``.
        sizes : SampleSizeType, optional
            Number of research objects in group samples during
            the experiment.
            If not provided, must exist as a proper class attribute.
        effects : EffectType, optional
            Effects expected in the experiment.
            If not provided, must exist as a proper class attribute.
        first_type_errors : StatErrorType, optional
            I type error bounds
            P (detect difference for equal) < alpha.
        second_type_errors : StatErrorType, optional
            II type error bounds
            P (suppose equality for different groups) < beta.
        dataframe : PassedDataType, optional
            DataFrame with metrics historical values.
            If not provided, must exist as a proper class attribute.
        metrics : MetricNamesType, optional
            Column names of metrics in dataframe to be designed.
            If not provided, must exist as a proper class attribute.
        **kwargs : Dict
            Other keyword arguments.

        Other Parameters
        ----------------
        as_numeric : bool, default: ``False``
            The result of calculations can be obtained either as a percentage
            string or as a number; this parameter toggles the format.
        groups_ratio : float, default: ``1.0``
            Ratio between two groups.
        alternative : str, default: ``"two-sided"``
            Alternative hypothesis, can be ``"two-sided"``, ``"greater"``
            or ``"less"``.
            ``"greater"`` - if effect is positive.
            ``"less"`` - if effect is negative.
        stabilizing_method : str, default: ``"asin"``
            Effect transformation. Can be ``"asin"`` or ``"norm"``.
            For non-binary metrics only ``"norm"`` is acceptable.
            For binary metrics both ``"norm"`` and ``"asin"`` can be used,
            but ``"asin"`` is more robust and accurate.
            Applicable only to the ``"theory"`` method and relevant for
            binary metrics!

        Returns
        -------
        result : DesignerResult
            Table or dictionary with the results of parameter design for each
            metric.
        """
        if isinstance(effects, (float, int)):
            effects = [effects]
        if isinstance(sizes, int):
            sizes = [sizes]
        if isinstance(first_type_errors, float):
            first_type_errors = [first_type_errors]
        if isinstance(second_type_errors, float):
            second_type_errors = [second_type_errors]
        if isinstance(metrics, types.MetricNameType):
            metrics = [metrics]

        arguments_choice: types._PrepareArgumentsType = {
            "df": (self.__df, dataframe),
            "alpha": (self.__alpha, first_type_errors),
            "metric": (self.__metrics, metrics),
            "method": (self.__method, method),
        }

        designable_parameters: List[str] = [SIZE, EFFECT, POWER]
        if to_design == SIZE:
            arguments_choice[EFFECT] = (self.__effect, effects)
            arguments_choice["beta"] = (self.__beta, second_type_errors)
            chosen_args: types._UsageArgumentsType = Designer._prepare_arguments(arguments_choice)
            return Designer.__pre_design(SIZE, chosen_args, **kwargs)
        elif to_design == EFFECT:
            arguments_choice[SIZE] = (self.__size, sizes)
            arguments_choice["beta"] = (self.__beta, second_type_errors)
            chosen_args: types._UsageArgumentsType = Designer._prepare_arguments(arguments_choice)
            return Designer.__pre_design(EFFECT, chosen_args, **kwargs)
        elif to_design == POWER:
            arguments_choice[SIZE] = (self.__size, sizes)
            arguments_choice[EFFECT] = (self.__effect, effects)
            chosen_args: types._UsageArgumentsType = Designer._prepare_arguments(arguments_choice)
            return Designer.__pre_design(POWER, chosen_args, **kwargs)
        else:
            raise ValueError(f'Incorrect parameter name to design, choose from {", ".join(designable_parameters)}')


def load_from_config(yaml_config: str, loader: type = yaml.Loader) -> Designer:
    """
    Restore a ``Designer`` class instance from a yaml config.

    For yaml_config you can pass a file name with the config;
    it must end with .yaml, for example: "config.yaml".

    For the loader you can choose SafeLoader.
    """
    if isinstance(yaml_config, str) and yaml_config.endswith(".yaml"):
        with open(yaml_config, "r", encoding="utf8") as file:
            return yaml.load(file, Loader=loader)
    return yaml.load(yaml_config, Loader=loader)
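
# A short sketch of restoring a ``Designer`` from an inline YAML string,
# following the docstring above; the parameter values are illustrative.
#
# >>> config = """
# ... !designer
# ... effects:
# ...     - 1.05
# ... sizes:
# ...     - 1000
# ... """
# >>> designer = load_from_config(config)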


def design(
    to_design: str,
    dataframe: types.PassedDataType,
    metrics: types.MetricNamesType,
    sizes: types.SampleSizeType = None,
    effects: types.EffectType = None,
    first_type_errors: types.StatErrorType = (0.05,),
    second_type_errors: types.StatErrorType = (0.2,),
    method: str = "theory",
    **kwargs,
) -> types.DesignerResult:
    """
    Function wrapper around the ``Designer`` class.

    Performs experiment design based on historical data using the passed
    arguments.

    Creates an instance of the ``Designer`` class internally and executes
    its run method with the corresponding arguments.

    Parameters
    ----------
    to_design : str
        Parameter that will be designed using historical data.
        Can take the values of ``"size"``, ``"effect"`` or ``"power"``.
    dataframe : PassedDataType
        DataFrame with metrics historical values.
    metrics : MetricNamesType
        Column names of metrics in dataframe to be designed.
    sizes : SampleSizeType, optional
        Number of research objects in group samples during
        the experiment.
        If not provided, the ``effects`` value must be defined.
    effects : EffectType, optional
        Effects expected in the experiment.
        If not provided, the ``sizes`` value must be defined.
    first_type_errors : StatErrorType, default: ``(0.05,)``
        I type error bounds
        P (detect difference for equal) < alpha.
    second_type_errors : StatErrorType, default: ``(0.2,)``
        II type error bounds
        P (suppose equality for different groups) < beta.
    method : str, default: ``"theory"``
        Method used for experiment design.
        Can be ``"theory"``, ``"empiric"`` or ``"binary"``.
    **kwargs : Dict
        Other keyword arguments.

    Other Parameters
    ----------------
    as_numeric : bool, default: ``False``
        The result of calculations can be obtained either as a percentage
        string or as a number; this parameter toggles the format.
    groups_ratio : float, default: ``1.0``
        Ratio between two groups.
    alternative : str, default: ``"two-sided"``
        Alternative hypothesis, can be ``"two-sided"``, ``"greater"``
        or ``"less"``.
        ``"greater"`` - if effect is positive.
        ``"less"`` - if effect is negative.
    stabilizing_method : str, default: ``"asin"``
        Effect transformation. Can be ``"asin"`` or ``"norm"``.
        For non-binary metrics only ``"norm"`` is acceptable.
        For binary metrics both ``"norm"`` and ``"asin"`` can be used,
        but ``"asin"`` is more robust and accurate.
        Applicable only to the ``"theory"`` method and relevant for
        binary metrics!

    Returns
    -------
    result : DesignerResult
        Table or dictionary with the results of parameter design for each
        metric.
    """
    return Designer(
        dataframe=dataframe,
        metrics=metrics,
        first_type_errors=first_type_errors,
        second_type_errors=second_type_errors,
        sizes=sizes,
        effects=effects,
        method=method,
    ).run(to_design, **kwargs)


def design_binary_size(
    prob_a: float,
    effects: types.EffectType,
    first_type_errors: types.StatErrorType = (0.05,),
    second_type_errors: types.StatErrorType = (0.2,),
    method: str = "theory",
    groups_ratio: float = 1.0,
    alternative: str = "two-sided",
    stabilizing_method: str = "asin",
    **kwargs,
) -> pd.DataFrame:
    """
    Design size for binary metrics.

    Parameters
    ----------
    prob_a : float
        Probability of success for the control group.
    effects : EffectType
        List or single value of relative effects.
        For example: ``1.05``, ``[1.05, 1.2]``.
    first_type_errors : StatErrorType, default: ``(0.05,)``
       I type error bounds
       P (detect difference for equal) < alpha.
    second_type_errors : StatErrorType, default: ``(0.2,)``
       II type error bounds
       P (suppose equality for different groups) < beta.
    method : str, default: ``"theory"``
        Supports 2 methods: ``"theory"`` and ``"binary"``.
        ``"theory"`` ~ by formula using the statsmodels solve_power mechanism.
        ``"binary"`` ~ using different types of intervals.
    groups_ratio : float, default: ``1.0``
        Ratio between two groups.
    alternative : str, default: ``"two-sided"``
        Alternative hypothesis, can be ``"two-sided"``, ``"greater"``
        or ``"less"``.
        ``"greater"`` - if effect is positive.
        ``"less"`` - if effect is negative.
    stabilizing_method : str, default: ``"asin"``
        Effect transformation. Can be ``"asin"`` or ``"norm"``.
        For non-binary metrics only ``"norm"`` is acceptable.
        For binary metrics both ``"norm"`` and ``"asin"`` can be used,
        but ``"asin"`` is more robust and accurate.
    **kwargs : Dict
        Other keyword arguments.

    Returns
    -------
    result_table : pd.DataFrame
        Table with results of design.
    """
    if isinstance(effects, (float, int)):
        effects = [effects]
    if isinstance(first_type_errors, float):
        first_type_errors = [first_type_errors]
    if isinstance(second_type_errors, float):
        second_type_errors = [second_type_errors]
    if method == "theory":
        return theory_pkg.get_table_sample_size(
            mean=prob_a,
            std=None,
            effects=effects,
            first_errors=first_type_errors,
            second_errors=second_type_errors,
            target_type="binary",
            groups_ratio=groups_ratio,
            alternative=alternative,
            stabilizing_method=stabilizing_method,
        )
    elif method == "binary":
        return bin_pkg.get_table_sample_size_on_effect(
            p_a=prob_a,
            first_errors=first_type_errors,
            second_errors=second_type_errors,
            delta_relative_values=effects,
            **kwargs,
        )
    else:
        raise ValueError(f"Choose valid method from {BINARY_DESIGN_METHODS}, got {method}")


def design_binary_effect(
    prob_a: float,
    sizes: types.SampleSizeType,
    first_type_errors: types.StatErrorType = (0.05,),
    second_type_errors: types.StatErrorType = (0.2,),
    method: str = "theory",
    groups_ratio: float = 1.0,
    alternative: str = "two-sided",
    stabilizing_method: str = "asin",
    as_numeric: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Design effect for binary metrics.

    Parameters
    ----------
    prob_a : float
         Probability of success for the control group.
    sizes : SampleSizeType
        List or single value of group sizes.
        For example: ``100``, ``[100, 200]``.
    first_type_errors : StatErrorType, default: ``(0.05,)``
       I type error bounds
       P (detect difference for equal) < alpha.
    second_type_errors : StatErrorType, default: ``(0.2,)``
       II type error bounds
       P (suppose equality for different groups) < beta.
    method : str, default: ``"theory"``
        Supports 2 methods: ``"theory"`` and ``"binary"``.
        ``"theory"`` ~ by formula using the statsmodels solve_power mechanism.
        ``"binary"`` ~ using different types of intervals.
    groups_ratio : float, default: ``1.0``
        Ratio between two groups.
    alternative : str, default: ``"two-sided"``
        Alternative hypothesis, can be ``"two-sided"``, ``"greater"``
        or ``"less"``.
        ``"greater"`` - if effect is positive.
        ``"less"`` - if effect is negative.
    stabilizing_method : str, default: ``"asin"``
        Effect transformation. Can be ``"asin"`` or ``"norm"``.
        For non-binary metrics only ``"norm"`` is acceptable.
        For binary metrics both ``"norm"`` and ``"asin"`` can be used,
        but ``"asin"`` is more robust and accurate.
    as_numeric : bool, default: ``False``
        The result of calculations can be obtained either as a percentage
        string or as a number; this parameter toggles the format.
    **kwargs : Dict
        Other keyword arguments.

    Returns
    -------
    result_table : pd.DataFrame
        Table with results of design.
    """
    if isinstance(sizes, int):
        sizes = [sizes]
    if isinstance(first_type_errors, float):
        first_type_errors = [first_type_errors]
    if isinstance(second_type_errors, float):
        second_type_errors = [second_type_errors]
    if method == "theory":
        return theory_pkg.get_minimal_effects_table(
            mean=prob_a,
            std=None,
            sample_sizes=sizes,
            first_errors=first_type_errors,
            second_errors=second_type_errors,
            as_numeric=as_numeric,
            target_type="binary",
            groups_ratio=groups_ratio,
            alternative=alternative,
            stabilizing_method=stabilizing_method,
        )
    elif method == "binary":
        return bin_pkg.get_table_effect_on_sample_size(
            p_a=prob_a,
            sample_sizes=sizes,
            first_errors=first_type_errors,
            second_errors=second_type_errors,
            as_numeric=as_numeric,
            **kwargs,
        )
    else:
        raise ValueError(f"Choose valid method from {BINARY_DESIGN_METHODS}, got {method}")


def design_binary_power(
    prob_a: float,
    sizes: types.SampleSizeType,
    effects: types.EffectType,
    first_type_errors: types.StatErrorType = (0.05,),
    method: str = "theory",
    groups_ratio: float = 1.0,
    alternative: str = "two-sided",
    stabilizing_method: str = "asin",
    as_numeric: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Design power for binary metrics.

    Parameters
    ----------
    prob_a : float
       Probability of success for the control group.
    sizes : SampleSizeType
        List or single value of group sizes.
        For example: ``100``, ``[100, 200]``.
    effects : EffectType
        List or single value of relative effects.
        For example: ``1.05``, ``[1.05, 1.2]``.
    first_type_errors : StatErrorType, default: ``(0.05,)``
       I type error bounds
       P (detect difference for equal) < alpha.
    method : str, default: ``"theory"``
        Supports 2 methods: ``"theory"`` and ``"binary"``.
        ``"theory"`` ~ by formula using the statsmodels solve_power mechanism.
        ``"binary"`` ~ using different types of intervals.
    groups_ratio : float, default: ``1.0``
        Ratio between two groups.
    alternative : str, default: ``"two-sided"``
        Alternative hypothesis, can be ``"two-sided"``, ``"greater"``
        or ``"less"``.
        ``"greater"`` - if effect is positive.
        ``"less"`` - if effect is negative.
    stabilizing_method : str, default: ``"asin"``
        Effect transformation. Can be ``"asin"`` or ``"norm"``.
        For non-binary metrics only ``"norm"`` is acceptable.
        For binary metrics both ``"norm"`` and ``"asin"`` can be used,
        but ``"asin"`` is more robust and accurate.
    as_numeric : bool, default: ``False``
        The result of calculations can be obtained either as a percentage
        string or as a number; this parameter toggles the format.
    **kwargs : Dict
        Other keyword arguments.

    Returns
    -------
    result_table : pd.DataFrame
        Table with results of design.
    """
    if isinstance(effects, (int, float)):
        effects = [effects]
    if isinstance(sizes, int):
        sizes = [sizes]
    if isinstance(first_type_errors, float):
        first_type_errors = [first_type_errors]
    if method == "theory":
        return theory_pkg.get_power_table(
            mean=prob_a,
            std=None,
            sample_sizes=sizes,
            effects=effects,
            first_errors=first_type_errors,
            as_numeric=as_numeric,
            target_type="binary",
            groups_ratio=groups_ratio,
            alternative=alternative,
            stabilizing_method=stabilizing_method,
        )
    elif method == "binary":
        return bin_pkg.get_table_power_on_size_and_delta(
            p_a=prob_a,
            sample_sizes=sizes,
            first_errors=first_type_errors,
            delta_relative_values=effects,
            as_numeric=as_numeric,
            **kwargs,
        )
    else:
        raise ValueError(f"Choose valid method from {BINARY_DESIGN_METHODS}, got {method}")


def design_binary(
    to_design: str,
    prob_a: float,
    sizes: Optional[types.SampleSizeType] = None,
    effects: Optional[types.EffectType] = None,
    first_type_errors: types.StatErrorType = (0.05,),
    second_type_errors: types.StatErrorType = (0.2,),
    method: str = "theory",
    groups_ratio: float = 1.0,
    alternative: str = "two-sided",
    stabilizing_method: str = "asin",
    **kwargs,
) -> pd.DataFrame:
    """
    Design of experiment parameters for binary metrics based
    on a known conversion value.

    Parameters
    ----------
    to_design : str
        Parameter to design.
    prob_a : float
        Probability of success for the control group.
    sizes : SampleSizeType, optional
        List or single value of group sizes.
        For example: ``100``, ``[100, 200]``.
    effects : EffectType, optional
        List or single value of relative effects.
        For example: ``1.05``, ``[1.05, 1.2]``.
    first_type_errors : StatErrorType, default: ``(0.05,)``
        I type error bounds
        P (detect difference for equal) < alpha.
    second_type_errors : StatErrorType, default: ``(0.2,)``
        II type error bounds
        P (suppose equality for different groups) < beta.
    method : str, default: ``"theory"``
        Supports 2 methods: ``"theory"`` and ``"binary"``.
        ``"theory"`` ~ by formula using the statsmodels solve_power mechanism.
        ``"binary"`` ~ using different types of intervals.
    groups_ratio : float, default: ``1.0``
        Ratio between two groups.
    alternative : str, default: ``"two-sided"``
        Alternative hypothesis, can be ``"two-sided"``, ``"greater"``
        or ``"less"``.
        ``"greater"`` - if effect is positive.
        ``"less"`` - if effect is negative.
    stabilizing_method : str, default: ``"asin"``
        Effect transformation. Can be ``"asin"`` or ``"norm"``.
        For non-binary metrics only ``"norm"`` is acceptable.
        For binary metrics both ``"norm"`` and ``"asin"`` can be used,
        but ``"asin"`` is more robust and accurate.
    **kwargs : Dict
        Other keyword arguments.

    Returns
    -------
    result_table : pd.DataFrame
        Table with results of design.
    """
    if to_design == SIZE:
        return design_binary_size(
            prob_a,
            effects,
            first_type_errors,
            second_type_errors,
            method,
            groups_ratio,
            alternative,
            stabilizing_method,
            **kwargs,
        )
    elif to_design == EFFECT:
        return design_binary_effect(
            prob_a,
            sizes,
            first_type_errors,
            second_type_errors,
            method,
            groups_ratio,
            alternative,
            stabilizing_method,
            **kwargs,
        )
    elif to_design == POWER:
        return design_binary_power(
            prob_a, sizes, effects, first_type_errors, method, groups_ratio, alternative, stabilizing_method, **kwargs
        )
    else:
        raise ValueError(f"Only {SIZE}, {EFFECT} and {POWER} parameters of the binary experiment could be designed.")


================================================
FILE: ambrosia/designer/handlers.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Handlers for different dataframe types.

Module contains functions and classes that help to deal
with data of different types during the experiment design problem.

These objects are used in `Designer` core class.
"""
import warnings
from typing import List

import pandas as pd

import ambrosia.spark_tools.empiric as empiric_spark
import ambrosia.spark_tools.theory as theory_spark
import ambrosia.tools.theoretical_tools as theory_pkg
import ambrosia.tools.tools as empiric_pkg
from ambrosia import types
from ambrosia.tools.ab_abstract_component import SimpleDesigner
from ambrosia.tools.import_tools import spark_installed

if spark_installed():
    import pyspark.sql.functions as spark_functions


DATA: str = "dataframe"
AVAILABLE: List[str] = ["pandas", "spark"]
AVAILABLE_TABLES_ERROR = TypeError(f'Type of table must be one of {", ".join(AVAILABLE)}')


class TheoryHandler(SimpleDesigner):
    """
    Unit for theory design.
    """

    def size_design(self, **kwargs) -> pd.DataFrame:
        return self._handle_cases(theory_pkg.design_groups_size, theory_spark.design_groups_size, **kwargs)

    def effect_design(self, **kwargs) -> pd.DataFrame:
        return self._handle_cases(theory_pkg.design_effect, theory_spark.design_effect, **kwargs)

    def power_design(self, **kwargs) -> pd.DataFrame:
        return self._handle_cases(theory_pkg.design_power, theory_spark.design_power, **kwargs)


class EmpiricHandler(SimpleDesigner):
    """
    Unit for empiric design.
    """

    def size_design(self, **kwargs) -> pd.DataFrame:
        return self._handle_cases(empiric_pkg.get_empirical_table_sample_size, empiric_spark.get_table_size, **kwargs)

    def effect_design(self, **kwargs) -> pd.DataFrame:
        return self._handle_cases(empiric_pkg.get_empirical_mde_table, empiric_spark.get_table_effect, **kwargs)

    def power_design(self, **kwargs) -> pd.DataFrame:
        if isinstance(kwargs[DATA], types.SparkDataFrame):
            kwargs["group_sizes"] = kwargs["sample_sizes_a"]
            del kwargs["sample_sizes_a"]
            del kwargs["sample_sizes_b"]
        return self._handle_cases(empiric_pkg.get_empirical_table_power, empiric_spark.get_table_power, **kwargs)


def calc_prob_control_class(table: types.PassedDataType, metric: types.MetricNameType) -> float:
    """
    Calculate the conversion of a binary metric for a pandas or Spark dataframe.

    Parameters
    ----------
    table : SparkDataFrame or pd.DataFrame
        Table with binary metric.
    metric : MetricNameType
        Table column name that contains the binary metric of interest.

    Returns
    -------
    p_a : float
        Conversion in control group.
    """
    warning_message_values: str = "Metric values are not binary, choose empiric or theory method!"
    if isinstance(table, pd.DataFrame):
        if not set(table[metric].unique()).issubset({0, 1}):
            warnings.warn(warning_message_values)
        p_a = table[metric].mean()
    else:
        if not set(table.select(metric).distinct().toPandas()[metric]).issubset({0, 1}):
            warnings.warn(warning_message_values)
        p_a = table.select(spark_functions.mean(metric)).collect()[0][0]
    return p_a
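
# A quick sanity check of calc_prob_control_class on pandas data with
# illustrative values: for a 0/1 metric column the conversion is simply
# the column mean.
#
# >>> import pandas as pd
# >>> table = pd.DataFrame({"retention": [1, 0, 0, 1, 1]})
# >>> calc_prob_control_class(table, "retention")
# 0.6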


================================================
FILE: ambrosia/preprocessing/__init__.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Subpackage for data preprocessing, including methods for accelerating
experiments.
"""
from .aggregate import AggregatePreprocessor
from .cuped import Cuped, MultiCuped
from .ml_var_reducer import MLVarianceReducer
from .preprocessor import Preprocessor
from .robust import IQRPreprocessor, RobustPreprocessor
from .transformers import BoxCoxTransformer, LinearizationTransformer, LogTransformer

__all__ = [
    "AggregatePreprocessor",
    "Cuped",
    "MultiCuped",
    "MLVarianceReducer",
    "Preprocessor",
    "RobustPreprocessor",
    "IQRPreprocessor",
    "BoxCoxTransformer",
    "LinearizationTransformer",
    "LogTransformer",
]


================================================
FILE: ambrosia/preprocessing/aggregate.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
"""
Module contains class for data aggregation during a preprocessing task.
"""
import copy
from typing import Any, Dict, Optional, Union

import pandas as pd

from ambrosia import types
from ambrosia.tools.ab_abstract_component import AbstractFittableTransformer
from ambrosia.tools.back_tools import wrap_cols


class AggregatePreprocessor(AbstractFittableTransformer):
    """
    Preprocessing class for data aggregation.

    Can group data by multiple columns and aggregate it using methods
    for real and categorial features.

    Parameters
    ----------
    categorial_method : types.MethodType, default: ``"mode"``
        Aggregation method for categorial variables that
        will be used as the default behavior.
    real_method : types.MethodType, default: ``"sum"``
        Aggregation method for real variables that
        will be used as the default behavior.

    Attributes
    ----------
    categorial_method : types.MethodType
        Default aggregation method for categorial variables.
    real_method : types.MethodType
        Default aggregation method for real variables.
    groupby_columns : types.ColumnNamesType
        Columns which were used for grouping in the last aggregation.
        Gets value after fitting the class instance.
    agg_params : Dict
        Dictionary with aggregation rules which was used in the last
        aggregation.
        Gets value after fitting the class instance.
    """

    @staticmethod
    def __mode_calculation(values: pd.Series) -> Any:
        """
        Mode function for aggregation.
        """
        return values.value_counts().index[0]

    @staticmethod
    def __simple_agg(values: pd.Series) -> Any:
        """
        Simple aggregation, just picks the first element.
        """
        return values.iloc[0]

    @staticmethod
    def __transform_agg_param(aggregation_method: types.MethodType) -> types.MethodType:
        """
        Resolve an aggregation callable from a given string alias.
        """
        if aggregation_method == "mode":
            return AggregatePreprocessor.__mode_calculation
        if aggregation_method == "simple":
            return AggregatePreprocessor.__simple_agg
        return aggregation_method

    @staticmethod
    def __transform_params(dataframe: pd.DataFrame, aggregation_params: Dict) -> Dict:
        """
        Iteratively apply transformations specified by aggregation parameters.
        """
        agg_params = copy.deepcopy(aggregation_params)
        for column, method in agg_params.items():
            if column not in dataframe.columns:
                raise ValueError(f"{column} does not exist in the dataframe!")
            agg_params[column] = AggregatePreprocessor.__transform_agg_param(method)
        return agg_params

    def __init__(self, categorial_method: types.MethodType = "mode", real_method: types.MethodType = "sum"):
        self.categorial_method = categorial_method
        self.real_method = real_method
        self.agg_params = None
        self.groupby_columns = None
        super().__init__()

    def __real_case_step(
        self,
        agg_params: Optional[Dict] = None,
        real_cols: Optional[types.ColumnNamesType] = None,
    ) -> None:
        """
        A private method containing aggregation parameters filling logic
        for real metrics.
        """
        real_cols = wrap_cols(real_cols)
        for real_feature in real_cols:
            agg_params[real_feature] = self.real_method

    def __categorial_case_step(
        self,
        agg_params: Optional[Dict] = None,
        categorial_cols: Optional[types.ColumnNamesType] = None,
    ) -> None:
        """
        A private method containing aggregation parameters filling logic
        for categorial metrics.
        """
        categorial_cols = wrap_cols(categorial_cols)
        for categorial_feature in categorial_cols:
            agg_params[categorial_feature] = self.categorial_method

    def __empty_args_step(
        self,
        agg_params: Optional[Dict] = None,
        real_cols: Optional[types.ColumnNamesType] = None,
        categorial_cols: Optional[types.ColumnNamesType] = None,
    ) -> None:
        """
        A private method containing aggregation parameters filling logic
        if no aggregation parameters passed.
        """
        if real_cols is not None:
            self.__real_case_step(agg_params, real_cols)
        if categorial_cols is not None:
            self.__categorial_case_step(agg_params, categorial_cols)

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with the parameters of the last fit() call.
        """
        self._check_fitted()
        return {"aggregation_params": self.agg_params, "groupby_columns": self.groupby_columns}

    def load_params_dict(self, params: Dict) -> None:
        """
        Load prefitted parameters from a dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with prefitted params.
        """
        if "groupby_columns" in params:
            self.groupby_columns = params["groupby_columns"]
        else:
            raise TypeError(f"params argument must contain: {'column_names'}")
        if "aggregation_params" in params:
            self.agg_params = params["aggregation_params"]
        else:
            raise TypeError(f"params argument must contain: {'aggregation_params'}")
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        groupby_columns: types.ColumnNamesType,
        agg_params: Optional[Dict] = None,
        real_cols: Optional[types.ColumnNamesType] = None,
        categorial_cols: Optional[types.ColumnNamesType] = None,
    ) -> pd.DataFrame:
        """
        Fit preprocessor with parameters of aggregation.

        Aggregation will be performed using a passed dictionary with
        defined aggregation conditions for each column of interest,
        or using lists of columns with the default class aggregation behavior.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with selected columns.
        groupby_columns : types.ColumnNamesType
            Columns for GROUP BY.
        agg_params : Dict, optional
            Dictionary with aggregation parameters.
        real_cols : types.ColumnNamesType, optional
            Columns with real metrics.
            Overridden by the ``agg_params`` parameter; could be passed
            if the default aggregation behavior is expected.
        categorial_cols : types.ColumnNamesType, optional
            Columns with categorial metrics.
            Overridden by the ``agg_params`` parameter; could be passed
            if the default aggregation behavior is expected.

        Returns
        -------
        self : object
            Instance object.
        """
        if agg_params is None and real_cols is None and categorial_cols is None:
            raise ValueError("Set agg_params or pass real_cols and categorial_cols")
        if agg_params is None:
            agg_params = {}
            self.__empty_args_step(agg_params, real_cols, categorial_cols)
        self._check_cols(dataframe, agg_params.keys())
        self.groupby_columns = groupby_columns
        self.agg_params = copy.deepcopy(agg_params)
        self.fitted = True
        return self

    def transform(
        self,
        dataframe: pd.DataFrame,
    ) -> pd.DataFrame:
        """
        Transform the table by aggregating it with prefitted
        parameters.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table to aggregate.

        Returns
        -------
        agg_table : pd.DataFrame
            Aggregated table.
        """
        self._check_fitted()
        self._check_cols(dataframe, self.agg_params.keys())
        agg_params = AggregatePreprocessor.__transform_params(dataframe, self.agg_params)
        return dataframe.groupby(self.groupby_columns, as_index=False).agg(agg_params)

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        groupby_columns: types.ColumnNamesType,
        agg_params: Optional[Dict] = None,
        real_cols: Optional[types.ColumnNamesType] = None,
        categorial_cols: Optional[types.ColumnNamesType] = None,
    ) -> pd.DataFrame:
        """
        Fit preprocessor parameters using given dataframe and aggregate it.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table to aggregate.
        groupby_columns : types.ColumnNamesType
            Columns for GROUP BY.
        agg_params : Dict, optional
            Dictionary with aggregation parameters.
        real_cols : types.ColumnNamesType, optional
            Columns with real metrics.
            Overridden by the ``agg_params`` parameter; could be passed
            if the default aggregation behavior is expected.
        categorial_cols : types.ColumnNamesType, optional
            Columns with categorial metrics.
            Overridden by the ``agg_params`` parameter; could be passed
            if the default aggregation behavior is expected.

        Returns
        -------
        agg_table : pd.DataFrame
            Aggregated table.
        """
        self.fit(dataframe, groupby_columns, agg_params, real_cols, categorial_cols)
        return self.transform(dataframe)
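
# A minimal usage sketch for AggregatePreprocessor on hypothetical
# user-level logs: real metrics are summed and categorial ones take the
# mode (the class defaults), grouping by a user id column.
#
# >>> import pandas as pd
# >>> logs = pd.DataFrame({
# ...     "user_id": [1, 1, 2],
# ...     "revenue": [10.0, 5.0, 7.0],
# ...     "platform": ["ios", "ios", "android"],
# ... })
# >>> agg = AggregatePreprocessor()
# >>> agg.fit_transform(logs, groupby_columns=["user_id"],
# ...                   real_cols=["revenue"], categorial_cols=["platform"])
# One row per user: revenue summed, most frequent platform kept.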


================================================
FILE: ambrosia/preprocessing/cuped.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Module contains CUPED-based data transformation methods for the experiment
acceleration.
"""
from typing import Dict, List, Optional, Union

import numpy as np
import pandas as pd

from ambrosia import types
from ambrosia.tools.ab_abstract_component import AbstractVarianceReducer
from ambrosia.tools.back_tools import wrap_cols


class Cuped(AbstractVarianceReducer):
    """
    Class for data CUPED transformation.

    https://towardsdatascience.com/how-to-double-a-b-testing-speed-with-cuped-f80460825a90
    Y_hat = Y - theta * (X - mean(X))
    theta := cov(X, Y) / Var(X)
    It is important that the mean of the covariate metric does not change over time!

    Parameters
    ----------
    verbose : bool, default: ``True``
        If ``True`` will print in sys.stdout the information
        about the variance reduction.

    Attributes
    ----------
    params : Dict
        Parameters of the instance that will be updated after calling
        the fit() method. Includes:
        - target column name
        - covariate column name
        - name of column after the transformation
        - linear coefficient for CUPED transformation.
        - bias value for mean equality
    verbose : bool
        Verbose info flag.
    fitted : bool
        Flag if class was fitted.

    Examples
    --------
    Suppose we have a dataframe with user info which contains two columns:
    a "target" column and a column with the metric "income". Let us assume
    that, over time, the average of the "income" values does not change.
    Then we can use a CUPED transformation based on the "income" data to
    reduce the "target" column variation.

    >>> cuped_transformer = Cuped(verbose=True)
    >>> cuped_transformer.fit_transform(
    >>>     dataframe=dataframe,
    >>>     target_column='target',
    >>>     covariate_column='income',
    >>>     transformed_name='cuped_target',
    >>>     inplace=True,
    >>> )

    Now a new column "cuped_target" has appeared in the dataframe; we can
    use it to design our experiment and estimate the variance reduction.
    For further CUPED usage in the future experiment, let us store the
    parameters:

    >>> cuped_transformer.store_params('cuped_transform_params.json')

    Now we conduct an experiment and want to transform our data to reduce its
    variation:

    >>> cuped_transformation = Cuped()
    >>> cuped_transformation.load_params('cuped_transform_params.json')
    >>> cuped_transformation.transform(
    >>>     dataframe=exp_results,
    >>>     inplace=True,
    >>> )

    Methods
    -------
    get_params_dict()
        Returns dictionary with params if fit() method has been previously
        called.
    load_params_dict(params)
        Load params from a dictionary.
    store_params(store_path)
        Store params to json file if fit() method has been previously called.
    load_params(load_path)
        Load params from a json file.
    fit(dataframe, target_column, covariate_column, transformed_name)
        Fit the model using a specific covariate column.
    transform(dataframe, inplace)
        Transform the target column after the class instance fitting.
    fit_transform(dataframe, target_column, covariate_column, transformed_name, inplace)
        Combination of fit() and transform() methods.
    """

    THETA_NAME: str = "theta"
    BIAS_NAME: str = "bias"
    non_serializable_params: List = [THETA_NAME, BIAS_NAME]

    def __init__(self, verbose: bool = True) -> None:
        super().__init__(verbose)
        self.params["covariate_column"] = None
        self.params[Cuped.THETA_NAME] = None
        self.params[Cuped.BIAS_NAME] = None

    def __str__(self) -> str:
        return f"СUPED for {self.params['target_column']}"

    def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:
        self._check_fitted()
        y_hat: np.ndarray = y - self.params[Cuped.THETA_NAME] * (X - self.params[Cuped.BIAS_NAME])
        return y_hat

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.

        Returns
        -------
        params : Dict
            Dictionary with fitted params.
        """
        self._check_fitted()
        return {
            key: (value if key not in Cuped.non_serializable_params else value.tolist())
            for key, value in self.params.items()
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load model parameters from the dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with params.
        """
        for parameter in self.params:
            if parameter in params:
                if parameter in Cuped.non_serializable_params:
                    self.params[parameter] = np.array(params[parameter])
                else:
                    self.params[parameter] = params[parameter]
            else:
                raise TypeError(f"params argument must contain: {parameter}")
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        target_column: types.ColumnNameType,
        covariate_column: types.ColumnNameType,
        transformed_name: Optional[types.ColumnNameType] = None,
    ) -> None:
        """
        Fit to calculate CUPED parameters for target column using given
        covariate column and data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for the calculation of CUPED parameters.
        target_column : ColumnNameType
            Column from the dataframe, for which CUPED transformation will be
            applied.
        covariate_column : ColumnNameType
            Column which will be used as the covariate in CUPED transformation.
        transformed_name : ColumnNameType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        """
        self._check_cols(dataframe, [target_column, covariate_column])
        covariance: pd.DataFrame = dataframe[[target_column, covariate_column]].cov()
        covariate_variance: float = covariance.loc[covariate_column, covariate_column]

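        # theta = Cov(Y, X) / (Var(X) + EPSILON), where the small EPSILON
        # guards against division by a zero covariate variance.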
        self.params[Cuped.THETA_NAME] = covariance.loc[target_column, covariate_column] / (
            super().EPSILON + covariate_variance
        )
        self.params[Cuped.BIAS_NAME] = np.mean(dataframe[covariate_column])
        self.params["target_column"] = target_column
        self.params["covariate_column"] = covariate_column
        self.params["transformed_name"] = transformed_name
        self.fitted = True

    def transform(
        self,
        dataframe: pd.DataFrame,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Make CUPED transformation for the target column.

        Can be performed inplace or on a copy of the data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for CUPED transformation.
        inplace : bool, default: ``False``
            If ``True``, the method returns ``None`` and adds a new column
            to the original dataframe. Otherwise it returns a copy of the
            dataframe with the new column.
        """
        self._check_cols(dataframe, [self.params["target_column"], self.params["covariate_column"]])
        new_target: np.ndarray = self(
            dataframe[self.params["target_column"]], dataframe[self.params["covariate_column"]]
        )
        if self.verbose:
            old_variance: float = np.var(dataframe[self.params["target_column"]])
            new_variance: float = np.var(new_target)
            self._verbose(old_variance, new_variance)
        return self._return_result(dataframe, new_target, inplace)

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        target_column: types.ColumnNameType,
        covariate_column: types.ColumnNameType,
        transformed_name: Optional[types.ColumnNameType] = None,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Combination of fit() and transform() methods.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for fitting and applying CUPED transformation.
        target_column : ColumnNameType
            Column from the dataframe, for which CUPED transformation will be
            applied.
        covariate_column : ColumnNameType
            Column which will be used as the covariate.
        transformed_name : ColumnNameType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        inplace : bool, default: ``False``
            If ``True``, the method returns ``None`` and adds a new column
            to the original dataframe. Otherwise it returns a copy of the
            dataframe with the new column.
        """
        self.fit(dataframe, target_column, covariate_column, transformed_name)
        return self.transform(dataframe, inplace)


class MultiCuped(AbstractVarianceReducer):
    """
    Class for data Multi CUPED transformation.

    Y_hat = Y - X @ theta + mean(X_train @ theta)
    theta := argmin Var(Y - X @ theta),
    which leads to the normal equations theta = Cov(X, X)^(-1) Cov(X, y).
    It is important that the means of the covariate metrics do not change
    over time!


    Parameters
    ----------
    verbose : bool, default: ``True``
        If ``True`` will print in sys.stdout the information
        about the variance reduction.

    Attributes
    ----------
    params : Dict
        Parameters of instance that will be updated after calling fit() method.
        Include:
        - target column name
        - covariate columns names
        - name of column after the transformation
        - linear coefficients for Multi CUPED transformation.
        - bias value for mean equality
    verbose : bool
        Verbose info flag.
    fitted : bool
        Flag if class was fitted.

    Examples
    --------
    Suppose we have a dataframe with user info that contains a "target"
    column and covariate columns "income" and "age". Let us assume that,
    over time, the averages of the "income" and "age" values do not change.
    Then we can use Multi CUPED transformation based on the "income" and
    "age" data in order to reduce the "target" column variation.

    >>> cuped_transformer = MultiCuped(verbose=True)
    >>> cuped_transformer.fit_transform(
    >>>     dataframe=dataframe,
    >>>     target_column='target',
    >>>     covariate_columns=['income', 'age'],
    >>>     transformed_name='cuped_target',
    >>>     inplace=True,
    >>> )

    A new column "cuped_target" now appears in the dataframe; we can use it
    to design our experiment and estimate the variance reduction. To reuse
    Multi CUPED in the future experiment, let us store the fitted parameters:

    >>> cuped_transformer.store_params('cuped_transform_params.json')

    Once the experiment has been conducted, we want to transform the
    collected data to reduce its variance:

    >>> cuped_transformation = MultiCuped()
    >>> cuped_transformation.load_params('cuped_transform_params.json')
    >>> cuped_transformation.transform(
    >>>     exp_results,
    >>>     inplace=True,
    >>> )
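
    Under the hood, the fitted coefficients solve the normal equations.
    A minimal numpy sketch of the same estimate (illustrative only):

    >>> import numpy as np
    >>> cov = dataframe[['income', 'age', 'target']].cov()
    >>> theta = np.linalg.inv(cov.loc[['income', 'age'], ['income', 'age']]) @ (
    >>>     cov.loc[['income', 'age'], 'target'].values.reshape(2, 1)
    >>> )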

    Methods
    -------
    get_params_dict()
        Returns dictionary with params if fit() method has been previously
        called.
    load_params_dict(params)
        Load params from a dictionary.
    store_params(store_path)
        Store params to json file if fit() method has been previously called.
    load_params(load_path)
        Load params from a json file.
    fit(dataframe, target_column, covariate_columns, transformed_name)
        Fit model using covariate columns.
    transform(dataframe, inplace)
        Transform target column after a class instance fitting.
    fit_transform(dataframe, target_column, covariate_columns, transformed_name, inplace)
        Combination of fit() and transform() methods.
    """

    THETA_NAME: str = "theta"
    BIAS_NAME: str = "bias"
    non_serializable_params: List = [THETA_NAME, BIAS_NAME]

    def __init__(self, verbose: bool = True) -> None:
        super().__init__(verbose)
        self.params["covariate_columns"] = None
        self.params[MultiCuped.THETA_NAME] = None
        self.params[MultiCuped.BIAS_NAME] = None

    def __str__(self) -> str:
        return f"Multi СUPED for {self.params['target_column']}"

    def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:
        self._check_fitted()
        y_hat: np.ndarray = y - (X @ self.params[MultiCuped.THETA_NAME]).reshape(-1) + self.params[MultiCuped.BIAS_NAME]
        return y_hat

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.

        Returns
        -------
        params : Dict
            Dictionary with fitted params.
        """
        self._check_fitted()
        return {
            key: (value if key not in MultiCuped.non_serializable_params else value.tolist())
            for key, value in self.params.items()
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load model parameters from the dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with params.
        """
        for parameter in self.params:
            if parameter in params:
                if parameter in MultiCuped.non_serializable_params:
                    self.params[parameter] = np.array(params[parameter])
                else:
                    self.params[parameter] = params[parameter]
            else:
                raise TypeError(f"params argument must contain: {parameter}")
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        target_column: types.ColumnNameType,
        covariate_columns: types.ColumnNamesType,
        transformed_name: Optional[types.ColumnNameType] = None,
    ) -> None:
        """
        Fit to calculate Multi CUPED parameters for target column using selected
        covariate columns.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for the calculation of CUPED parameters.
        target_column : ColumnNameType
            Column from the dataframe, for which CUPED transformation will be
            applied.
        covariate_columns : ColumnNamesType
            Columns which will be used as the covariates in Multi CUPED
            transformation.
        transformed_name : ColumnNameType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        """
        covariate_columns = wrap_cols(covariate_columns)
        cols_concat: List = [target_column] + covariate_columns
        self._check_cols(dataframe, cols_concat)
        covariance: pd.DataFrame = dataframe[cols_concat].cov()
        matrix: pd.DataFrame = covariance.loc[covariate_columns, covariate_columns]
        num_features: int = len(covariate_columns)
        covariance_target: np.ndarray = covariance.loc[covariate_columns, target_column].values.reshape(
            num_features, -1
        )

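        # Solve the normal equations: theta = Cov(X, X)^{-1} Cov(X, y).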
        self.params[MultiCuped.THETA_NAME] = np.linalg.inv(matrix) @ covariance_target
        self.params[MultiCuped.BIAS_NAME] = (
            (dataframe[covariate_columns].values @ self.params[MultiCuped.THETA_NAME]).reshape(-1).mean()
        )
        self.params["target_column"] = target_column
        self.params["covariate_columns"] = covariate_columns
        self.params["transformed_name"] = transformed_name
        self.fitted = True

    def transform(
        self,
        dataframe: pd.DataFrame,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Make Multi CUPED transformation for the target column.

        Can be performed inplace or on a copy of the data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for Multi CUPED transformation.
        inplace : bool, default: ``False``
            If ``True``, the method returns ``None`` and adds a new column
            to the original dataframe. Otherwise it returns a copy of the
            dataframe with the new column.
        """
        self._check_cols(dataframe, [self.params["target_column"]] + self.params["covariate_columns"])
        self._check_fitted()
        new_target: np.ndarray = self(
            dataframe[self.params["target_column"]].values, dataframe[self.params["covariate_columns"]].values
        )
        if self.verbose:
            old_variance: float = np.var(dataframe[self.params["target_column"]])
            new_variance: float = np.var(new_target)
            self._verbose(old_variance, new_variance)
        return self._return_result(dataframe, new_target, inplace)

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        target_column: types.ColumnNameType,
        covariate_columns: types.ColumnNamesType,
        transformed_name: Optional[types.ColumnNameType] = None,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Combination of fit() and transform() methods.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for fitting and applying Multi CUPED transformation.
        target_column : ColumnNameType
            Column from the dataframe, for which CUPED transformation will be
            applied.
        covariate_columns : ColumnNamesType
            Columns which will be used as the covariates.
        transformed_name : ColumnNameType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        inplace : bool, default: ``False``
            If ``True``, the method returns ``None`` and adds a new column
            to the original dataframe. Otherwise it returns a copy of the
            dataframe with the new column.
        """
        self.fit(dataframe, target_column, covariate_columns, transformed_name)
        return self.transform(dataframe, inplace)


================================================
FILE: ambrosia/preprocessing/ml_var_reducer.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Module contains ML-based data transformation methods for the experiment
acceleration.
"""
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Union

import joblib
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

from ambrosia import types
from ambrosia.tools import log
from ambrosia.tools.ab_abstract_component import AbstractVarianceReducer
from ambrosia.tools.back_tools import wrap_cols


class MLVarianceReducer(AbstractVarianceReducer):
    """
    Machine Learning approach for variance reduction.

    Building a model M, we can make a transformation:
    Y_hat = Y - M(X) + MEAN(M(X))

    It is important that the mean of M(X) does not change over time!
    You can choose gradient boosting or Ridge regression, or pass your own
    model class, for example ``sklearn.ensemble.RandomForestRegressor``, and
    pass model params to the constructor function for a model assembly.

    Parameters
    ----------
    model : str or model type, default: ``"boosting"``
        Model which will be used for the transformations.
    model_params : Dict, optional
        Dictionary with parameters which will be used in constructor
        for a model assembly.
    scores : Dict[str, Callable], optional
        Scores which will be used.
    verbose : bool, default: ``True``
        If ``True`` will print in sys.stdout the information
        about the reduction in variance.

    Attributes
    ----------
    model : model type
        Model which will be used for the transformations.
    params : Dict
        Parameters of instance that will be updated after calling fit() method.
        Include:
        - target column name
        - covariate columns names
        - name of column after the transformation
        - additional train bias equals mean(M(X)).
    scores : Dict[str, Callable]
        Scores which will be used.
    verbose : bool
        Verbose info flag.
    fitted : bool
        Fit status flag.

    Examples
    --------
    We have a data table with a column 'target' and columns 'feature_1',
    'feature_2', 'feature_3'. Let us assume that the means of all these
    metrics do not change over time (each could be an age, for example).
    To reduce the variance using the predictions of an ML model, we can use
    this class:

    >>> transformer = MLVarianceReducer() # By default a CatBoost model will be chosen
    >>> transformer.fit_transform(dataframe, 'target', ['feature_1', 'feature_2', 'feature_3'],
    >>>                           transformed_name='new_target', inplace=True)
    >>> transformer.store_params('path_ml_params.json', 'path_ml_model.pkl')

    Now to transform the experimental data we use the following commands:

    >>> transformer = MLVarianceReducer()
    >>> transformer.load_params('path_ml_params.json', 'path_ml_model.pkl')
    >>> transformer.transform(exp_data, inplace=True)
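
    A custom model class can also be passed together with its constructor
    parameters (a sketch, assuming ``scikit-learn`` is installed):

    >>> from sklearn.ensemble import RandomForestRegressor
    >>> transformer = MLVarianceReducer(model=RandomForestRegressor,
    >>>                                 model_params={'n_estimators': 100})
    >>> transformer.fit_transform(dataframe, 'target', ['feature_1', 'feature_2'], inplace=True)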

    Methods
    -------
    get_params_dict()
        Returns dict with instance fitted parameters.
    load_params_dict(params)
        Load parameters from a dict.
    store_params(config_store_path, model_store_path)
        Store fitted params in a json file and dump the model to disk.
    load_params(config_load_path, model_load_path)
        Load params from a json file and a dumped model.
    fit(dataframe, target_column, covariate_columns, transformed_name)
        Fit model using train data.
    transform(dataframe, inplace)
        Transform target column of a data frame.
    fit_transform(dataframe, target_column, covariate_columns, transformed_name, inplace)
        Combination of fit() and transform() methods.
    """

    def __set_scorer(self, scores: Optional[Dict[str, Callable]]):
        """
        Support method for scorer setting.
        """
        if scores is not None:
            self.score = scores
        else:
            self.score = {"MSE": mean_squared_error}

    def __create_model(self) -> None:
        """
        Construct variance reducing ML model.
        """
        if not isinstance(self.model, str):
            self.model = self.model(**self.model_params)
        elif self.model == "linear":
            self.model = Ridge(**self.model_params)
        elif self.model == "boosting":
            if "verbose" not in self.model_params:
                self.model_params["verbose"] = False
            self.model = CatBoostRegressor(**self.model_params)
        else:
            raise ValueError(f"Unknown model alias '{self.model}', pass 'linear', 'boosting' or a model class")

    def __init__(
        self,
        model: Union[str, Any] = "boosting",
        model_params: Optional[Dict] = None,
        scores: Optional[Dict[str, Callable]] = None,
        verbose: bool = True,
    ) -> None:
        super().__init__(verbose)
        self.params["covariate_columns"] = None
        self.params["train_bias"] = None
        self.model = model
        self.model_params = {} if model_params is None else model_params
        self.__set_scorer(scores)

    def __str__(self) -> str:
        return f"ML approach reduce for {self.params['target_column']}"

    def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:
        """
        Transform target values using its predictions based on covariates.

        Class must be fitted.
        """
        self._check_fitted()
        y_hat = y - self.model.predict(X) + self.params["train_bias"]
        return y_hat

    def _verbose_score(self, dataframe: pd.DataFrame, prediction: np.ndarray) -> None:
        for name, scorer in self.score.items():
            current_score: float = scorer(dataframe[self.params["target_column"]], prediction)
            log.info_log(f"Prediction {name} score - {current_score:.5f}")

    def _check_load_params(self, params: Dict) -> None:
        for parameter in self.params:
            if parameter in params:
                self.params[parameter] = params[parameter]
            else:
                raise TypeError(f"params argument must contain: {parameter}")

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.

        Returns
        -------
        params : Dict
            Dictionary with fitted params.
        """
        self._check_fitted()
        return {
            "target_column": self.params["target_column"],
            "covariate_columns": self.params["covariate_columns"],
            "transformed_name": self.params["transformed_name"],
            "train_bias": self.params["train_bias"],
            "model": self.model,
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load instance parameters from the dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with params.
        """
        self._check_load_params(params)
        if "model" in params:
            self.model = params["model"]
        else:
            raise TypeError(f"params argument must contain: {'model'}")
        self.fitted = True

    def store_params(self, config_store_path: Path, model_store_path: Path) -> None:
        """
        Store instance params in a json file and dump the fitted model with
        ``joblib``.

        You can also reach the model using instance.model and store it by
        yourself.

        Parameters
        ----------
        config_store_path : Path
            Path where instance parameters will be stored in json format.
        model_store_path : Path
            Path where the fitted model will be dumped with ``joblib``.
        """
        self._check_fitted()
        with open(config_store_path, "w+") as file:
            json.dump(self.params, file)
        joblib.dump(self.model, model_store_path)

    def load_params(self, config_load_path: Path, model_load_path: Path) -> None:
        """
        Load instance params from a json file and the fitted model from a
        ``joblib`` dump.

        Parameters
        ----------
        config_load_path : Path
            Path to a json file with instance parameters.
        model_load_path : Path
            Path to the ``joblib`` dump of the fitted model.
        """
        with open(config_load_path, "r+") as file:
            params = json.load(file)
            self._check_load_params(params)
        self.model = joblib.load(model_load_path)
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        target_column: types.ColumnNameType,
        covariate_columns: types.ColumnNamesType,
        transformed_name: Optional[types.ColumnNamesType] = None,
    ) -> None:
        """
        Fit model for transformations.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for model fitting.
        target_column : ColumnNameType
            Column from the dataframe, for which transformation will be
            applied.
        covariate_columns: ColumnNamesType
            Columns which will be used for the transformation.
        transformed_name : ColumnNamesType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        """
        covariate_columns = wrap_cols(covariate_columns)
        self._check_cols(dataframe, [target_column] + covariate_columns)
        self.__create_model()
        self.model.fit(dataframe[covariate_columns].values, dataframe[target_column].values)

        self.params["target_column"] = target_column
        self.params["transformed_name"] = transformed_name
        self.params["covariate_columns"] = covariate_columns
        self.params["train_bias"] = np.mean(self.model.predict(dataframe[covariate_columns].values))
        self.fitted = True

    def transform(
        self,
        dataframe: pd.DataFrame,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Transform data using the fitted model.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for transformation.
        inplace : bool, default: ``False``
            If ``True``, the method returns ``None`` and adds a new column
            to the original dataframe. Otherwise it returns a copy of the
            dataframe with the new column.
        """
        self._check_cols(dataframe, [self.params["target_column"]] + self.params["covariate_columns"])
        self._check_fitted()
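        # __call__ returns y - M(X) + train_bias; the shift below additionally
        # re-centers the result so the transformed target keeps the current mean.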
        prediction: np.ndarray = self(
            dataframe[self.params["target_column"]].values, dataframe[self.params["covariate_columns"]].values
        )
        new_target: np.ndarray = prediction + np.mean(dataframe[self.params["target_column"]]) - np.mean(prediction)
        if self.verbose:
            old_variance: float = np.var(dataframe[self.params["target_column"]].values)
            new_variance: float = np.var(prediction)
            self._verbose(old_variance, new_variance)
            self._verbose_score(dataframe, prediction)
        return self._return_result(dataframe, new_target, inplace)

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        target_column: types.ColumnNameType,
        covariate_columns: types.ColumnNamesType,
        transformed_name: Optional[types.ColumnNamesType] = None,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Sequentially combine the ``fit()`` and ``transform()`` methods.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Table with data for model fitting and further transformation.
        target_column : ColumnNameType
            Column from the dataframe, for which transformation will be
            applied.
        covariate_columns: ColumnNamesType
            Columns which will be used for the transformation.
        transformed_name : ColumnNamesType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        inplace : bool, default: ``False``
            If ``True``, the method returns ``None`` and adds a new column
            to the original dataframe. Otherwise it returns a copy of the
            dataframe with the new column.
        """
        self.fit(dataframe, target_column, covariate_columns, transformed_name)
        return self.transform(dataframe, inplace)


================================================
FILE: ambrosia/preprocessing/preprocessor.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Module contains `Preprocessor` class that combines all data preprocessing
methods in one single chain pipeline. The resulting pipeline allows one to
consistently apply the desired transformations to the data, including outliers
removal, data aggregation and target metric transformations for the variance
reduction.
"""
from __future__ import annotations

import inspect
import json
import sys
from pathlib import Path
from typing import Dict, List, Optional, Union

import numpy as np
import pandas as pd

from ambrosia import types
from ambrosia.preprocessing.aggregate import AggregatePreprocessor
from ambrosia.preprocessing.cuped import Cuped, MultiCuped
from ambrosia.preprocessing.robust import IQRPreprocessor, RobustPreprocessor
from ambrosia.preprocessing.transformers import BoxCoxTransformer, LinearizationTransformer, LogTransformer


class Preprocessor:
    """
    Preprocessor class, implementation is based on the chain pattern.

    Parameters
    ----------
    dataframe : pd.DataFrame
        Table with data used for further transformations.
    verbose : bool, default: ``True``
        If ``True`` will print in sys.stdout the information
        about the variance reduction.

    Attributes
    ----------
    dataframe : pd.DataFrame
        Table with data for transformations.
    transformers : List of transformations
        List of transformations that have been applied so far.
    verbose : bool
        Verbose info flag.

    Examples
    --------
    >>> transformer = Preprocessor(dataframe)
    >>> transformed_df = (transformer.aggregate(aggregate_params)
    >>>                              .robust(robust_params)
    >>>                              .cuped(cuped_params)
    >>>                              .data())
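
    The applied chain of transformations can be stored and later replayed
    on fresh data (a sketch using the methods of this class; file and
    variable names are illustrative):

    >>> transformer.store_transformations('preprocessing.json')
    >>> new_processor = Preprocessor(new_dataframe)
    >>> transformed_new = new_processor.transform_from_config('preprocessing.json')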

    Methods
    -------
    data(copy=True)
        Returns a copy or a link for the stored dataframe.
    aggregate(groupby_columns, categorial_method, real_method, agg_params,
              real_cols, categorial_cols, load_path)
        Aggregate data by columns.
    robust(column_names, alpha, tail, load_path)
        Make a robust preprocessing of data.
    iqr(column_names, load_path)
        Make an IQR preprocessing of data.
    boxcox(column_names, load_path)
        Make a Box-Cox transformation.
    log(column_names, load_path)
        Make a log transformation.
    cuped(target, by, transformed_name, load_path)
        Make CUPED transformation for the stored dataframe.
    multicuped(target, by, transformed_name, load_path)
        Make Multi CUPED transformation for the stored dataframe.
    linearize(numerator, denominator, transformed_name, load_path)
        Linearize a ratio metric for use in A/B testing.
    transformations()
        Returns a list of transformations.
    store_transformations(store_path)
        Store transformations in a json file.
    load_transformations(load_path)
        Load transformations from a json file.
    apply_transformations()
        Apply transformations for the stored dataframe.
    transform_from_config(load_path)
        Transform inner data frame using pre-saved config file.
    """

    def __len__(self) -> int:
        return len(self.dataframe)

    def __init__(self, dataframe: pd.DataFrame, verbose: bool = True) -> None:
        self.dataframe = dataframe.copy()
        self.transformers = []
        self.verbose = verbose

    def data(self, copy: bool = True):
        """
        Return the inner data frame.

        Use after all transformations to get transformed data.

        Parameters
        ----------
        copy : bool, default: ``True``
            If ``True`` returns a copy, otherwise a reference.

        Returns
        -------
        dataframe : pd.DataFrame
            Table with the modified data after the sequential preprocessing.
        """
        return self.dataframe.copy() if copy else self.dataframe

    def aggregate(
        self,
        groupby_columns: Optional[types.ColumnNamesType] = None,
        categorial_method: types.MethodType = "mode",
        real_method: types.MethodType = "sum",
        agg_params: Optional[Dict] = None,
        real_cols: Optional[types.ColumnNamesType] = None,
        categorial_cols: Optional[types.ColumnNamesType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make an aggregation of the dataframe.

        Parameters
        ----------
        groupby_columns : List of columns, optional
            Columns for GROUP BY.
        categorial_method : types.MethodType, default: ``"mode"``
            Aggregation method that will be applied to all selected
            categorial variables.
        real_method : types.MethodType, default: ``"sum"``
            Aggregation method that will be applied to all selected
            real variables.
        agg_params : Dict, optional
            Dictionary with aggregation parameters.
        real_cols : types.ColumnNamesType, optional
            Columns with real metrics.
            Overridden by the ``agg_params`` parameter; can be passed if the
            default aggregation behavior is expected.
        categorial_cols : types.ColumnNamesType, optional
            Columns with categorial metrics.
            Overridden by the ``agg_params`` parameter; can be passed if the
            default aggregation behavior is expected.
        load_path : Path, optional
            Path to a json file with pre-fitted aggregation parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = AggregatePreprocessor(categorial_method, real_method)
        if load_path is None:
            self.dataframe = transformer.fit_transform(
                self.dataframe, groupby_columns, agg_params, real_cols, categorial_cols
            )
        else:
            transformer.load_params(load_path)
            self.dataframe = transformer.transform(self.dataframe)
        self.transformers.append(transformer)
        return self

    def robust(
        self,
        column_names: Optional[types.ColumnNamesType] = None,
        alpha: Union[float, np.ndarray] = 0.05,
        tail: str = "both",
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make a robust preprocessing of the selected columns to remove outliers.

        Removes objects from the dataframe which fall in the head, the tail,
        or both tails of the selected metrics distributions.

        Parameters
        ----------
        column_names : ColumnNamesType
            One or several columns in the dataframe.
        alpha : Union[float, np.ndarray], default: ``0.05``
            The percentage of removed data from head and tail.
        tail : str, default: ``"both"``
            Part of distribution to be removed.
            Can be ``"left"``, ``"right"`` or ``"both"``.
        load_path : Path, optional
            Path to json file with parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = RobustPreprocessor(verbose=self.verbose)
        if load_path is None:
            transformer.fit_transform(self.dataframe, column_names, alpha, tail, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def iqr(
        self,
        column_names: Optional[types.ColumnNamesType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make an IQR preprocessing of the selected columns to remove outliers.

        Removes objects from the dataframe which lie beyond the boxplot
        minimum and maximum (whiskers) of the selected metrics distributions.

        Parameters
        ----------
        column_names : ColumnNamesType, optional
            One or several columns in the dataframe.
        load_path : Path, optional
            Path to json file with parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = IQRPreprocessor(verbose=self.verbose)
        if load_path is None:
            transformer.fit_transform(self.dataframe, column_names, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def boxcox(
        self,
        column_names: Optional[types.ColumnNamesType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make a Box-Cox transformation on the selected columns.

        Optimal transformation parameters are selected automatically.

        Parameters
        ----------
        column_names : ColumnNamesType, optional
            One or several columns in the dataframe.
        load_path : Path, optional
            Path to json file with parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = BoxCoxTransformer()
        if load_path is None:
            transformer.fit_transform(self.dataframe, column_names, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def log(
        self,
        column_names: Optional[types.ColumnNamesType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make a logarithmic transformation on the selected columns.

        Parameters
        ----------
        column_names : ColumnNamesType, optional
            One or several columns in the dataframe.
        load_path : Path, optional
            Path to json file with parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = LogTransformer()
        if load_path is None:
            transformer.fit_transform(self.dataframe, column_names, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def cuped(
        self,
        target: Optional[types.ColumnNameType] = None,
        by: Optional[types.ColumnNameType] = None,
        transformed_name: Optional[types.ColumnNameType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make CUPED transformation on the selected column.

        Parameters
        ----------
        target : ColumnNameType
            Column from the dataframe, for which CUPED transformation will be
            applied.
        by : ColumnNameType
            Covariate column in the dataframe.
        transformed_name : types.ColumnNameType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        load_path : Path, optional
            Path to json file with parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = Cuped(verbose=self.verbose)
        if load_path is None:
            transformer.fit_transform(self.dataframe, target, by, transformed_name, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def multicuped(
        self,
        target: Optional[types.ColumnNameType] = None,
        by: Optional[types.ColumnNamesType] = None,
        transformed_name: Optional[types.ColumnNameType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Make Multi CUPED transformation on the selected column.

        Parameters
        ----------
        target : ColumnNameType
            Column from the dataframe, for which CUPED transformation will be
            applied.
        by : ColumnNamesType
            Covariate columns in the dataframe.
        transformed_name : types.ColumnNameType, optional
            Name for the new transformed target column; if it is not defined,
            it will be generated automatically.
        load_path : Path, optional
            Path to json file with parameters.

        Returns
        -------
        self : Preprocessor
            Instance object
        """
        transformer = MultiCuped(verbose=self.verbose)
        if load_path is None:
            transformer.fit_transform(self.dataframe, target, by, transformed_name, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def linearize(
        self,
        numerator: types.ColumnNameType,
        denominator: types.ColumnNameType,
        transformed_name: Optional[types.ColumnNameType] = None,
        load_path: Optional[Path] = None,
    ) -> Preprocessor:
        """
        Linearize a ratio metric for use in A/B testing.

        Computes a per-unit linearized value that is approximately normally
        distributed, enabling correct t-test usage for ratio metrics:

            linearized_i = numerator_i - ratio * denominator_i

        where ratio = mean(numerator) / mean(denominator) is estimated on
        the data passed to this ``Preprocessor`` instance (reference / control data).
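
        For example, if the reference data has mean revenue 100 and mean
        orders 4, then ratio = 100 / 4 = 25, and a user with revenue 120 and
        5 orders gets the linearized value 120 - 25 * 5 = -5.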

        Parameters
        ----------
        numerator : ColumnNameType
            Column name of the ratio numerator (e.g. ``"revenue"``).
        denominator : ColumnNameType
            Column name of the ratio denominator (e.g. ``"orders"``).
        transformed_name : ColumnNameType, optional
            Name for the new linearized column. Defaults to
            ``"{numerator}_lin"``.
        load_path : Path, optional
            Path to a json file with pre-fitted parameters.

        Returns
        -------
        self : Preprocessor
            Instance object.
        """
        transformer = LinearizationTransformer()
        if load_path is None:
            transformer.fit_transform(self.dataframe, numerator, denominator, transformed_name, inplace=True)
        else:
            transformer.load_params(load_path)
            transformer.transform(self.dataframe, inplace=True)
        self.transformers.append(transformer)
        return self

    def transformations(self) -> List:
        """
        List of all transformations which were called.

        Returns
        -------
        transformers : List[object]
            List of executed transformations
        """
        return self.transformers

    def store_transformations(self, store_path: Path) -> None:
        """
        Store transformations with parameters in the json file.

        Parameters
        ----------
        store_path : Path
            Path to a json file where transformations will be stored
        """
        if len(self.transformers) == 0:
            raise ValueError("No transformations have been made yet.")
        transformations_counter = {}
        transformations_config = {}
        for transformer in self.transformers:
            alias = transformer.__class__.__name__
            if alias in transformations_counter:
                transformations_counter[alias] += 1
            else:
                transformations_counter[alias] = 1
            alias += "_" + str(transformations_counter[alias])
            transformations_config[alias] = transformer.get_params_dict()

        with open(store_path, "w+") as file:
            json.dump(transformations_config, file)

    def load_transformations(self, load_path: Path) -> None:
        """
        Load pre-saved transformations from the json file.

        Parameters
        ----------
        load_path : Path
            Path to a json file where transformations are stored
        """
        with open(load_path, "r+") as file:
            params = json.load(file)
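        # Stored aliases look like "ClassName_1"; keeping only the alphabetic
        # characters recovers the class name, which is then instantiated from
        # this module's namespace.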
        for key, value in params.items():
            class_alias = "".join(filter(str.isalpha, key))
            transformer = getattr(sys.modules[__name__], class_alias)
            kwargs = {}
            if "verbose" in inspect.signature(transformer).parameters:
                kwargs["verbose"] = self.verbose
            transformer = transformer(**kwargs)
            transformer.load_params_dict(value)
            self.transformers.append(transformer)

    def apply_transformations(self) -> pd.DataFrame:
        """
        Apply all transformations to the inner data frame.

        Returns
        -------
        dataframe : pd.DataFrame
            Transformed inner data frame
        """
        for transformer in self.transformers:
            if isinstance(transformer, AggregatePreprocessor):
                self.dataframe = transformer.transform(self.dataframe)
            else:
                transformer.transform(self.dataframe, inplace=True)
        return self.data()

    def transform_from_config(self, load_path: Path) -> pd.DataFrame:
        """
        Run transformations from the config file on the internal data frame.

        Parameters
        ----------
        load_path : Path
            Path to a json file where transformations are stored.

        Returns
        -------
        dataframe : pd.DataFrame
            Transformed inner data frame
        """
        self.load_transformations(load_path)
        return self.apply_transformations()


================================================
FILE: ambrosia/preprocessing/robust.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Module contains tools for outliers removal from data during a
preprocessing task.
"""
from typing import Dict, Iterable, List, Union

import numpy as np
import pandas as pd

from ambrosia import types
from ambrosia.tools import log
from ambrosia.tools.ab_abstract_component import AbstractFittableTransformer
from ambrosia.tools.back_tools import wrap_cols


class RobustPreprocessor(AbstractFittableTransformer):
    """
    Unit for simple robust transformation for avoiding outliers in data.

    It cuts an alpha share of the distribution from the head, the tail, or
    both sides for each given metric.
    The data distribution is assumed to consist of a small alpha part of
    outliers at the head, the regular part of the data, and another alpha
    part of outliers at the tail of the distribution.

    Parameters
    ----------
    verbose : bool, default: ``True``
        If ``True`` will show info about the transformation of passed columns.

    Attributes
    ----------
    params : Dict
        Dictionary with operational parameters of the instance.
        Updated after calling the ``fit`` method.
    verbose : bool
        Verbose info flag.
    available_tails : List
        List of the available tail type names to preprocess
    non_serializable_params: List
        List of the class parameters that should be converted to lists
        in order to serialize.
    fitted : bool
        Fit flag.

    Examples
    --------
    >>> robust = RobustPreprocessor(verbose=True)
    >>> robust.fit(dataframe, ['column1', 'column2'], alpha=0.05)
    >>> robust.transform(dataframe, inplace=True)

    You can pass one or several columns; if several columns are passed, it
    will drop in total an alpha share of extreme values for each column.
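
    Per-column alpha values and one-sided cuts are also supported
    (a sketch; the column names are illustrative):

    >>> robust = RobustPreprocessor(verbose=False)
    >>> robust.fit_transform(dataframe, ['column1', 'column2'],
    >>>                      alpha=np.array([0.01, 0.05]), tail='right',
    >>>                      inplace=True)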
    """

    available_tails: List = ["both", "left", "right"]
    non_serializable_params: List = ["alpha", "quantiles"]

    def __str__(self) -> str:
        return "Robust preprocessing"

    def __init__(self, verbose: bool = True) -> None:
        """
        RobustPreprocessor class constructor.
        """
        self.params = {
            "tail": None,
            "column_names": None,
            "alpha": None,
            "quantiles": None,
        }
        self.verbose = verbose
        super().__init__()

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.

        Returns
        -------
        params : Dict
            Dictionary with fitted params.
        """
        self._check_fitted()
        return {
            key: (value if key not in RobustPreprocessor.non_serializable_params else value.tolist())
            for key, value in self.params.items()
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load prefitted parameters from a dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with prefitted params.
        """
        for parameter in self.params:
            if parameter in params:
                if parameter in RobustPreprocessor.non_serializable_params:
                    self.params[parameter] = np.array(params[parameter])
                else:
                    self.params[parameter] = params[parameter]
            else:
                raise TypeError(f"params argument must contain: {parameter}")
        self.fitted = True

    def __wrap_alpha(self, alpha: Union[float, Iterable]) -> np.ndarray:
        columns_num = len(self.params["column_names"])
        if isinstance(alpha, float):
            alpha = np.array([alpha] * columns_num)
        elif isinstance(alpha, Iterable):
            alpha = np.array(alpha)
        else:
            raise ValueError("Alpha parameter must be float or an iterable")
        if len(alpha) != columns_num:
            raise ValueError("Alpha length must be equal to the columns number")
        if (alpha < 0).any() or (alpha >= 0.5).any():
            raise ValueError(f"Alpha value must be from 0 to 0.5, but alpha vector = {alpha}")
        return alpha

    def __check_tail(self, tail: str) -> str:
        if tail not in self.available_tails:
            raise ValueError(f"tail must be one of {RobustPreprocessor.available_tails}")
        return tail

    def __calculate_quantiles(
        self,
        dataframe: pd.DataFrame,
    ) -> None:
        columns_num = len(self.params["column_names"])
        if self.params["tail"] == "both":
            self.params["quantiles"] = np.zeros((columns_num, 2))
            for num, col in enumerate(self.params["column_names"]):
                alpha = self.params["alpha"][num] / 2
                self.params["quantiles"][num, :] = np.quantile(dataframe[col].values, [alpha, 1 - alpha])
        else:
            self.params["quantiles"] = np.zeros((columns_num, 1))
            for num, col in enumerate(self.params["column_names"]):
                alpha = self.params["alpha"][num] if self.params["tail"] == "left" else 1 - self.params["alpha"][num]
                self.params["quantiles"][num] = np.quantile(dataframe[col].values, alpha)

    def fit(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
        alpha: Union[float, np.ndarray] = 0.05,
        tail: str = "both",
    ):
        """
        Fit to calculate robust parameters for the selected columns.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to calculate quantiles.
        column_names : ColumnNamesType
            One or several columns in the dataframe.
        alpha : Union[float, np.ndarray], default: ``0.05``
            The percentage of removed data from head and tail.
        tail : str, default: ``"both"``
            Part of distribution to be removed.
            Can be ``"left"``, ``"right"`` or ``"both"``.

        Returns
        -------
        self : object
            Instance object.
        """
        self.params["column_names"] = wrap_cols(column_names)
        self._check_cols(dataframe, self.params["column_names"])
        self.params["alpha"] = self.__wrap_alpha(alpha)
        self.params["tail"] = self.__check_tail(tail)
        self.__calculate_quantiles(dataframe)
        self.fitted = True
        return self

    def transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Remove objects from the dataframe which fall in the head, the tail,
        or both alpha parts of the chosen metrics distributions.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to transform.
        inplace : bool, default: ``False``
            If ``True`` transforms the given dataframe, otherwise copies it
            and returns the transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self._check_fitted()
        self._check_cols(dataframe, self.params["column_names"])
        if self.verbose:
            prev_stats: List[Dict[str, float]] = log.RobustLogger.get_stats(dataframe, self.params["column_names"])

        transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
        if self.params["tail"] == "both":
            mask: pd.Series = (transformed[self.params["column_names"]] < self.params["quantiles"][:, 0]).any(
                axis=1
            ) | (transformed[self.params["column_names"]] > self.params["quantiles"][:, 1]).any(axis=1)
        elif self.params["tail"] == "left":
            mask = (transformed[self.params["column_names"]] < self.params["quantiles"].T).any(axis=1)
        elif self.params["tail"] == "right":
            mask = (transformed[self.params["column_names"]] > self.params["quantiles"].T).any(axis=1)
        bad_ids = transformed.loc[mask].index
        transformed.drop(bad_ids, inplace=True)

        if self.verbose:
            log.info_log(
                f"""Making {self.params['tail']}-tail robust transformation of columns {self.params['column_names']}
                 with alphas = {np.round(self.params['alpha'], 3)}"""
            )
            new_stats: Dict[str, float] = log.RobustLogger.get_stats(transformed, self.params["column_names"])
            log.RobustLogger.verbose_list(prev_stats, new_stats, self.params["column_names"])
        return None if inplace else transformed

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
        alpha: Union[float, np.ndarray] = 0.05,
        tail: str = "both",
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Fit preprocessor parameters using given dataframe and transform it.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to calculate quantiles and for further transformation.
        column_names : ColumnNamesType
            One or several columns in the dataframe.
        alpha : Union[float, np.ndarray], default: ``0.05``
            The percentage of removed data from head and tail.
        tail : str, default: ``"both"``
            Part of distribution to be removed.
            Can be ``"left"``, ``"right"`` or ``"both"``.
        inplace : bool, default: ``False``
            If ``True`` transforms the given dataframe, otherwise copies it
            and returns the transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self.fit(dataframe, column_names, alpha, tail)
        return self.transform(dataframe, inplace)


class IQRPreprocessor(AbstractFittableTransformer):
    """
    Unit for IQR transformation of the data to exclude outliers.

    It cuts the points from the distribution which fall outside the range
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where Q1 and Q3 are the 0.25 and 0.75
    quantiles of each given metric.
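
    For example, with the first quartile Q1 = 10 and the third quartile
    Q3 = 20, IQR = 10, so points outside [Q1 - 15, Q3 + 15] = [-5, 35]
    are removed.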


    Parameters
    ----------
    verbose : bool, default: ``True``
        If ``True`` will show info about the transformation of passed columns.

    Attributes
    ----------
    params : Dict
        Dictionary with operational parameters of the instance.
        Updated after calling the ``fit`` method.
    verbose : bool
        Verbose info flag.
    non_serializable_params: List
        List of the class parameters that should be converted to lists
        in order to serialize.
    fitted : bool
        Fit flag.

    Examples
    --------
    >>> iqr = IQRPreprocessor(verbose=True)
    >>> iqr.fit(dataframe, ['column1', 'column2'])
    >>> iqr.transform(dataframe, inplace=True)

    You can pass one or several columns; if several columns are passed, it
    will drop extreme values for each column.
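
    The fitted parameters can also be reused on new data via the dict-based
    methods (a short sketch; ``another_dataframe`` is illustrative):

    >>> params = iqr.get_params_dict()
    >>> fresh = IQRPreprocessor()
    >>> fresh.load_params_dict(params)
    >>> new_df = fresh.transform(another_dataframe, inplace=False)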
    """

    non_serializable_params: List = ["medians", "quartiles"]

    def __str__(self) -> str:
        return "IQR outliers preprocessing"

    def __init__(self, verbose: bool = True) -> None:
        """
        IQRPreprocessor class constructor.
        """
        self.params = {"column_names": None, "medians": None, "quartiles": None}
        self.verbose = verbose
        super().__init__()

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.

        Returns
        -------
        params : Dict
            Dictionary with fitted params.
        """
        self._check_fitted()
        return {
            key: (value if key not in IQRPreprocessor.non_serializable_params else value.tolist())
            for key, value in self.params.items()
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load prefitted parameters from a dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with prefitted params.
        """
        for parameter in self.params:
            if parameter in params:
                if parameter in IQRPreprocessor.non_serializable_params:
                    self.params[parameter] = np.array(params[parameter])
                else:
                    self.params[parameter] = params[parameter]
            else:
                raise TypeError(f"params argument must contain: {parameter}")
        self.fitted = True

    def __calculate_params(
        self,
        dataframe: pd.DataFrame,
    ):
        X: np.ndarray = dataframe[self.params["column_names"]].values
        self.params["quartiles"] = np.quantile(X, (0.25, 0.75), axis=0).T
        self.params["medians"] = np.median(X, axis=0).T

    def fit(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
    ):
        """
        Fit to calculate iqr parameters for the selected columns.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to calculate quantiles.
        column_names : ColumnNamesType
            A single column name or a list of column names from the dataframe.

        Returns
        -------
        self : object
            Instance object.
        """
        self.params["column_names"] = wrap_cols(column_names)
        self._check_cols(dataframe, self.params["column_names"])
        self.__calculate_params(dataframe)
        self.fitted = True
        return self

    def transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Remove rows whose values fall outside the boxplot whiskers
        (minimum and maximum) of each metric distribution.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to transform.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self._check_fitted()
        self._check_cols(dataframe, self.params["column_names"])
        if self.verbose:
            prev_stats: List[Dict[str, float]] = log.RobustLogger.get_stats(dataframe, self.params["column_names"])

        transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
        iqr: np.ndarray = self.params["quartiles"][:, 1] - self.params["quartiles"][:, 0]
        tail: np.ndarray = self.params["quartiles"][:, 0] - 1.5 * iqr
        head: np.ndarray = self.params["quartiles"][:, 1] + 1.5 * iqr
        mask: pd.Series = (
            (transformed[self.params["column_names"]] < tail) | (transformed[self.params["column_names"]] > head)
        ).any(axis=1)
        bad_ids = transformed.loc[mask].index
        transformed.drop(bad_ids, inplace=True)

        if self.verbose:
            log.info_log(f"Making IQR transformation of columns {self.params['column_names']}")
            new_stats: List[Dict[str, float]] = log.RobustLogger.get_stats(transformed, self.params["column_names"])
            log.RobustLogger.verbose_list(prev_stats, new_stats, self.params["column_names"])
        return None if inplace else transformed

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Fit preprocessor parameters using given dataframe and transform it.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to calculate quantiles and for further transformation.
        column_names : ColumnNamesType
            A single column name or a list of column names from the dataframe.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self.fit(dataframe, column_names)
        return self.transform(dataframe, inplace)
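
# A minimal usage sketch (illustrative only, not part of the library source):
# fit IQRPreprocessor once, serialize the fitted parameters through
# get_params_dict, restore them on a fresh instance and transform. The data
# and column name are made up for the example.
def _example_iqr_roundtrip() -> pd.DataFrame:
    demo = pd.DataFrame({"metric": [1.0, 2.0, 2.5, 3.0, 100.0]})
    fitted = IQRPreprocessor(verbose=False).fit(demo, "metric")
    restored = IQRPreprocessor(verbose=False)
    restored.load_params_dict(fitted.get_params_dict())
    # The 100.0 row lies above Q3 + 1.5 * IQR and is dropped.
    return restored.transform(demo)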


================================================
FILE: ambrosia/preprocessing/transformers.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

"""
Module contains tools for metrics transformations during a
preprocessing task.
"""
from typing import Dict, Optional, Union

import numpy as np
import pandas as pd
import scipy.stats as sps

from ambrosia import types
from ambrosia.tools.ab_abstract_component import AbstractFittableTransformer
from ambrosia.tools.back_tools import wrap_cols


class BoxCoxTransformer(AbstractFittableTransformer):
    """
    Unit for a Box-Cox transformation of the pandas data.

    A Box-Cox transformation helps to transform non-normal dependent variables
    into a normal shape. All variable values must be positive.

    Optimal transformation lambdas are selected automatically during
    the transformer fit process.


    Attributes
    ----------
    column_names : List
        Names of the columns selected for the transformation.
    lambda_ : np.ndarray
        Array of parameters used during the transformation of the
        selected columns.
    fitted : bool
        Fit flag.

    Examples
    --------
    >>> boxcox = BoxCoxTransformer()
    >>> boxcox.fit(dataframe, ['column1', 'column2'])
    >>> boxcox.transform(dataframe, inplace=True)

    """

    def __str__(self) -> str:
        return "Box-Cox transformation"

    def __init__(
        self,
    ) -> None:
        """
        BoxCoxTransformer class constructor.
        """
        self.column_names = None
        self.lambda_ = None
        super().__init__()

    def __calculate_lambda_(
        self,
        dataframe: pd.DataFrame,
    ) -> None:
        columns_num: int = len(self.column_names)
        self.lambda_ = np.zeros(columns_num)
        X: np.ndarray = dataframe[self.column_names].values
        for num in range(columns_num):
            self.lambda_[num] = sps.boxcox(X[:, num])[1]

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.

        Returns
        -------
        params : Dict
            Dictionary with fitted params.
        """
        self._check_fitted()
        return {
            "column_names": self.column_names,
            "lambda_": self.lambda_.tolist(),
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load instance parameters from the dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with params.
        """
        if "column_names" in params:
            self.column_names = params["column_names"]
        else:
            raise TypeError(f"params argument must contain: {'column_names'}")
        if "lambda_" in params:
            self.lambda_ = np.array(params["lambda_"])
        else:
            raise TypeError(f"params argument must contain: {'lambda_'}")
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
    ):
        """
        Fit to calculate transformation parameters for the selected columns.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to calculate optimal transformation parameters.
        column_names : ColumnNamesType
            A single column name or a list of column names from the dataframe.

        Returns
        -------
        self : object
            Instance object.
        """
        self.column_names = wrap_cols(column_names)
        self._check_cols(dataframe, self.column_names)
        self.__calculate_lambda_(dataframe)
        self.fitted = True
        return self

    def transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Apply Box-Cox transformation for the data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to transform.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self._check_fitted()
        self._check_cols(dataframe, self.column_names)
        transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
        X: np.ndarray = transformed[self.column_names].values
        for num in range(len(self.column_names)):
            if self.lambda_[num] == 0:
                X[:, num] = np.log(X[:, num])
            else:
                X[:, num] = (X[:, num] ** self.lambda_[num] - 1) / self.lambda_[num]
        transformed[self.column_names] = X
        return None if inplace else transformed

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Fit transformer parameters using given dataframe and transform it.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe for calculation of optimal parameters and further
            transformation.
        column_names : ColumnNamesType
            A single column name or a list of column names from the dataframe.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self.fit(dataframe, column_names)
        return self.transform(dataframe, inplace)

    def inverse_transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Apply inverse Box-Cox transformation for the data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to inverse transform.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self._check_fitted()
        self._check_cols(dataframe, self.column_names)
        transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
        X_tr: np.ndarray = transformed[self.column_names].values
        for num in range(len(self.column_names)):
            if self.lambda_[num] == 0:
                X_tr[:, num] = np.exp(X_tr[:, num])
            else:
                X_tr[:, num] = (X_tr[:, num] * self.lambda_[num] + 1) ** (1 / self.lambda_[num])
        transformed[self.column_names] = X_tr
        return None if inplace else transformed
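
# A minimal round-trip sketch (illustrative only, not part of the library
# source): transform followed by inverse_transform should recover the
# original values up to floating-point error. The data are synthetic.
def _example_boxcox_roundtrip() -> bool:
    rng = np.random.default_rng(0)
    demo = pd.DataFrame({"metric": rng.exponential(scale=2.0, size=1000) + 0.1})
    original = demo["metric"].to_numpy().copy()
    boxcox = BoxCoxTransformer()
    boxcox.fit_transform(demo, "metric", inplace=True)
    boxcox.inverse_transform(demo, inplace=True)
    return bool(np.allclose(demo["metric"].to_numpy(), original))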


class LogTransformer(AbstractFittableTransformer):
    """
    Unit for a logarithmic transformation of the pandas data.

    A logarithmic transformation helps to transform some metrics distributions
    into a more normal shape and reduce the variance.
    All metrics values must be positive.


    Attributes
    ----------
    column_names : List
        Names of the columns selected for the transformation.
    fitted : bool
        Fit flag.

    Examples
    --------
    >>> log = LogTransformer()
    >>> log.fit(dataframe, ['column1', 'column2'])
    >>> log.transform(dataframe, inplace=True)

    """

    def __str__(self) -> str:
        return "Logarithmic transformation"

    def __init__(self) -> None:
        """
        LogTransformer class constructor.
        """
        self.column_names = None
        super().__init__()

    def get_params_dict(self) -> Dict:
        """
        Returns a dictionary with params.
        """
        self._check_fitted()
        return {
            "column_names": self.column_names,
        }

    def load_params_dict(self, params: Dict) -> None:
        """
        Load instance parameters from the dictionary.

        Parameters
        ----------
        params : Dict
            Dictionary with params.
        """
        if "column_names" in params:
            self.column_names = params["column_names"]
        else:
            raise TypeError(f"params argument must contain: {'column_names'}")
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
    ):
        """
        Fit names of the selected columns.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe with metrics.
        column_names : ColumnNamesType
            A single column name or a list of column names from the dataframe.

        Returns
        -------
        self : object
            Instance object.
        """
        self.column_names = wrap_cols(column_names)
        self._check_cols(dataframe, self.column_names)
        self.fitted = True
        return self

    def transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Apply log transformation for the data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to transform.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self._check_fitted()
        self._check_cols(dataframe, self.column_names)
        transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
        if (transformed[self.column_names] > 0).all(axis=None):
            transformed[self.column_names] = np.log(transformed[self.column_names].values)
        else:
            raise ValueError(f"All values in columns {self.column_names} must be positive")
        return None if inplace else transformed

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        column_names: types.ColumnNamesType,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Fit transformer parameters using given dataframe and transform it.

        Only column names are fittable.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to transform.
        column_names : ColumnNamesType
            A single column name or a list of column names from the dataframe.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self.fit(dataframe, column_names)
        return self.transform(dataframe, inplace)

    def inverse_transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Apply inverse log transformation for the data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to inverse transform.
        inplace : bool, default: ``False``
            If ``True``, transforms the given dataframe in place,
            otherwise returns a transformed copy.

        Returns
        -------
        df : Union[pd.DataFrame, None]
            Transformed dataframe or None
        """
        self._check_fitted()
        self._check_cols(dataframe, self.column_names)
        transformed: pd.DataFrame = dataframe if inplace else dataframe.copy()
        transformed[self.column_names] = np.exp(transformed[self.column_names].values)
        return None if inplace else transformed
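
# A minimal sketch (illustrative only, not part of the library source):
# transform raises ValueError on non-positive values, so shift the metric
# into the positive domain first if zeros are possible. The data are made up.
def _example_log_usage() -> pd.DataFrame:
    demo = pd.DataFrame({"metric": [0.0, 1.0, 2.0]})
    log_transformer = LogTransformer()
    try:
        return log_transformer.fit_transform(demo, "metric")
    except ValueError:
        demo["metric"] += 1.0  # shift away from zero, then retry
        return log_transformer.fit_transform(demo, "metric")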


class LinearizationTransformer(AbstractFittableTransformer):
    """
    Linearization transformer for ratio metrics.

    Converts a ratio metric (numerator / denominator) into a per-unit linearized
    metric that is approximately normally distributed, enabling correct t-test usage:

        linearized_i = numerator_i - ratio * denominator_i

    where ratio = mean(numerator) / mean(denominator), estimated on the reference
    (control group / historical) data passed to fit().

    Parameters
    ----------
    numerator : str
        Column name of the ratio numerator (e.g. "revenue").
    denominator : str
        Column name of the ratio denominator (e.g. "orders").
    transformed_name : str, optional
        Name for the new column. Defaults to ``"{numerator}_lin"``.

    Examples
    --------
    >>> transformer = LinearizationTransformer()
    >>> transformer.fit(control_df, "revenue", "orders", "arpu_lin")
    >>> transformer.transform(experiment_df, inplace=True)
    """

    def __str__(self) -> str:
        return "Linearization transformation"

    def __init__(self) -> None:
        self.numerator: Optional[str] = None
        self.denominator: Optional[str] = None
        self.transformed_name: Optional[str] = None
        self.ratio: Optional[float] = None
        super().__init__()

    def get_params_dict(self) -> Dict:
        self._check_fitted()
        return {
            "numerator": self.numerator,
            "denominator": self.denominator,
            "transformed_name": self.transformed_name,
            "ratio": self.ratio,
        }

    def load_params_dict(self, params: Dict) -> None:
        for key in ("numerator", "denominator", "transformed_name", "ratio"):
            if key not in params:
                raise TypeError(f"params argument must contain: {key}")
            setattr(self, key, params[key])
        self.fitted = True

    def fit(
        self,
        dataframe: pd.DataFrame,
        numerator: str,
        denominator: str,
        transformed_name: Optional[str] = None,
    ):
        """
        Estimate ratio = mean(numerator) / mean(denominator) on reference data.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Reference dataframe (typically control group or historical data).
        numerator : str
            Column name of the ratio numerator.
        denominator : str
            Column name of the ratio denominator.
        transformed_name : str, optional
            Name for the linearized column. Defaults to ``"{numerator}_lin"``.
        """
        self._check_cols(dataframe, [numerator, denominator])
        denom_mean = dataframe[denominator].mean()
        if denom_mean == 0:
            raise ValueError(f"Mean of denominator column '{denominator}' is zero; cannot compute ratio.")
        self.numerator = numerator
        self.denominator = denominator
        self.transformed_name = transformed_name if transformed_name is not None else f"{numerator}_lin"
        self.ratio = dataframe[numerator].mean() / denom_mean
        self.fitted = True
        return self

    def transform(self, dataframe: pd.DataFrame, inplace: bool = False) -> Union[pd.DataFrame, None]:
        """
        Apply linearization: transformed = numerator - ratio * denominator.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Dataframe to transform.
        inplace : bool, default: ``False``
            If ``True`` modifies dataframe in place, otherwise returns a copy.
        """
        self._check_fitted()
        self._check_cols(dataframe, [self.numerator, self.denominator])
        df = dataframe if inplace else dataframe.copy()
        df[self.transformed_name] = df[self.numerator] - self.ratio * df[self.denominator]
        return None if inplace else df

    def fit_transform(
        self,
        dataframe: pd.DataFrame,
        numerator: str,
        denominator: str,
        transformed_name: Optional[str] = None,
        inplace: bool = False,
    ) -> Union[pd.DataFrame, None]:
        """
        Fit and transform in one step.

        Parameters
        ----------
        dataframe : pd.DataFrame
            Reference dataframe for fitting and transformation.
        numerator : str
            Column name of the ratio numerator.
        denominator : str
            Column name of the ratio denominator.
        transformed_name : str, optional
            Name for the linearized column.
        inplace : bool, default: ``False``
            If ``True`` modifies dataframe in place.
        """
        self.fit(dataframe, numerator, denominator, transformed_name)
        return self.transform(dataframe, inplace)
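
# An end-to-end sketch (illustrative only, not part of the library source):
# estimate the ratio on the control group, apply the same fitted ratio to
# both groups, then compare the per-unit linearized metric with a plain
# t-test. Column names and data are synthetic.
def _example_linearization() -> float:
    rng = np.random.default_rng(42)
    control = pd.DataFrame(
        {"revenue": rng.gamma(2.0, 50.0, size=500), "orders": rng.poisson(3.0, size=500) + 1}
    )
    test = pd.DataFrame(
        {"revenue": rng.gamma(2.1, 50.0, size=500), "orders": rng.poisson(3.0, size=500) + 1}
    )
    lin = LinearizationTransformer().fit(control, "revenue", "orders", "revenue_lin")
    control_lin = lin.transform(control)
    test_lin = lin.transform(test)
    return sps.ttest_ind(control_lin["revenue_lin"], test_lin["revenue_lin"]).pvalue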


================================================
FILE: ambrosia/spark_tools/__init__.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from .theory import design_effect, design_groups_size, design_power

__all__ = [
    "design_groups_size",
    "design_effect",
    "design_power",
]


================================================
FILE: ambrosia/spark_tools/empiric.py
================================================
#  Copyright 2022 MTS (Mobile Telesystems)
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

from typing import Any, Dict, Iterable, List

import numpy as np
import pandas as pd
from joblib import Parallel, delayed, parallel_backend
from tqdm.auto import tqdm

import ambrosia.spark_tools.theory as th_pkg
import ambrosia.tools._lib._bootstrap_tools as solver_pkg
import ambrosia.tools._lib._selection_aide as select_pkg
from ambrosia import types
from ambrosia.tools.import_tools import spark_installed

if spark_installed():
    import pyspark.sql.functions as spark_functions

BOOSTRAP_BASE_CONST: int = 10
RANDOM_SAMPLE_SEED: int = 42
FIRST_TYPE_ERROR: float = 0.05
THREADS_BOOTSTRAP: int = 2  # Creates a significant reduction in runtime
N_JOBS_MULTIPROCESS: int = 1
ACCEPTED_CRITERIA: List[str] = ["ttest"]
BOOTSTRAP_BACKEND: str = "threading"
ROUND_DIGITS_PERCENT: int = 1


def inject_effect(dataframe: types.SparkDataFrame, column: types.ColumnNameType, effect: float) -> types.SparkDataFrame:
    """
    Injects effect to column of given dataframe and returns injected one.

    Injection is performed by adding mean * (effect - 1) to each value.

    Parameters
    ----------
    dataframe : Spark dataframe
        Table with the column where the effect will be injected
    column : Column type
        Column which will be used
    effect : float
        Value of the effect; for example, for 20% growth pass 1.2

    Returns
    -------
    effected_dataframe : Spark dataframe
        Table with changed column
    """
    multiplicator: float = effect - 1
    current_mean, _ = th_pkg.get_stats_from_table(dataframe, column)
    return dataframe.withColumn(column, spark_functions.col(column) + current_mean * multiplicator)
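
# A numeric illustration (plain numpy, not Spark code, and not part of the
# library source): adding mean * (effect - 1) to every value scales the
# column mean by exactly `effect`, which is what inject_effect does above.
def _example_inject_effect_numpy() -> bool:
    values = np.array([1.0, 2.0, 3.0])
    effect = 1.2
    shifted = values + values.mean() * (effect - 1)
    return bool(np.isclose(shifted.mean(), values.mean() * effect))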


def evaluate_criterion(
    dataframe: types.SparkDataFrame,
    column: types.ColumnNameType,
    effect: float,
    group_size: int,
    alpha: float = FIRST_TYPE_ERROR,
    criterion: str = ACCEPTED_CRITERIA[0],
) -> int:
    """
    Evaluate the criterion; returns 0 if H0 is not rejected, 1 otherwise.

    Checks the hypothesis for the given column with the effect injected
    for the given sample size.
    H0: the means of the two groups are equal; H1: the means are not equal

    Parameters
    ----------
    dataframe : Spark dataframe
        Table with the column which will be used
    column : Column type
        Column with metric, which will be used in criterion
    effect : float
        Value of effect to be tested, for example, for 20% effect pass 1.2
    group_size : int
        Size of each of the two groups which will be sampled
    alpha : float
        Bound for the first type error; rejection uses pvalue < alpha
    criterion : str
        Name of criterion, default ttest, see list acceptable criteria

    Returns
    -------
    is_rejected : bool
        Whether H0 is rejected in favor of H1
    """
    total_size: int = dataframe.count()
    if group_size * 2 > total_size:
        err_msg: str = "Total sampled values more than table size"
        raise ValueError(err_msg)

    part: float = 2 * group_size / total_size
    data_a, data_b = dataframe.sample(part).randomSplit([0.5, 0.5], seed=RANDOM_SAMPLE_SEED)
    data_b = inject_effect(data_b, column, effect)

    if criterion == "ttest":
        _, pvalue = th_pkg.ttest_spark(data_a, data_b, column)
    else:
        err_msg: str = f"Choose criterion from {ACCEPTED_CRITERIA}"
        raise ValueError(err_msg)
    return pvalue < alpha
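
# A pandas-free analogue of the procedure above (illustrative only, not part
# of the library source): sample two groups, inject the effect into one, run
# a t-test and report whether H0 is rejected at level alpha.
def _example_evaluate_criterion_numpy(
    values: np.ndarray, effect: float, group_size: int, alpha: float = FIRST_TYPE_ERROR
) -> bool:
    import scipy.stats as sps  # local import: scipy is not used elsewhere in this module

    rng = np.random.default_rng(RANDOM_SAMPLE_SEED)
    sample = rng.choice(values, size=2 * group_size, replace=False)
    group_a, group_b = sample[:group_size], sample[group_size:]
    group_b = group_b + group_b.mean() * (effect - 1)
    return bool(sps.ttest_ind(group_a, group_b).pvalue < alpha)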


def calc_empiric_power(
    dataframe: types.SparkDataFrame,
    column: types.ColumnNameType,
    effect: float,
    group_size: int,
    first_error: float = FIRST_TYPE_ERROR,
    bootstrap_size: int = BOOSTRAP_BASE_CONST,
    criterion: str = ACCEPTED_CRITERIA[0],
    threads: int = THREADS_BOOTSTRAP,
) -> float:
    """
    Calculate empiric power of criterion via thread pool.

    Parameters
    ----------
    dataframe : Spark dataframe
        Table with the column which will be used
    column : Column type
        Column with metric, which will be used in criterion
    effect : float
        Value of effect to be tested, for example, for 20% effect pass 1.2
    group_size: int
        Size of each of the two groups which will be sampled
    first_error : float
        Bound for the first type error; rejection uses pvalue < first_error
    bootstrap_size : int
        Amount of groups to be sampled
    criterion : str
        Name of criterion, default ttest, see list acceptable criteria
    threads : int
        Amount of threads used in thread pool

    Returns
    -------
    empirical_power : float
        Empirical power, calculated as the frequency of rejected hypotheses
    """
    if threads > 1:
        with parallel_backend(BOOTSTRAP_BACKEND, n_jobs=threads):
            exp_results = Parallel(verbose=False)(
                delayed(evaluate_criterion)(
                    dataframe=dataframe,
                    column=column,
                    effect=effect,
                    group_size=group_size,
                    alpha=first_error,
                    criterion=criterion,
                )
                for _ in range(bootstrap_size)
            )
    else:
        exp_results = []
        for _ in range(bootstrap_size):
            exp_results.append(
                evaluate_criterion(
                    dataframe=dataframe,
                    column=column,
                    effect=effect,
                    alpha=first_error,
                    group_size=group_size,
                    criterion=criterion,
                )
            )
    return np.mean(exp_results)


def get_table_power(
    dataframe: types.SparkDataFrame,
    metrics: Iterable[types.ColumnNameType],
    effects: Iterable[float],
    group_sizes: Iterable[int],
    alphas: Iterable[float],
    bootstrap_size: int = BOOSTRAP_BASE_CONST,
    threads: int = THREADS_BOOTSTRAP,
    use_tqdm: bool = True,
    as_numeric: bool = False,
) -> types.DesignerResult:
    """
    Calculate a table of empirical criterion power with effects as rows and group sizes as columns.

    Parameters
    ----------
    dataframe : Spark dataframe
        Table with the column which will be used
    metrics : Iterable of column type
        Iterable set of columns for designing
    effects : Iterable[float]
        List of effects which we want to check
    group_sizes : Iterable[int]
        List of group sizes which we want to check
    alphas : Iterable[float]
        1st type error bound, passed as list, for example [0.05]
    bootstrap_size: int, default: ``10``
        Number of pairs of A/B groups to be sampled for power estimation
    threads : int
        Amount of threads for thread pool
    use_tqdm : bool
        Whether to use progress bar
    as_numeric : bool, default False
        Whether to return a number or a string with percentages

    Returns
    -------
    report : Union[pd.DataFrame, Dict[str, pd.DataFrame]]
        Tables of empirical power estimates
        Effects for indices
        Group sizes for columns
        One table per metric: dict[metric name] = power table,
        or a single table if only one metric is passed
    if len(alphas) > 1:
        raise ValueError("For power table you can pass only one first error bound")
    results: Dict[types.ColumnNameType, pd.DataFrame] = {}
    for column in metrics:
        result = pd.DataFrame(columns=group_sizes, index=effects)
        # Recreate the iterator per metric: a zip/tqdm iterator is exhausted
        # after a single pass.
        iterate_params = tqdm(list(zip(effects, group_sizes))) if use_tqdm else zip(effects, group_sizes)
        for effect, group_size in iterate_params:
            power = calc_empiric_power(
                dataframe=dataframe,
                column=column,
                effect=effect,
                group_size=group_size,
                first_error=alphas[0],
                bootstrap_size=bootstrap_size,
                threads=threads,
            )
            if as_numeric:
                result.loc[effect, group_size] = power
            else:
                result.loc[effect, group_size] = str(round(power * 100, ROUND_DIGITS_PERCENT)) + "%"
        result.columns.name = "sample sizes"
        result.index.name = "effect"
        results[column] = result
    return results[metrics[0]] if len(results) == 1 else results


def optimize_group_size(
    dataframe: types.SparkDataFrame,
    column: types.ColumnNameType,
    effect: float,
    beta: float,
    first_error: float = FIRST_TYPE_ERROR,
    bootstrap_size: int = BOOSTRAP_BASE_CONST,
    threads: int = THREADS_BOOTSTRAP,
) -> int:
    """
    Optimize group size for fixed effect and errors using empiric solution.
    Spark requests are made using thread pool.

    Parameters
    ----------
    dataframe : Spark table
        Table for designing experiment
    column : Column type
        Column containing the metric for designing
    effect : float
        Value of the effect to be tested; for example, pass 1.2 for 20% growth
    beta: float
        2nd type error bound
    first_error : float, default: ``0.05``
        1st type error bound
    bootstrap_size : int, default: ``10``
        Number of pairs of A/B groups to be sampled for power estimation
    threads : int
        Amount of threads for thread pool

    Returns
    -------
    optimal_size : int
        Group size calculated via empirical power optimization
    """
    power: float = 1 - beta
    solver = solver_pkg.EmpiricSizeSolution(calc_empiric_power, power, ["group_size"])
    return solver.calc_binary(
        dataframe=dataframe,
        column=column,
        effect=effect,
        first_error=first_error,
        bootstrap_size=bootstrap_size,
        threads=threads,
    )
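
# A sketch of the idea behind the empiric size search (the actual logic lives
# in EmpiricSizeSolution, whose internals are not shown here): binary search
# for the smallest group size whose estimated power reaches the target,
# assuming power is monotone in the group size.
def _example_binary_size_search(power_at, target_power: float, upper_bound: int) -> int:
    # power_at: callable mapping a group size to an estimated power in [0, 1]
    low, high = 1, upper_bound
    while low < high:
        mid = (low + high) // 2
        if power_at(mid) >= target_power:
            high = mid
        else:
            low = mid + 1
    return low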


def get_table_size(
    dataframe: types.SparkDataFrame,
    metrics: Iterable[types.ColumnNameType],
    effects: Iterable[float],
    betas: Iterable[float],
    alphas: Iterable[float],
    bootstrap_size: int = BOOSTRAP_BASE_CONST,
    threads: int = THREADS_BOOTSTRAP,
    n_jobs: int = N_JOBS_MULTIPROCESS,
    use_tqdm: bool = True,
    as_numeric: bool = False,
) -> types.DesignerResult:
    """
    Find group sizes by varying the other parameters for many columns,
    using a thread pool for power estimation.

    Parameters
    ----------
    dataframe : Spark table
        Table for designing experiment
    metrics : Iterable of column type
        Iterable set of columns for designing
    effects : Iterable[float]
        List of effects which we want to check
    betas : Iterable[float]
        2nd type error bounds
    alphas : Iterable[float]
        1st type error bounds
    bootstrap_size : int, default: ``10``
        Number of pairs of A/B groups to be sampled for power estimation
    threads : int
        Amount of threads for thread pool
    n_jobs : int
        Number of jobs for iterating over metrics
    use_tqdm : bool
        Whether to use progress bar
    as_numeric : bool, default False
        Whether to return a number or a string with percentages

    Returns
    -------
    report : Union[pd.DataFrame, Dict[str, pd.DataFrame]]
        Group size tables with effects for indices and
        (alpha (1st type error), beta (2nd type error)) pairs for columns;
        a dict with one table per metric, or a single table
        if only one metric is passed
    """
    params: Dict[str, Iterable[Any]] = {"effect": effects, "beta": betas, "first_error": alphas}
    results: Dict[types.ColumnNameType, pd.DataFrame] = {}
    for column_name in metrics:
        selector = select_pkg.Selector(
            optimize_group_size,
            params,
            n_jobs,
            use_tqdm,
            dataframe=dataframe,
            column=column_name,
            bootstrap_size=bootstrap_size,
            threads=threads,
        )
        results[column_name] = selector.get_table_size(as_numeric)
    return results[metrics[0]] if len(results) == 1 else results


def optimize_effect(
    dataframe: types.SparkDataFrame,
    column: types.ColumnNameType,
    group_size: int,
    beta: float,
    first_error: float = FIRST_TYPE_ERROR,
    bootstrap_size: int = BOOSTRAP_BASE_CONST,
    threads: int = THREADS_BOOTSTRAP,
) -> int:
    """
    Optimize effect for fixed size and errors using empiric solution.
    Spark requests are made using thread pool.

    Parameters
    ----------
    dataframe : Spark table
        Table for designing experiment
    column : Column type
        Column containing the metric for designing
    group_size : int
        Size of each of the two groups
    beta : float
        2nd type error bound
SYMBOL INDEX (586 symbols across 50 files)

FILE: ambrosia/designer/designer.py
  class Designer (line 50) | class Designer(yaml.YAMLObject, ABToolAbstract, metaclass=ABMetaClass):
    method set_first_errors (line 192) | def set_first_errors(self, first_type_errors: types.StatErrorType) -> ...
    method set_second_errors (line 198) | def set_second_errors(self, second_type_errors: types.StatErrorType) -...
    method set_sizes (line 204) | def set_sizes(self, sizes: types.SampleSizeType) -> None:
    method set_effects (line 210) | def set_effects(self, effects: types.EffectType) -> None:
    method set_dataframe (line 216) | def set_dataframe(self, dataframe: types.PassedDataType) -> None:
    method set_method (line 225) | def set_method(self, method: str) -> None:
    method set_metrics (line 228) | def set_metrics(self, metrics: str) -> None:
    method __init__ (line 234) | def __init__(
    method __getstate__ (line 255) | def __getstate__(self):
    method from_yaml (line 269) | def from_yaml(cls, loader: yaml.Loader, node: yaml.Node):
    method __dataframe_handler (line 274) | def __dataframe_handler(handler: SimpleDesigner, parameter: str, **kwa...
    method __theory_design (line 289) | def __theory_design(label: str, args: types._UsageArgumentsType, **kwa...
    method __empiric_design (line 314) | def __empiric_design(label: str, args: types._UsageArgumentsType, **kw...
    method __binary_design (line 335) | def __binary_design(label: str, args: types._UsageArgumentsType, **kwa...
    method __pre_design (line 361) | def __pre_design(label: str, args: types._UsageArgumentsType, **kwargs...
    method run (line 375) | def run(
  function load_from_config (line 484) | def load_from_config(yaml_config: str, loader: type = yaml.Loader) -> De...
  function design (line 499) | def design(
  function design_binary_size (line 582) | def design_binary_size(
  function design_binary_effect (line 663) | def design_binary_effect(
  function design_binary_power (line 750) | def design_binary_power(
  function design_binary (line 837) | def design_binary(

FILE: ambrosia/designer/handlers.py
  class TheoryHandler (line 45) | class TheoryHandler(SimpleDesigner):
    method size_design (line 50) | def size_design(self, **kwargs) -> pd.DataFrame:
    method effect_design (line 53) | def effect_design(self, **kwargs) -> pd.DataFrame:
    method power_design (line 56) | def power_design(self, **kwargs) -> pd.DataFrame:
  class EmpiricHandler (line 60) | class EmpiricHandler(SimpleDesigner):
    method size_design (line 65) | def size_design(self, **kwargs) -> pd.DataFrame:
    method effect_design (line 68) | def effect_design(self, **kwargs) -> pd.DataFrame:
    method power_design (line 71) | def power_design(self, **kwargs) -> pd.DataFrame:
  function calc_prob_control_class (line 79) | def calc_prob_control_class(table: types.PassedDataType, metric: types.M...

FILE: ambrosia/preprocessing/aggregate.py
  class AggregatePreprocessor (line 27) | class AggregatePreprocessor(AbstractFittableTransformer):
    method __mode_calculation (line 59) | def __mode_calculation(values: pd.Series) -> Any:
    method __simple_agg (line 66) | def __simple_agg(values: pd.Series) -> Any:
    method __transform_agg_param (line 73) | def __transform_agg_param(aggregation_method: types.MethodType) -> typ...
    method __transform_params (line 84) | def __transform_params(dataframe: pd.DataFrame, aggregation_params: Di...
    method __init__ (line 95) | def __init__(self, categorial_method: types.MethodType = "mode", real_...
    method __real_case_step (line 102) | def __real_case_step(
    method __categorial_case_step (line 115) | def __categorial_case_step(
    method __empty_args_step (line 128) | def __empty_args_step(
    method get_params_dict (line 143) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 150) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 169) | def fit(
    method transform (line 217) | def transform(
    method fit_transform (line 240) | def fit_transform(

FILE: ambrosia/preprocessing/cuped.py
  class Cuped (line 29) | class Cuped(AbstractVarianceReducer):
    method __init__ (line 115) | def __init__(self, verbose: bool = True) -> None:
    method __str__ (line 121) | def __str__(self) -> str:
    method __call__ (line 124) | def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:
    method get_params_dict (line 129) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 144) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 163) | def fit(
    method transform (line 200) | def transform(
    method fit_transform (line 229) | def fit_transform(
  class MultiCuped (line 261) | class MultiCuped(AbstractVarianceReducer):
    method __init__ (line 352) | def __init__(self, verbose: bool = True) -> None:
    method __str__ (line 358) | def __str__(self) -> str:
    method __call__ (line 361) | def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:
    method get_params_dict (line 366) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 381) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 400) | def fit(
    method transform (line 444) | def transform(
    method fit_transform (line 474) | def fit_transform(

FILE: ambrosia/preprocessing/ml_var_reducer.py
  class MLVarianceReducer (line 36) | class MLVarianceReducer(AbstractVarianceReducer):
    method __set_scorer (line 114) | def __set_scorer(self, scores: Optional[Dict[str, Callable]]):
    method __create_model (line 123) | def __create_model(self) -> None:
    method __init__ (line 136) | def __init__(
    method __str__ (line 150) | def __str__(self) -> str:
    method __call__ (line 153) | def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:
    method _verbose_score (line 163) | def _verbose_score(self, dataframe: pd.DataFrame, prediction: np.ndarr...
    method _check_load_params (line 168) | def _check_load_params(self, params: Dict) -> None:
    method get_params_dict (line 175) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 193) | def load_params_dict(self, params: Dict) -> None:
    method store_params (line 209) | def store_params(self, config_store_path: Path, model_store_path: Path...
    method load_params (line 226) | def load_params(self, config_load_path: Path, model_load_path: Path) -...
    method fit (line 241) | def fit(
    method transform (line 275) | def transform(
    method fit_transform (line 305) | def fit_transform(

FILE: ambrosia/preprocessing/preprocessor.py
  class Preprocessor (line 40) | class Preprocessor:
    method __len__ (line 100) | def __len__(self) -> int:
    method __init__ (line 103) | def __init__(self, dataframe: pd.DataFrame, verbose: bool = True) -> N...
    method data (line 108) | def data(self, copy: bool = True):
    method aggregate (line 126) | def aggregate(
    method robust (line 176) | def robust(
    method iqr (line 215) | def iqr(
    method boxcox (line 247) | def boxcox(
    method log (line 278) | def log(
    method cuped (line 307) | def cuped(
    method multicuped (line 344) | def multicuped(
    method linearize (line 381) | def linearize(
    method transformations (line 425) | def transformations(self) -> List:
    method store_transformations (line 436) | def store_transformations(self, store_path: Path) -> None:
    method load_transformations (line 461) | def load_transformations(self, load_path: Path) -> None:
    method apply_transformations (line 482) | def apply_transformations(self) -> pd.DataFrame:
    method transform_from_config (line 498) | def transform_from_config(self, load_path: Path) -> pd.DataFrame:

FILE: ambrosia/preprocessing/robust.py
  class RobustPreprocessor (line 30) | class RobustPreprocessor(AbstractFittableTransformer):
    method __str__ (line 73) | def __str__(self) -> str:
    method __init__ (line 76) | def __init__(self, verbose: bool = True) -> None:
    method get_params_dict (line 89) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 104) | def load_params_dict(self, params: Dict) -> None:
    method __wrap_alpha (line 123) | def __wrap_alpha(self, alpha: Union[float, Iterable]) -> np.ndarray:
    method __check_tail (line 137) | def __check_tail(self, tail: str) -> str:
    method __calculate_quantiles (line 142) | def __calculate_quantiles(
    method fit (line 158) | def fit(
    method transform (line 193) | def transform(self, dataframe: pd.DataFrame, inplace: bool = False) ->...
    method fit_transform (line 237) | def fit_transform(
  class IQRPreprocessor (line 272) | class IQRPreprocessor(AbstractFittableTransformer):
    method __str__ (line 311) | def __str__(self) -> str:
    method __init__ (line 314) | def __init__(self, verbose: bool = True) -> None:
    method get_params_dict (line 322) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 337) | def load_params_dict(self, params: Dict) -> None:
    method __calculate_params (line 356) | def __calculate_params(
    method fit (line 364) | def fit(
    method transform (line 390) | def transform(self, dataframe: pd.DataFrame, inplace: bool = False) ->...
    method fit_transform (line 429) | def fit_transform(

FILE: ambrosia/preprocessing/transformers.py
  class BoxCoxTransformer (line 30) | class BoxCoxTransformer(AbstractFittableTransformer):
    method __str__ (line 59) | def __str__(self) -> str:
    method __init__ (line 62) | def __init__(
    method __calculate_lambda_ (line 72) | def __calculate_lambda_(
    method get_params_dict (line 82) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 97) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 116) | def fit(
    method transform (line 142) | def transform(self, dataframe: pd.DataFrame, inplace: bool = False) ->...
    method fit_transform (line 171) | def fit_transform(
    method inverse_transform (line 199) | def inverse_transform(self, dataframe: pd.DataFrame, inplace: bool = F...
  class LogTransformer (line 229) | class LogTransformer(AbstractFittableTransformer):
    method __str__ (line 253) | def __str__(self) -> str:
    method __init__ (line 256) | def __init__(self) -> None:
    method get_params_dict (line 263) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 272) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 287) | def fit(
    method transform (line 312) | def transform(self, dataframe: pd.DataFrame, inplace: bool = False) ->...
    method fit_transform (line 338) | def fit_transform(
    method inverse_transform (line 367) | def inverse_transform(self, dataframe: pd.DataFrame, inplace: bool = F...
  class LinearizationTransformer (line 391) | class LinearizationTransformer(AbstractFittableTransformer):
    method __str__ (line 419) | def __str__(self) -> str:
    method __init__ (line 422) | def __init__(self) -> None:
    method get_params_dict (line 429) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 438) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 445) | def fit(
    method transform (line 477) | def transform(self, dataframe: pd.DataFrame, inplace: bool = False) ->...
    method fit_transform (line 494) | def fit_transform(

FILE: ambrosia/spark_tools/empiric.py
  function inject_effect (line 41) | def inject_effect(dataframe: types.SparkDataFrame, column: types.ColumnN...
  function evaluate_criterion (line 66) | def evaluate_criterion(
  function calc_empiric_power (line 117) | def calc_empiric_power(
  function get_table_power (line 183) | def get_table_power(
  function optimize_group_size (line 251) | def optimize_group_size(
  function get_table_size (line 298) | def get_table_size(
  function optimize_effect (line 362) | def optimize_effect(
  function get_table_effect (line 404) | def get_table_effect(

FILE: ambrosia/spark_tools/split_tools.py
  function unite_spark_tables (line 32) | def unite_spark_tables(*dataframes: types.SparkDataFrame) -> types.Spark...
  function add_hash_column (line 45) | def add_hash_column(
  function get_hash_split (line 73) | def get_hash_split(
  function add_to_required_size (line 101) | def add_to_required_size(
  function get_split (line 139) | def get_split(

FILE: ambrosia/spark_tools/stat_criteria.py
  class ABSparkCriterion (line 34) | class ABSparkCriterion(ABStatCriterion):
    method _init_cache (line 39) | def _init_cache(self) -> None:
    method __init__ (line 43) | def __init__(self, cache_parameters: bool = True) -> None:
    method _delete_cached_data_parameters (line 47) | def _delete_cached_data_parameters(self) -> None:
    method _calc_and_cache_data_parameters (line 50) | def _calc_and_cache_data_parameters(self, *args, **kwargs) -> None:
    method _recalc_cache (line 56) | def _recalc_cache(self, *args, **kwargs) -> None:
    method _check_clear_cache (line 60) | def _check_clear_cache(self) -> None:
    method _check_effect (line 64) | def _check_effect(self, effect_type: str) -> None:
    method get_results (line 67) | def get_results(
  class TtestIndCriterionSpark (line 86) | class TtestIndCriterionSpark(ABSparkCriterion):
    method __calc_and_cache_data_parameters (line 95) | def __calc_and_cache_data_parameters(
    method _apply_delta_method (line 104) | def _apply_delta_method(
    method calculate_pvalue (line 122) | def calculate_pvalue(
    method calculate_effect (line 150) | def calculate_effect(
    method calculate_conf_interval (line 169) | def calculate_conf_interval(
  class TtestRelativeCriterionSpark (line 200) | class TtestRelativeCriterionSpark(ABSparkCriterion):
    method _rename_col (line 210) | def _rename_col(column: str, group: str) -> str:
    method _calc_and_cache_data_parameters (line 213) | def _calc_and_cache_data_parameters(
    method calculate_pvalue (line 240) | def calculate_pvalue(
    method calculate_conf_interval (line 258) | def calculate_conf_interval(
    method calculate_effect (line 268) | def calculate_effect(

FILE: ambrosia/spark_tools/stratification.py
  class Stratification (line 30) | class Stratification(ab_abstract.StratificationUtil):
    method fit (line 36) | def fit(self, dataframe: types.SparkDataFrame, columns: Optional[Itera...
    method strat_sizes (line 50) | def strat_sizes(self) -> Dict[int, int]:

FILE: ambrosia/spark_tools/theory.py
  function get_stats_from_table (line 28) | def get_stats_from_table(dataframe: types.SparkDataFrame, column: types....
  function design_groups_size (line 40) | def design_groups_size(
  function design_effect (line 79) | def design_effect(
  function design_power (line 116) | def design_power(
  function ttest_spark (line 153) | def ttest_spark(

FILE: ambrosia/splitter/handlers.py
  function add_data_pandas (line 44) | def add_data_pandas(dataframe: pd.DataFrame, splitted_dataframe: pd.Data...
  function add_data_spark (line 68) | def add_data_spark(
  function add_data_to_splitted (line 101) | def add_data_to_splitted(
  function handle_full_split (line 134) | def handle_full_split(
  function data_shape (line 150) | def data_shape(dataframe: types.PassedDataType) -> int:
  function split_data_handler (line 162) | def split_data_handler(**kwargs) -> types.SplitterResult:

FILE: ambrosia/splitter/splitter.py
  class Splitter (line 42) | class Splitter(yaml.YAMLObject, ABToolAbstract, metaclass=ABMetaClass):
    method set_dataframe (line 181) | def set_dataframe(self, dataframe: Optional[types.PassedDataType]) -> ...
    method set_id_column (line 185) | def set_id_column(self, id_column: Optional[str]) -> None:
    method set_group_size (line 189) | def set_group_size(self, groups_size: Optional[int]) -> None:
    method set_test_group_ids (line 193) | def set_test_group_ids(self, test_group_ids: types.IndicesType) -> None:
    method set_fit_columns (line 197) | def set_fit_columns(self, fit_columns: types.ColumnNamesType) -> None:
    method set_strat_columns (line 201) | def set_strat_columns(self, strat_columns: types.ColumnNamesType) -> N...
    method __init__ (line 204) | def __init__(
    method __getstate__ (line 223) | def __getstate__(self):
    method from_yaml (line 235) | def from_yaml(cls, loader: yaml.Loader, node: yaml.Node):
    method run (line 239) | def run(
  function load_from_config (line 350) | def load_from_config(yaml_config: str, loader: type = yaml.Loader) -> Sp...
  function split (line 364) | def split(

FILE: ambrosia/tester/binary_result_evaluation.py
  function binary_absolute_result (line 33) | def binary_absolute_result(
  function binary_relative_confidence_interval (line 80) | def binary_relative_confidence_interval(
  function binary_relative_result (line 117) | def binary_relative_result(

FILE: ambrosia/tester/handlers.py
  function filter_spark_and_make_groups (line 19) | def filter_spark_and_make_groups(
  class PandasCriteria (line 38) | class PandasCriteria(enum.Enum):
  class SparkCriteria (line 45) | class SparkCriteria(enum.Enum):
  class TheoreticalTesterHandler (line 52) | class TheoreticalTesterHandler:
    method __init__ (line 53) | def __init__(
    method _correct_criterion (line 73) | def _correct_criterion(self, criterion: tp.Any) -> bool:
    method _raise_correct_criterion (line 76) | def _raise_correct_criterion(self, criterion: tp.Any) -> None:
    method get_criterion (line 80) | def get_criterion(self, criterion: str, data_example: types.SparkOrPan...
    method _set_kwargs (line 89) | def _set_kwargs(self):
    method solve (line 104) | def solve(self) -> types._SubResultType:

FILE: ambrosia/tester/tester.py
  class Tester (line 58) | class Tester(ABToolAbstract):
    method set_experiment_results (line 204) | def set_experiment_results(self, experiment_results: types.ExperimentR...
    method set_errors (line 207) | def set_errors(self, first_type_errors: types.StatErrorType) -> None:
    method set_metrics (line 213) | def set_metrics(self, metrics: types.MetricNamesType) -> None:
    method set_dataframe (line 219) | def set_dataframe(
    method __init__ (line 240) | def __init__(
    method __filter_data (line 270) | def __filter_data(
    method __bootstrap_result (line 303) | def __bootstrap_result(
    method __binary_result (line 336) | def __binary_result(
    method __theory_handler (line 353) | def __theory_handler(
    method __pre_run (line 374) | def __pre_run(method: str, args: types._UsageArgumentsType, **kwargs) ...
    method __apply_first_stage_multitest_correction (line 419) | def __apply_first_stage_multitest_correction(
    method __apply_second_stage_multitest_correction (line 431) | def __apply_second_stage_multitest_correction(
    method as_table (line 443) | def as_table(dict_result: types.TesterResult) -> pd.DataFrame:
    method run (line 476) | def run(
  function test (line 614) | def test(
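
A minimal sketch of the testing API indexed above. The parameter names follow the set_* methods and the test suite (metrics, method, effect_type); the "group" column convention and the method values ("theory", "binary", "empiric") are assumptions inferred from the handler names.

    # Minimal sketch of the Tester API; parameter names are assumptions.
    import numpy as np
    import pandas as pd
    from ambrosia.tester import Tester

    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "group": ["A"] * 500 + ["B"] * 500,
        "ltv": np.concatenate(
            [rng.normal(100, 10, 500), rng.normal(103, 10, 500)]
        ),
    })

    tester = Tester(dataframe=df, metrics="ltv", column_groups="group")
    result = tester.run(method="theory", effect_type="absolute")  # assumed
    print(result)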

FILE: ambrosia/tools/_lib/_bin_ci_aide.py
  function __helper_calc_empirical_power (line 23) | def __helper_calc_empirical_power(conf_interval: types.ManyIntervalType)...
  function __helper_bin_search_for_size (line 44) | def __helper_bin_search_for_size(
  function __helper_bin_search_for_delta (line 107) | def __helper_bin_search_for_delta(

FILE: ambrosia/tools/_lib/_bootstrap_tools.py
  class EmpiricSolution (line 20) | class EmpiricSolution:
    method __init__ (line 25) | def __init__(self, power_calulation: Callable, desired_power: float, v...
    method power (line 33) | def power(self, **kwargs) -> float:
  class EmpiricSizeSolution (line 37) | class EmpiricSizeSolution(EmpiricSolution):
    method calc_upper_bound (line 42) | def calc_upper_bound(self, **kwargs) -> int:
    method calc_binary (line 48) | def calc_binary(self, **kwargs) -> int:
  class EmpiricEffectSolution (line 55) | class EmpiricEffectSolution(EmpiricSolution):
    method calc_upper_bound (line 60) | def calc_upper_bound(self, **kwargs) -> float:
    method calc_binary (line 64) | def calc_binary(self, **kwargs) -> int:

FILE: ambrosia/tools/_lib/_selection_aide.py
  class Selector (line 25) | class Selector:
    method __init__ (line 30) | def __init__(
    method set_params (line 39) | def set_params(self, params: Tuple[Any, ...]) -> None:
    method iterate_params (line 43) | def iterate_params(self) -> Tuple[Tuple[Any, ...], List[Any]]:
    method handle_numeric (line 53) | def handle_numeric(report: pd.DataFrame, as_numeric: bool) -> None:
    method get_table_size (line 58) | def get_table_size(self, as_numeric: bool = False) -> pd.DataFrame:
    method get_table_effect (line 68) | def get_table_effect(self, as_numeric: bool = False) -> pd.DataFrame:

FILE: ambrosia/tools/_lib/_tools_aide.py
  function __helper_generate_bootstrap_samples (line 26) | def __helper_generate_bootstrap_samples(
  function __helper_inject_effect (line 42) | def __helper_inject_effect(
  function __helper_get_power_for_bootstraped (line 66) | def __helper_get_power_for_bootstraped(
  function estimate_power (line 100) | def estimate_power(power_function: Callable, **kwargs_power) -> float:
  function helper_bin_search_upper_bound_size (line 110) | def helper_bin_search_upper_bound_size(
  function helper_bin_searh_upper_bound_effect (line 131) | def helper_bin_searh_upper_bound_effect(power_function: Callable, power_...
  function helper_binary_search_optimal_effect (line 144) | def helper_binary_search_optimal_effect(
  function helper_binary_search_effect_with_injection (line 175) | def helper_binary_search_effect_with_injection(
  function helper_binary_search_optimal_size (line 199) | def helper_binary_search_optimal_size(
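
These private helpers binary-search a group size (or effect) against an estimated power. A generic, self-contained illustration of that technique follows; analytic_power here is a hypothetical stand-in for the library's bootstrap-based power estimate, not its code.

    # Binary search for the minimal per-group size reaching a desired power.
    from scipy import stats

    def analytic_power(n: int, effect: float = 0.2, alpha: float = 0.05) -> float:
        """Two-sample normal-approximation power for a standardized effect."""
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        return 1 - stats.norm.cdf(z_alpha - effect * (n / 2) ** 0.5)

    def search_minimal_size(power_function, desired_power=0.8, lo=2, hi=1 << 20):
        # power is monotone in n, so bisection finds the smallest feasible n
        while lo < hi:
            mid = (lo + hi) // 2
            if power_function(mid) >= desired_power:
                hi = mid
            else:
                lo = mid + 1
        return lo

    print(search_minimal_size(analytic_power))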

FILE: ambrosia/tools/ab_abstract_component.py
  class ABMetaClass (line 33) | class ABMetaClass(ABCMeta, YAMLObjectMetaclass):
  class ABToolAbstract (line 40) | class ABToolAbstract(ABC):
    method run (line 47) | def run(self):
    method _prepare_arguments (line 53) | def _prepare_arguments(_args: _PrepareArgumentsType) -> types._UsageAr...
  class AbstractFittableTransformer (line 84) | class AbstractFittableTransformer(ABC):
    method __init__ (line 94) | def __init__(self):
    method _check_fitted (line 97) | def _check_fitted(self) -> None:
    method _check_cols (line 101) | def _check_cols(self, dataframe: pd.DataFrame, columns: types.ColumnNa...
    method get_params_dict (line 107) | def get_params_dict(self) -> Dict:
    method load_params_dict (line 113) | def load_params_dict(self, params: Dict) -> None:
    method fit (line 124) | def fit(self):
    method transform (line 130) | def transform(self):
    method fit_transform (line 136) | def fit_transform(self):
    method store_params (line 141) | def store_params(self, store_path: Path) -> None:
    method load_params (line 151) | def load_params(self, load_path: Path) -> None:
  class AbstractVarianceReducer (line 163) | class AbstractVarianceReducer(AbstractFittableTransformer):
    method __call__ (line 191) | def __call__(self, y: np.ndarray, X: np.ndarray) -> np.ndarray:  # pyl...
    method __init__ (line 201) | def __init__(self, verbose: bool = True) -> None:
    method _return_result (line 206) | def _return_result(
    method _verbose (line 220) | def _verbose(self, old_variance: float, new_variance: float) -> None:
  function choose_on_table (line 229) | def choose_on_table(alternatives: List[Any], dataframe) -> Any:
  class DataframeHandler (line 240) | class DataframeHandler:
    method _handle_cases (line 242) | def _handle_cases(__func_pandas: Callable, __func_spark: Callable, *ar...
    method _handle_on_table (line 251) | def _handle_on_table(
  class SimpleDesigner (line 262) | class SimpleDesigner(ABC, DataframeHandler):
    method size_design (line 270) | def size_design(self, **kwargs) -> pd.DataFrame:
    method effect_design (line 274) | def effect_design(self, **kwargs) -> pd.DataFrame:
    method power_design (line 278) | def power_design(self, **kwargs) -> pd.DataFrame:
  class EmptyStratValue (line 282) | class EmptyStratValue(Enum):
  class StratificationUtil (line 286) | class StratificationUtil(ABC):
    method __init__ (line 292) | def __init__(self):
    method fit (line 296) | def fit(self, dataframe: types.SparkOrPandas, columns) -> None:
    method strat_sizes (line 300) | def strat_sizes(self) -> Dict[Any, int]:
    method is_trained (line 303) | def is_trained(self) -> bool:
    method empty_strat (line 309) | def empty_strat(self) -> bool:
    method _check_fit (line 315) | def _check_fit(self) -> None:
    method groups (line 322) | def groups(self):
    method size (line 332) | def size(self) -> int:
    method get_group_sizes (line 346) | def get_group_sizes(self, group_size: int) -> Dict[Any, int]:
  class StatCriterion (line 370) | class StatCriterion(ABC):
    method calculate_pvalue (line 376) | def calculate_pvalue(self, group_a: Iterable[float], group_b: Iterable...
  class ABStatCriterion (line 380) | class ABStatCriterion(StatCriterion):
    method _send_type_error_msg (line 388) | def _send_type_error_msg(cls):
    method calculate_effect (line 393) | def calculate_effect(self, group_a: Iterable[float], group_b: Iterable...
    method calculate_conf_interval (line 397) | def calculate_conf_interval(
    method _make_ci (line 402) | def _make_ci(self, left_ci: np.ndarray, right_ci: np.ndarray, alternat...
    method get_results (line 407) | def get_results(
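
The ABStatCriterion listing suggests a three-method surface (pvalue, effect, confidence interval). A standalone class mirroring that surface is sketched below; it is an illustration of the interface, not the actual abstract contract, and the B-minus-A effect convention is an assumption.

    # Standalone illustration of the criterion interface; the real base-class
    # contract may differ.
    import numpy as np
    from scipy import stats

    class WelchTtestCriterion:
        def calculate_pvalue(self, group_a, group_b) -> float:
            return stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

        def calculate_effect(self, group_a, group_b) -> float:
            # assumed convention: treatment (B) minus control (A)
            return float(np.mean(group_b) - np.mean(group_a))

        def calculate_conf_interval(self, group_a, group_b, alpha=0.05):
            a = np.asarray(group_a, float)
            b = np.asarray(group_b, float)
            se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
            # Welch-Satterthwaite degrees of freedom
            df = se**4 / (
                (a.var(ddof=1) / a.size) ** 2 / (a.size - 1)
                + (b.var(ddof=1) / b.size) ** 2 / (b.size - 1)
            )
            q = stats.t.ppf(1 - alpha / 2, df)
            diff = b.mean() - a.mean()
            return diff - q * se, diff + q * se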

FILE: ambrosia/tools/back_tools.py
  function create_seed_sequence (line 12) | def create_seed_sequence(length: int, entropy: Optional[Union[int, Itera...
  function tqdm_joblib (line 36) | def tqdm_joblib(tqdm_object):
  function handle_nested_multiprocessing (line 58) | def handle_nested_multiprocessing(
  function wrap_cols (line 81) | def wrap_cols(cols: types.ColumnNamesType) -> types.ColumnNamesType:
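
create_seed_sequence(length, entropy) presumably derives independent RNG seeds for parallel workers. NumPy's SeedSequence is the standard tool for this; the sketch below assumes the helper wraps it.

    # Spawning independent per-worker seeds with NumPy's SeedSequence
    # (assumption: this is the pattern create_seed_sequence follows).
    import numpy as np

    entropy = 2023
    seeds = np.random.SeedSequence(entropy).generate_state(4)  # four workers
    rngs = [np.random.default_rng(s) for s in seeds]
    print([rng.integers(0, 100) for rng in rngs])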

FILE: ambrosia/tools/bin_intervals.py
  class BinomTwoSampleCI (line 34) | class BinomTwoSampleCI(ABC):
    method __init__ (line 77) | def __init__(self):
    method __wald_ci (line 81) | def __wald_ci(
    method __yule_ci (line 100) | def __yule_ci(
    method __bayes_conjugate_beta (line 128) | def __bayes_conjugate_beta(
    method __square_eq_newcombe (line 155) | def __square_eq_newcombe(p_est: float, m: int, quantile: float) -> typ...
    method __newcombe_ci (line 170) | def __newcombe_ci(
    method __recentered_ci (line 202) | def __recentered_ci(
    method calculate_pvalue (line 221) | def calculate_pvalue(
    method confidence_interval (line 281) | def confidence_interval(
  function get_table_power_on_size_and_conversions (line 386) | def get_table_power_on_size_and_conversions(
  function get_table_power_on_size_and_delta (line 452) | def get_table_power_on_size_and_delta(
  function iterate_for_sample_size (line 536) | def iterate_for_sample_size(
  function get_table_sample_size_on_effect (line 575) | def get_table_sample_size_on_effect(
  function iterate_for_delta (line 635) | def iterate_for_delta(
  function get_table_effect_on_sample_size (line 675) | def get_table_effect_on_sample_size(
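
The BinomTwoSampleCI listing enumerates several two-sample binomial intervals (Wald, Yule, Bayes, Newcombe, recentered). As a reference point, here is the textbook Wald interval for the difference of two proportions; it illustrates the statistic, not the library's confidence_interval call.

    # Textbook Wald confidence interval for p_b - p_a.
    import numpy as np
    from scipy import stats

    def wald_diff_ci(success_a, n_a, success_b, n_b, alpha=0.05):
        p_a, p_b = success_a / n_a, success_b / n_b
        se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
        z = stats.norm.ppf(1 - alpha / 2)
        diff = p_b - p_a
        return diff - z * se, diff + z * se

    print(wald_diff_ci(420, 1000, 455, 1000))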

FILE: ambrosia/tools/configs.py
  class AmbrosiaEnum (line 19) | class AmbrosiaEnum(enum.Enum):
    method _check_for_existing_members (line 24) | def _check_for_existing_members():
    method check_value_in_enum (line 32) | def check_value_in_enum(cls, value: tp.Any) -> bool:
    method get_all_enum_values (line 36) | def get_all_enum_values(cls) -> tp.List[str]:
    method raise_if_value_incorrect_enum (line 40) | def raise_if_value_incorrect_enum(cls, value: tp.Any) -> None:
  class Alternatives (line 46) | class Alternatives(AmbrosiaEnum):
  class Effects (line 52) | class Effects(AmbrosiaEnum):

FILE: ambrosia/tools/decorators.py
  function filter_kwargs (line 5) | def filter_kwargs(func):
  function tqdm_parallel_decorator (line 18) | def tqdm_parallel_decorator(func):
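
The file preview shows decorators.py importing inspect and functools.wraps. A hedged reconstruction of what a kwargs-filtering decorator typically does with those imports is sketched below; the actual body may differ.

    # Hedged reconstruction of a kwargs-filtering decorator; drops keyword
    # arguments the wrapped function does not accept.
    import inspect
    from functools import wraps

    def filter_kwargs(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            accepted = inspect.signature(func).parameters
            return func(*args, **{k: v for k, v in kwargs.items() if k in accepted})
        return wrapper

    @filter_kwargs
    def design(effect: float, alpha: float = 0.05) -> float:
        return effect * alpha

    print(design(effect=1.1, alpha=0.05, unused_option=True))  # extra kwarg dropped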

FILE: ambrosia/tools/empirical_tools.py
  function inject_effect (line 35) | def inject_effect(
  function estim_stat_criterion_power (line 94) | def estim_stat_criterion_power(
  function get_bs_stat (line 137) | def get_bs_stat(sample: np.ndarray, stat: str = "mean", N: int = 1000, r...
  function get_bs_sample_stat (line 169) | def get_bs_sample_stat(
  function make_bootstrap (line 227) | def make_bootstrap(
  function eval_error (line 291) | def eval_error(
  class BootstrapStats (line 359) | class BootstrapStats:
    method __init__ (line 381) | def __init__(self, bootstrap_size: int = 10000, metric: Union[str, Cal...
    method __handle_str_metric (line 406) | def __handle_str_metric(self, bootstrap_a: np.ndarray, bootstrap_b: np...
    method __handle_std_value (line 417) | def __handle_std_value(self) -> float:
    method __handle_sampling (line 428) | def __handle_sampling(
    method fit (line 445) | def fit(self, group_a: Iterable[float], group_b: Iterable[float], rand...
    method min_of_distrbution (line 471) | def min_of_distrbution(self) -> float:
    method max_of_distribution (line 479) | def max_of_distribution(self) -> float:
    method confidence_interval (line 488) | def confidence_interval(
    method pvalue_criterion (line 526) | def pvalue_criterion(self, alternative: str = "two-sided") -> float:
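
BootstrapStats fits percentile-style intervals over a resampled statistic. A minimal, self-contained percentile bootstrap for a difference of means follows; it illustrates the technique, not the class's API.

    # Percentile bootstrap for mean(B) - mean(A).
    import numpy as np

    rng = np.random.default_rng(0)
    group_a = rng.normal(100, 10, size=500)
    group_b = rng.normal(102, 10, size=500)

    n_boot = 5_000
    idx_a = rng.integers(0, group_a.size, size=(n_boot, group_a.size))
    idx_b = rng.integers(0, group_b.size, size=(n_boot, group_b.size))
    diffs = group_b[idx_b].mean(axis=1) - group_a[idx_a].mean(axis=1)

    low, high = np.percentile(diffs, [2.5, 97.5])
    print(f"95% CI for mean(B) - mean(A): ({low:.2f}, {high:.2f})")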

FILE: ambrosia/tools/import_tools.py
  class NotInstalledPackage (line 24) | class NotInstalledPackage(Exception):
    method __init_subclass__ (line 27) | def __init_subclass__(cls, default_message: str) -> None:
    method __init__ (line 31) | def __init__(self, *args, **kwargs):
  class PysparkNotInstalled (line 38) | class PysparkNotInstalled(NotInstalledPackage, default_message="Install ...
  function get_installed_package_names (line 42) | def get_installed_package_names() -> tp.List[str]:
  function check_package (line 46) | def check_package(package_name: str) -> bool:
  function spark_installed (line 50) | def spark_installed() -> bool:
  function check_spark_installed (line 54) | def check_spark_installed() -> tp.NoReturn:
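
check_package/spark_installed test for an optional dependency. The standard importlib pattern is sketched below; the actual helpers may instead enumerate installed distributions, as get_installed_package_names suggests.

    # Checking for an optional dependency with importlib (generic pattern).
    import importlib.util

    def check_package(package_name: str) -> bool:
        return importlib.util.find_spec(package_name) is not None

    print(check_package("pyspark"))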

FILE: ambrosia/tools/knn.py
  class NMTree (line 34) | class NMTree:
    method __init__ (line 39) | def __init__(self, points: np.ndarray, payload: np.ndarray, ef_search:...
    method query_batch (line 65) | def query_batch(

FILE: ambrosia/tools/log.py
  function info_log (line 27) | def info_log(message: str):
  class RobustLogger (line 32) | class RobustLogger:
    method verbose (line 39) | def verbose(prev_stats: Dict[str, float], new_stats: Dict[str, float],...
    method verbose_list (line 49) | def verbose_list(
    method __calculate_stats (line 62) | def __calculate_stats(values: np.ndarray) -> Dict[str, float]:
    method get_stats (line 71) | def get_stats(

FILE: ambrosia/tools/pvalue_tools.py
  function calculate_point_effect_by_delta_method (line 25) | def calculate_point_effect_by_delta_method(
  function calc_statistic_for_delta_method (line 51) | def calc_statistic_for_delta_method(
  function calculate_pvalue_by_delta_method (line 65) | def calculate_pvalue_by_delta_method(
  function check_alternative (line 123) | def check_alternative(alternative: str) -> None:
  function corrected_alpha (line 133) | def corrected_alpha(alpha: Union[float, np.ndarray], alternative: str) -...
  function choose_from_bounds (line 143) | def choose_from_bounds(
  function calculate_intervals_by_delta_method (line 162) | def calculate_intervals_by_delta_method(
  function calculate_pvalue_by_interval (line 222) | def calculate_pvalue_by_interval(
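
These helpers apply the delta method to ratio metrics. The standard first-order approximation is Var(mx/my) ~ Vx/my^2 - 2*mx*Cxy/my^3 + mx^2*Vy/my^4, where mx, my are sample means and Vx, Vy, Cxy are variances/covariance of those means. A worked numeric illustration of that formula follows (the library's exact parameterization is not shown here).

    # First-order delta-method variance for a ratio metric mean(x)/mean(y).
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.poisson(5, size=2000).astype(float)      # e.g. per-user revenue
    y = rng.poisson(3, size=2000).astype(float) + 1  # e.g. per-user sessions

    n = x.size
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1) / n, y.var(ddof=1) / n    # variances of the means
    cxy = np.cov(x, y, ddof=1)[0, 1] / n             # covariance of the means

    ratio = mx / my
    var_ratio = vx / my**2 - 2 * mx * cxy / my**3 + mx**2 * vy / my**4
    print(ratio, np.sqrt(var_ratio))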

FILE: ambrosia/tools/split_tools.py
  function check_ids_duplicates (line 36) | def check_ids_duplicates(
  function get_integer_salt (line 52) | def get_integer_salt(salt: Optional[str]) -> int:
  function encode_id (line 67) | def encode_id(enc_id: Any, salt: str, hash_function: Union[str, Callable...
  function get_simple_split (line 112) | def get_simple_split(
  function get_hash_split (line 141) | def get_hash_split(
  function get_metric_split (line 211) | def get_metric_split(
  function get_dim_decrease_split (line 259) | def get_dim_decrease_split(
  function make_labels_for_groups (line 299) | def make_labels_for_groups(groups_number: int) -> List[str]:
  function add_to_required_size (line 323) | def add_to_required_size(
  function get_split (line 349) | def get_split(
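
encode_id/get_hash_split salt and hash unit ids so that group assignment is deterministic and reproducible. A generic illustration of the technique follows; the md5 choice and label scheme here are assumptions, not the library's hash_function.

    # Generic salted-hash group assignment.
    import hashlib

    def assign_group(unit_id, salt: str, groups_number: int = 2) -> str:
        digest = hashlib.md5(f"{unit_id}{salt}".encode()).hexdigest()
        bucket = int(digest, 16) % groups_number
        return chr(ord("A") + bucket)  # 'A', 'B', ...

    # same id + salt always lands in the same group
    print([assign_group(i, salt="exp_42") for i in range(6)])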

FILE: ambrosia/tools/stat_criteria.py
  function get_results_dict (line 27) | def get_results_dict(alpha: float, pvalue: float, effect: float, conf_in...
  function get_calc_effect_ttest (line 34) | def get_calc_effect_ttest(group_a: np.ndarray, group_b: np.ndarray, effe...
  class TtestIndCriterion (line 46) | class TtestIndCriterion(ABStatCriterion):
    method calculate_pvalue (line 53) | def calculate_pvalue(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method calculate_effect (line 62) | def calculate_effect(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method _build_intervals_absolute (line 65) | def _build_intervals_absolute(
    method calculate_conf_interval (line 84) | def calculate_conf_interval(
    method get_results (line 103) | def get_results(
  class TtestRelCriterion (line 121) | class TtestRelCriterion(ABStatCriterion):
    method calculate_pvalue (line 128) | def calculate_pvalue(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method calculate_effect (line 137) | def calculate_effect(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method _build_intervals_absolute (line 140) | def _build_intervals_absolute(
    method calculate_conf_interval (line 161) | def calculate_conf_interval(
    method get_results (line 182) | def get_results(
  class MannWhitneyCriterion (line 202) | class MannWhitneyCriterion(ABStatCriterion):
    method calculate_pvalue (line 209) | def calculate_pvalue(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method calculate_effect (line 215) | def calculate_effect(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method calculate_conf_interval (line 221) | def calculate_conf_interval(
  class WilcoxonCriterion (line 237) | class WilcoxonCriterion(ABStatCriterion):
    method calculate_pvalue (line 244) | def calculate_pvalue(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method calculate_effect (line 250) | def calculate_effect(self, group_a: np.ndarray, group_b: np.ndarray, e...
    method calculate_conf_interval (line 256) | def calculate_conf_interval(
  class ShapiroCriterion (line 270) | class ShapiroCriterion(StatCriterion):
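
The criteria above wrap standard scipy.stats tests. A quick reference for the underlying calls (the wrappers add effects and confidence intervals on top):

    # scipy.stats analogues of the criteria listed above.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    a = rng.normal(0.0, 1.0, 200)
    b = a + rng.normal(0.1, 1.0, 200)  # paired structure with a small shift

    print(stats.ttest_ind(a, b).pvalue)     # TtestIndCriterion analogue
    print(stats.ttest_rel(a, b).pvalue)     # TtestRelCriterion analogue
    print(stats.mannwhitneyu(a, b).pvalue)  # MannWhitneyCriterion analogue
    print(stats.wilcoxon(a, b).pvalue)      # WilcoxonCriterion analogue
    print(stats.shapiro(a).pvalue)          # ShapiroCriterion analogue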

FILE: ambrosia/tools/stratification.py
  class Stratification (line 23) | class Stratification(ab_abstract.StratificationUtil):
    method __init__ (line 61) | def __init__(self, threshold: Optional[int] = None, verbose: bool = Fa...
    method fit (line 76) | def fit(self, dataframe: pd.DataFrame, columns: Optional[List[Any]] = ...
    method strat_sizes (line 102) | def strat_sizes(self) -> Dict[Any, int]:
    method __corresponding_strat (line 113) | def __corresponding_strat(test_id: Iterable, strat_id: Iterable) -> List:
    method get_test_inds (line 119) | def get_test_inds(self, test_id: Iterable, id_column: Any = None) -> D...
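
get_group_sizes(group_size) allocates a target group size across strata. A minimal sketch of proportional allocation follows; the exact rounding policy is an assumption.

    # Proportional allocation of a group size across strata (generic sketch).
    import pandas as pd

    df = pd.DataFrame({"gender": ["m", "m", "f", "f", "f", "f"], "id": range(6)})
    strat_sizes = df.groupby("gender").size()  # {'f': 4, 'm': 2}
    group_size = 3
    alloc = (strat_sizes / strat_sizes.sum() * group_size).round().astype(int)
    print(alloc.to_dict())  # {'f': 2, 'm': 1}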

FILE: ambrosia/tools/theoretical_tools.py
  function get_stats (line 37) | def get_stats(values: Iterable[float], ddof: int = 1) -> Tuple[float, fl...
  function get_table_stats (line 44) | def get_table_stats(data: pd.DataFrame, column: types.ColumnNameType) ->...
  function check_encode_alternative (line 51) | def check_encode_alternative(alternative: str) -> str:
  function unbiased_to_sufficient (line 63) | def unbiased_to_sufficient(std: float, size: int) -> float:
  function check_target_type (line 71) | def check_target_type(
  function stabilize_effect (line 87) | def stabilize_effect(
  function destabilize_effect (line 106) | def destabilize_effect(
  function get_sample_size (line 125) | def get_sample_size(
  function get_minimal_determinable_effect (line 188) | def get_minimal_determinable_effect(
  function get_power (line 249) | def get_power(
  function get_table_sample_size (line 311) | def get_table_sample_size(
  function design_groups_size (line 390) | def design_groups_size(
  function get_minimal_effects_table (line 443) | def get_minimal_effects_table(
  function design_effect (line 528) | def design_effect(
  function get_power_table (line 593) | def get_power_table(
  function design_power (line 679) | def design_power(
  function get_ttest_info_from_stats (line 744) | def get_ttest_info_from_stats(
  function get_ttest_info (line 757) | def get_ttest_info(group_a: np.ndarray, group_b: np.ndarray, alpha: np.n...
  function apply_delta_method_by_stats (line 782) | def apply_delta_method_by_stats(
  function apply_delta_method (line 841) | def apply_delta_method(
  function ttest_1samp_from_stats (line 894) | def ttest_1samp_from_stats(
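
get_sample_size/design_groups_size implement the classical normal-approximation design. For equal groups and a common std, the textbook per-group size is n = (z_{1-alpha/2} + z_{1-beta})^2 * 2 * std^2 / delta^2, worked below; this is the standard formula, not verbatim library internals.

    # Textbook per-group sample size for detecting an absolute lift `delta`.
    import math
    from scipy import stats

    def sample_size(std: float, delta: float, alpha=0.05, beta=0.2) -> int:
        z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(1 - beta)
        return math.ceil(z**2 * 2 * std**2 / delta**2)

    # detect a +2 lift on a metric with std 10 at alpha=5%, power 80%
    print(sample_size(std=10, delta=2))  # -> 393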

FILE: ambrosia/tools/tools.py
  function bootstrap_over_statistical_population (line 34) | def bootstrap_over_statistical_population(
  function get_errors (line 114) | def get_errors(
  function get_empirical_table_power (line 206) | def get_empirical_table_power(
  function optimize_group_size (line 312) | def optimize_group_size(
  function calculate_group_size (line 464) | def calculate_group_size(
  function get_group_sizes (line 550) | def get_group_sizes(
  function get_empirical_table_sample_size (line 642) | def get_empirical_table_sample_size(
  function optimize_mde (line 739) | def optimize_mde(
  function calculate_empirical_mde (line 876) | def calculate_empirical_mde(
  function get_empirical_mde (line 962) | def get_empirical_mde(
  function get_empirical_mde_table (line 1054) | def get_empirical_mde_table(
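
The empirical table builders estimate power and MDE by simulation rather than by formula. The core Monte-Carlo idea is sketched below as a generic, self-contained example (not the library's bootstrap pipeline); with the design computed above it recovers roughly the nominal 80% power.

    # Monte-Carlo power estimation: simulate under the alternative and count
    # rejections at level alpha.
    import numpy as np
    from scipy import stats

    def empirical_power(n, effect, std=1.0, alpha=0.05, n_sim=2000, seed=0):
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(n_sim):
            a = rng.normal(0.0, std, n)
            b = rng.normal(effect, std, n)
            rejections += stats.ttest_ind(a, b).pvalue < alpha
        return rejections / n_sim

    print(empirical_power(n=393, effect=0.2))  # ~0.8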

FILE: ambrosia/tools/type_checks.py
  function check_type_decorator (line 27) | def check_type_decorator(type_checker=lambda set_value: set_value):
  function none_check_decorator (line 47) | def none_check_decorator(function):
  function check_type_dataframe (line 65) | def check_type_dataframe(dataframe: types.PassedDataType) -> types.Passe...
  function check_type_id_column (line 80) | def check_type_id_column(id_column: types.ColumnNameType) -> types.Colum...
  function check_type_id_columns (line 88) | def check_type_id_columns(id_columns: types.ColumnNamesType) -> types.Co...
  function check_type_group_size (line 98) | def check_type_group_size(groups_size: int) -> int:
  function check_type_test_group_ids (line 108) | def check_type_test_group_ids(test_group_ids: types.IndicesType) -> type...
  function check_type_fit_columns (line 116) | def check_type_fit_columns(fit_columns: types.ColumnNamesType) -> types....
  function check_type_strat_columns (line 126) | def check_type_strat_columns(strat_columns: types.ColumnNamesType) -> ty...
  function check_type_salt (line 136) | def check_type_salt(salt: str) -> str:
  function check_split_method_value (line 144) | def check_split_method_value(split_method: str) -> str:
  function check_metric_method_value (line 155) | def check_metric_method_value(method_metric: str) -> None:
  function check_norm_value (line 166) | def check_norm_value(norm: str) -> None:

FILE: ambrosia/types.py
  class PySparkStub (line 28) | class PySparkStub:

FILE: tests/conftest.py
  function local_spark_session (line 16) | def local_spark_session() -> None:
  function simple_binary_retention_table (line 27) | def simple_binary_retention_table() -> pd.DataFrame:
  function ltv_and_retention_dataset (line 37) | def ltv_and_retention_dataset() -> pd.DataFrame:
  function designer_simple_table (line 49) | def designer_simple_table(simple_binary_retention_table):
  function designer_ltv (line 61) | def designer_ltv(ltv_and_retention_dataset):
  function designer_ltv_spark (line 73) | def designer_ltv_spark(local_spark_session, ltv_and_retention_dataset):
  function results_ltv_retention_conversions (line 84) | def results_ltv_retention_conversions() -> pd.DataFrame:
  function tester_spark_ltv_ret (line 94) | def tester_spark_ltv_ret(local_spark_session, results_ltv_retention_conv...
  function tester_on_ltv_retention (line 105) | def tester_on_ltv_retention(results_ltv_retention_conversions):
  function stratification_table (line 119) | def stratification_table() -> pd.DataFrame:
  function stratificator (line 129) | def stratificator(stratification_table):
  function answer_ids_strat (line 140) | def answer_ids_strat(stratificator) -> Dict:
  function id_for_b_strat (line 155) | def id_for_b_strat(stratification_table) -> np.ndarray:
  function answer_ids_strat_column (line 164) | def answer_ids_strat_column(stratificator) -> Dict:
  function data_split (line 179) | def data_split() -> pd.DataFrame:
  function data_index_split (line 189) | def data_index_split() -> pd.DataFrame:
  function splitter_ltv_spark (line 202) | def splitter_ltv_spark(local_spark_session, ltv_and_retention_dataset):
  function data_variance_lin (line 213) | def data_variance_lin() -> pd.DataFrame:
  function data_nonlin_var (line 223) | def data_nonlin_var() -> pd.DataFrame:
  function data_for_agg (line 233) | def data_for_agg() -> pd.DataFrame:
  function robust_moments (line 244) | def robust_moments() -> pd.DataFrame:

FILE: tests/test_aggregate.py
  function test_inst (line 9) | def test_inst():
  function test_aggregation_by_agg_params (line 14) | def test_aggregation_by_agg_params(data_for_agg):
  function test_agg_params_priority (line 34) | def test_agg_params_priority(data_for_agg):
  function test_aggregation_by_week (line 58) | def test_aggregation_by_week(data_for_agg):
  function test_aggregate_load_store (line 73) | def test_aggregate_load_store(data_for_agg):

FILE: tests/test_cuped.py
  function test_cuped_instance (line 13) | def test_cuped_instance():
  function test_multicuped_instance (line 21) | def test_multicuped_instance():
  function test_cuped_decrease_var (line 30) | def test_cuped_decrease_var(covariate_column, data_variance_lin):
  function test_multi_cuped (line 55) | def test_multi_cuped(columns, data_variance_lin):
  function test_equal_multi_simple (line 70) | def test_equal_multi_simple(column, data_variance_lin):
  function test_load_store_params (line 87) | def test_load_store_params(Model, factor, data_variance_lin):

FILE: tests/test_designer.py
  function test_instance (line 16) | def test_instance():
  function test_constructors (line 24) | def test_constructors(ltv_and_retention_dataset):
  function test_corret_type (line 42) | def test_corret_type(designer_simple_table, designer_ltv):
  function test_run_theory (line 65) | def test_run_theory(param_to_design, expected_value, designer):
  function test_as_numeric (line 84) | def test_as_numeric(param_to_design, method, size, ltv_and_retention_dat...
  function test_every_type_run (line 106) | def test_every_type_run(to_design, method, effects, sizes, designer_ltv):
  function test_binary (line 120) | def test_binary(to_design, effects, sizes, designer_ltv, designer_simple...
  function test_design_function (line 130) | def test_design_function(ltv_and_retention_dataset, designer_ltv):
  function test_design_binary_function (line 150) | def test_design_binary_function(to_design, effects, sizes, beta, method,...
  function test_run_theory_spark (line 176) | def test_run_theory_spark(param_to_design, expected_value, designer):
  function test_empiric_spark (line 195) | def test_empiric_spark(param_to_design, designer):
  function test_not_available_dataframe (line 206) | def test_not_available_dataframe():
  function test_more_alpha_less_size (line 222) | def test_more_alpha_less_size(designer_ltv, method, metric):
  function test_designer_load_from_config (line 239) | def test_designer_load_from_config(ltv_and_retention_dataset):
  function test_alternative_parameter (line 265) | def test_alternative_parameter(to_design, method, effects, sizes, design...
  function test_groups_ratio_parameter (line 302) | def test_groups_ratio_parameter(to_design, method, effects, sizes, desig...

FILE: tests/test_ml_variance_reducer.py
  function test_instance (line 16) | def test_instance():
  function test_ml_reduce_variance (line 35) | def test_ml_reduce_variance(data_nonlin_var, columns):
  function test_store_load_catboost (line 47) | def test_store_load_catboost(data_nonlin_var):

FILE: tests/test_preprocessor.py
  function test_init (line 13) | def test_init(data_nonlin_var):
  function test_cuped_sequential (line 23) | def test_cuped_sequential(data_nonlin_var):
  function test_full_sequential (line 38) | def test_full_sequential(data_nonlin_var):
  function test_load_store_methods (line 55) | def test_load_store_methods(data_nonlin_var):
  function test_transform_from_config (line 77) | def test_transform_from_config(data_nonlin_var):
  function test_store_load_config (line 101) | def test_store_load_config(data_for_agg):
  function test_linearize_basic (line 124) | def test_linearize_basic(data_nonlin_var):
  function test_linearize_formula (line 135) | def test_linearize_formula(data_nonlin_var):
  function test_linearize_in_chain (line 149) | def test_linearize_in_chain(data_nonlin_var):
  function test_linearize_load_store (line 163) | def test_linearize_load_store(data_nonlin_var):
  function test_linearize_default_name (line 182) | def test_linearize_default_name(data_nonlin_var):

FILE: tests/test_robust.py
  function test_robust_constructor (line 16) | def test_robust_constructor():
  function test_robust (line 34) | def test_robust(tail, column_names, alpha, transf_name, data_nonlin_var,...
  function test_robust_load_store (line 49) | def test_robust_load_store(data_nonlin_var, robust_moments):
  function test_iqr_constructor (line 68) | def test_iqr_constructor():
  function test_iqr (line 84) | def test_iqr(column_names, transf_name, data_nonlin_var, robust_moments):
  function test_iqr_load_store (line 99) | def test_iqr_load_store(data_nonlin_var, robust_moments):

FILE: tests/test_splitter.py
  function test_instance (line 15) | def test_instance():
  function test_constructors (line 23) | def test_constructors(results_ltv_retention_conversions):
  function test_setter_method (line 35) | def test_setter_method():
  function test_all_inputs_metric (line 52) | def test_all_inputs_metric(strat_columns, fit_columns, groups_size, data...
  function test_split_hash_stable (line 71) | def test_split_hash_stable(strat_columns, groups_size, salt, id_column, ...
  function test_many_groups_split (line 94) | def test_many_groups_split(groups_size, groups_number, method, strat_col...
  function test_index_metric (line 113) | def test_index_metric(groups_size, groups_number, data_index_split):
  function test_fixed_b_group (line 130) | def test_fixed_b_group(groups_size, method, id_column, strat_columns, da...
  function test_split_function (line 155) | def test_split_function(strat_columns, groups_size, salt, id_column, fit...
  function test_full_split (line 178) | def test_full_split(ltv_and_retention_dataset, factor, method, strat_col...
  function test_spark_split (line 195) | def test_spark_split(method, groups_number, strat_columns, splitter_ltv_...
  function test_full_split_spark (line 209) | def test_full_split_spark(ltv_and_retention_dataset, splitter_ltv_spark,...
  function test_splitter_load_from_config (line 222) | def test_splitter_load_from_config(ltv_and_retention_dataset):
  function test_duplication_exception (line 244) | def test_duplication_exception(id_column):
  function test_duplication_exception_spark (line 263) | def test_duplication_exception_spark(local_spark_session, ltv_and_retent...

FILE: tests/test_stratification.py
  function test_instance (line 10) | def test_instance():
  function test_fit (line 19) | def test_fit(stratification_table):
  function test_strat_sizes (line 34) | def test_strat_sizes(stratificator):
  function test_test_ids (line 51) | def test_test_ids(ids, column, answer, stratificator):
  function test_groups_size (line 60) | def test_groups_size(group_size, stratificator):

FILE: tests/test_tester.py
  function check_eq (line 11) | def check_eq(a: float, b: float, eps: float = 1e-5) -> bool:
  function check_eq_int (line 19) | def check_eq_int(i1, i2) -> bool:
  function test_instance (line 24) | def test_instance():
  function test_constructors (line 32) | def test_constructors(results_ltv_retention_conversions):
  function test_correct_type (line 48) | def test_correct_type(effect_type, as_table, tester_on_ltv_retention):
  function test_every_type_run (line 59) | def test_every_type_run(effect_type, method, tester_on_ltv_retention):
  function check_pvalue_for_interval (line 68) | def check_pvalue_for_interval(interval: Tuple, pvalue: float, alpha: flo...
  function test_coinf_interval_absolute (line 85) | def test_coinf_interval_absolute(method, alpha, metrics, criterion, test...
  function test_coinf_interval_relative (line 102) | def test_coinf_interval_relative(method, alpha, metrics, alternative, te...
  function test_coinf_interval_bin_abs (line 124) | def test_coinf_interval_bin_abs(alpha, metrics, interval_type, alternati...
  function test_coinf_interval_bin_rel (line 148) | def test_coinf_interval_bin_rel(alpha, metrics, alternative, tester_on_l...
  function test_standalone_test_function (line 173) | def test_standalone_test_function(
  function test_criteria_ttest_different (line 199) | def test_criteria_ttest_different(effect_type):
  function test_kwargs_passing_theory (line 217) | def test_kwargs_passing_theory(criterion, metrics, alternative, tester_o...
  function test_kwargs_passing_empiric (line 229) | def test_kwargs_passing_empiric(metrics, alternative, tester_on_ltv_rete...
  function test_kwargs_passing_binary (line 251) | def test_kwargs_passing_binary(interval_type, tester_on_ltv_retention):
  function get_ci_pvalue (line 264) | def get_ci_pvalue(tester_on_ltv_retention, alternative: str, idx: int = ...
  function calc_intervals_pvalue (line 274) | def calc_intervals_pvalue(tester_on_ltv_retention, idx: int = 0, **run_k...
  function check_bound_intervals (line 284) | def check_bound_intervals(int_center, int_less, int_gr, left_bound: floa...
  function test_alternative_change_binary (line 296) | def test_alternative_change_binary(effect_type, interval_type, tester_on...
  function test_alternative_change_th (line 313) | def test_alternative_change_th(effect_type, criterion, tester_on_ltv_ret...
  function test_spark_tester (line 334) | def test_spark_tester(tester_spark_ltv_ret, tester_on_ltv_retention, alt...
  function test_paired_bootstrap (line 351) | def test_paired_bootstrap(effect_type, alternative):
  function test_metric_func_constructor (line 390) | def test_metric_func_constructor(results_ltv_retention_conversions):
  function test_metric_func_run (line 409) | def test_metric_func_run(method, results_ltv_retention_conversions):
  function test_metric_func_overrides_constructor (line 431) | def test_metric_func_overrides_constructor(results_ltv_retention_convers...
  function test_metric_func_bootstrap (line 448) | def test_metric_func_bootstrap(results_ltv_retention_conversions):

FILE: tests/test_transformers.py
  function test_boxcox_constructor (line 16) | def test_boxcox_constructor():
  function test_boxcox (line 32) | def test_boxcox(column_names, transf_name, data_nonlin_var, robust_momen...
  function test_boxcox_load_store (line 47) | def test_boxcox_load_store(data_nonlin_var, robust_moments):
  function test_boxcox_inverse (line 72) | def test_boxcox_inverse(column_names, data_nonlin_var):
  function test_log_constructor (line 84) | def test_log_constructor():
  function test_logarithm (line 100) | def test_logarithm(column_names, transf_name, data_nonlin_var, robust_mo...
  function test_logarithm_load_store (line 115) | def test_logarithm_load_store(data_nonlin_var, robust_moments):
  function test_logarithm_inverse (line 140) | def test_logarithm_inverse(column_names, data_nonlin_var):

Condensed preview: 154 files, each entry giving the file path, its character count, and a short snippet of its content (6,383K chars of structured content in total).
[
  {
    "path": ".editorconfig",
    "chars": 279,
    "preview": "root = true\n\n[*]\ncharset=utf-8\nend_of_line=lf\ninsert_final_newline=true\nindent_style=space\nindent_size=4\nmax_line_length"
  },
  {
    "path": ".github/workflows/publish.yaml",
    "chars": 683,
    "preview": "name: Publish\n\non:\n  release:\n    types:\n      - created\n\njobs:\n  publish:\n    runs-on: ubuntu-latest\n    steps:\n      -"
  },
  {
    "path": ".github/workflows/test.yaml",
    "chars": 1122,
    "preview": "name: Test\n\non:\n  push:\n    branches:\n      - \"**\"\n  pull_request:\n    branches:\n      - main\n      - dev\n\njobs:\n  lint:"
  },
  {
    "path": ".gitignore",
    "chars": 540,
    "preview": ".idea/\n.vscode/\n\n# Sphynx docs\ndocs/_build/\ndocs/build\ndocs/*.tar.gz\n\n# Data\ndata/\n\n# Virtualenv\nmars_env/\n.venv/\n\n# Jup"
  },
  {
    "path": ".pylintrc",
    "chars": 16429,
    "preview": "[MASTER]\n\n# A comma-separated list of package or module names from where C extensions may\n# be loaded. Extensions are lo"
  },
  {
    "path": ".readthedocs.yaml",
    "chars": 470,
    "preview": "version: 2\n\nbuild:\n  os: ubuntu-22.04\n  tools:\n    python: \"3.11\"\n  jobs:\n    post_install:\n      - pip install --no-cac"
  },
  {
    "path": "CHANGELOG.rst",
    "chars": 6012,
    "preview": "Release Notes\n=============\n\nVersion 0.5.1 (26.03.2026)\n---------------------------\n\n**New Features:**\n\n* Custom metric "
  },
  {
    "path": "CONTRIBUTING.rst",
    "chars": 3281,
    "preview": "Contributing Guide \n===================\n\n`Ambrosia` is an open source project and there are many ways to contribute, fro"
  },
  {
    "path": "LICENSE",
    "chars": 11419,
    "preview": "Copyright 2022 MTS (Mobile Telesystems).  All rights reserved.\n\n                                 Apache License\n        "
  },
  {
    "path": "Makefile",
    "chars": 2008,
    "preview": "VENV=.venv\n\nifeq (${OS},Windows_NT)\n\tBIN=${VENV}/Scripts\nelse\n\tBIN=${VENV}/bin\nendif\n\nexport PATH := $(BIN):$(PATH)\n\nFLA"
  },
  {
    "path": "README.rst",
    "chars": 4398,
    "preview": ".. shields start\n\nAmbrosia\n========\n\n|PyPI| |PyPI License| |ReadTheDocs| |Tests| |Coverage| |Black| |Python Versions| |T"
  },
  {
    "path": "SECURITY.rst",
    "chars": 919,
    "preview": "Security Policy\r\n===============\r\n\r\nSupported Python versions\r\n-------------------------\r\n\r\n3.7 or above\r\n\r\nProduct deve"
  },
  {
    "path": "ambrosia/VERSION",
    "chars": 6,
    "preview": "0.5.1\n"
  },
  {
    "path": "ambrosia/__init__.py",
    "chars": 1815,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/designer/__init__.py",
    "chars": 991,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/designer/designer.py",
    "chars": 34918,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/designer/handlers.py",
    "chars": 3806,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/__init__.py",
    "chars": 1247,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/aggregate.py",
    "chars": 9977,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/cuped.py",
    "chars": 18784,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/ml_var_reducer.py",
    "chars": 12442,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/preprocessor.py",
    "chars": 18037,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/robust.py",
    "chars": 16535,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/preprocessing/transformers.py",
    "chars": 16907,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/spark_tools/__init__.py",
    "chars": 749,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/spark_tools/empiric.py",
    "chars": 15736,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/spark_tools/split_tools.py",
    "chars": 6939,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/spark_tools/stat_criteria.py",
    "chars": 10930,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/spark_tools/stratification.py",
    "chars": 2228,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/spark_tools/theory.py",
    "chars": 6006,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/splitter/__init__.py",
    "chars": 766,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/splitter/handlers.py",
    "chars": 5393,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/splitter/splitter.py",
    "chars": 16966,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tester/__init__.py",
    "chars": 728,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tester/binary_result_evaluation.py",
    "chars": 4932,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tester/handlers.py",
    "chars": 4096,
    "preview": "import enum\nimport typing as tp\n\nimport numpy as np\nimport pandas as pd\n\nimport ambrosia.spark_tools.stat_criteria as sp"
  },
  {
    "path": "ambrosia/tester/tester.py",
    "chars": 29666,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/__init__.py",
    "chars": 860,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/_lib/__init__.py",
    "chars": 598,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/_lib/_bin_ci_aide.py",
    "chars": 5609,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/_lib/_bootstrap_tools.py",
    "chars": 2420,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/_lib/_selection_aide.py",
    "chars": 3071,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/_lib/_tools_aide.py",
    "chars": 7000,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/ab_abstract_component.py",
    "chars": 13164,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/back_tools.py",
    "chars": 2654,
    "preview": "import contextlib\nfrom typing import Any, Callable, Dict, Iterable, Optional, Union\n\nimport joblib\nimport numpy as np\nfr"
  },
  {
    "path": "ambrosia/tools/bin_intervals.py",
    "chars": 28241,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/configs.py",
    "chars": 1561,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/decorators.py",
    "chars": 575,
    "preview": "import inspect\nfrom functools import wraps\n\n\ndef filter_kwargs(func):\n    @wraps(func)\n    def wrapper(*args, **kwargs):"
  },
  {
    "path": "ambrosia/tools/empirical_tools.py",
    "chars": 20102,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/import_tools.py",
    "chars": 1731,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/knn.py",
    "chars": 3725,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/log.py",
    "chars": 2648,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/pvalue_tools.py",
    "chars": 9209,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/split_tools.py",
    "chars": 18073,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/stat_criteria.py",
    "chars": 10392,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/stratification.py",
    "chars": 5793,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/theoretical_tools.py",
    "chars": 31724,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/tools.py",
    "chars": 40529,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/tools/type_checks.py",
    "chars": 5562,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/types.py",
    "chars": 3904,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "ambrosia/version.py",
    "chars": 797,
    "preview": "#  Copyright 2022 MTS (Mobile Telesystems)\n#\n#  Licensed under the Apache License, Version 2.0 (the \"License\");\n#  you m"
  },
  {
    "path": "context7.json",
    "chars": 107,
    "preview": "{\n  \"url\": \"https://context7.com/mobiletelesystems/ambrosia\",\n  \"public_key\": \"pk_cSA3cXMaOugl1CKQyOg54\"\n}\n"
  },
  {
    "path": "docs/Makefile",
    "chars": 638,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
  },
  {
    "path": "docs/make.bat",
    "chars": 804,
    "preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
  },
  {
    "path": "docs/requirements.txt",
    "chars": 86,
    "preview": "sphinx>=7.0,<9.0\nsphinx_copybutton\nnbsphinx\nnbsphinx_link\nnumpydoc>=1.6.0,<1.9.0\nfuro\n"
  },
  {
    "path": "docs/source/_static/css/style.css",
    "chars": 205,
    "preview": "svg {\n    width: 100%;\n}\n\n@media(min-width:97em) {\n    html {\n        font-size: 100%\n    }\n}\n\n.strike {\n    text-decora"
  },
  {
    "path": "docs/source/ab_cases/kion_ab.rst",
    "chars": 71,
    "preview": "KION A/B case\n-------------\n\n.. raw:: html\n   :file: data/kion_ab.html\n"
  },
  {
    "path": "docs/source/ab_cases.rst",
    "chars": 139,
    "preview": "A/B Testing Cases\n-----------------\n\nScenarios for using *Ambrosia* in A/B testing cases\n\n.. toctree::\n   :maxdepth: 1\n\n"
  },
  {
    "path": "docs/source/ambrosia_elements/advanced_transformations.rst",
    "chars": 549,
    "preview": "Advanced metric transformations\n-------------------------------\n\n.. currentmodule:: ambrosia.preprocessing\n\n.. autosumma"
  },
  {
    "path": "docs/source/ambrosia_elements/aggregation.rst",
    "chars": 234,
    "preview": "Aggregation\n-----------\n\n.. currentmodule:: ambrosia.preprocessing\n\n.. autosummary::\n   :nosignatures:\n\n   AggregatePrep"
  },
  {
    "path": "docs/source/ambrosia_elements/designer.rst",
    "chars": 900,
    "preview": "=================\nExperiment Design\n=================\n\n*Ambrosia* offers tools for calculating A/B test parameters such "
  },
  {
    "path": "docs/source/ambrosia_elements/preprocessing.rst",
    "chars": 831,
    "preview": "==================\nData Preprocessing\n==================\n\nThe tools from this subsection allow to automatically perform "
  },
  {
    "path": "docs/source/ambrosia_elements/processor.rst",
    "chars": 357,
    "preview": "Preprocessor\n------------\n\n.. currentmodule:: ambrosia.preprocessing\n\n.. autosummary::\n   :nosignatures:\n\n   Preprocesso"
  },
  {
    "path": "docs/source/ambrosia_elements/robust.rst",
    "chars": 407,
    "preview": "Outliers removal\n----------------\n\n.. currentmodule:: ambrosia.preprocessing\n\n.. autosummary::\n   :nosignatures:\n\n   Rob"
  },
  {
    "path": "docs/source/ambrosia_elements/simple_transformation.rst",
    "chars": 429,
    "preview": "Simple metric transformations\n-----------------------------\n\n.. currentmodule:: ambrosia.preprocessing\n\n.. autosummary::"
  },
  {
    "path": "docs/source/ambrosia_elements/splitter.rst",
    "chars": 776,
    "preview": "================\nGroups Splitting\n================\n\nThe following classes and functions helps to split batch data into\ne"
  },
  {
    "path": "docs/source/ambrosia_elements/tester.rst",
    "chars": 845,
    "preview": "==================\nEffect Measurement\n==================\n\nTools for assessing the statistical significance of completed "
  },
  {
    "path": "docs/source/ambrosia_nutshell.rst",
    "chars": 1639,
    "preview": ".. role:: bolditalic\n    :class: bolditalic\n\n.. brief description \n\nA/B testing with *Ambrosia* in a Nutshell\n----------"
  },
  {
    "path": "docs/source/authors.rst",
    "chars": 60,
    "preview": ".. include:: ../../README.rst\n    :start-after: contributors"
  },
  {
    "path": "docs/source/changelog.rst",
    "chars": 32,
    "preview": ".. include:: ../../CHANGELOG.rst"
  },
  {
    "path": "docs/source/conf.py",
    "chars": 3320,
    "preview": "# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here"
  },
  {
    "path": "docs/source/contributing.rst",
    "chars": 35,
    "preview": ".. include:: ../../CONTRIBUTING.rst"
  },
  {
    "path": "docs/source/develop.rst",
    "chars": 187,
    "preview": ".. include:: ../../README.rst\n    :start-after: develop\n    :end-before: contributors\n\n\n\n.. toctree::\n    :maxdepth: 1\n "
  },
  {
    "path": "docs/source/index.rst",
    "chars": 1075,
    "preview": ":hide-toc:\n\n.. include:: ../../README.rst\n    :end-before: shields end\n\n.. include:: ../../README.rst\n    :start-after: "
  },
  {
    "path": "docs/source/installation.rst",
    "chars": 177,
    "preview": ".. include:: ../../README.rst\n    :start-after: install\n    :end-before: usage\n\n.. admonition:: Python versions support\n"
  },
  {
    "path": "docs/source/nb_pandas_examples.rst",
    "chars": 455,
    "preview": "Pandas Data Examples\n--------------------\n\n.. toctree::\n    :maxdepth: 1\n\n    /pandas_examples/00_preprocessing\n    /pan"
  },
  {
    "path": "docs/source/nb_spark_examples.rst",
    "chars": 183,
    "preview": "Spark Data Examples\n-------------------\n\n.. toctree::\n    :maxdepth: 1\n\n    /spark_examples/07_spark_designer\n    /spark"
  },
  {
    "path": "docs/source/pandas_examples/00_preprocessing.nblink",
    "chars": 56,
    "preview": "{\n  \"path\": \"../../../examples/00_preprocessing.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/01_vr_transformations.nblink",
    "chars": 61,
    "preview": "{\n  \"path\": \"../../../examples/01_vr_transformations.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/02_preprocessor.nblink",
    "chars": 55,
    "preview": "{\n  \"path\": \"../../../examples/02_preprocessor.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/03_pandas_designer.nblink",
    "chars": 58,
    "preview": "{\n  \"path\": \"../../../examples/03_pandas_designer.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/04_binary_design.nblink",
    "chars": 57,
    "preview": "{\n  \"path\": \"../../../examples/04_binary_design.ipynb\"\n}\n"
  },
  {
    "path": "docs/source/pandas_examples/05_pandas_splitter.nblink",
    "chars": 58,
    "preview": "{\n  \"path\": \"../../../examples/05_pandas_splitter.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/06_pandas_tester.nblink",
    "chars": 56,
    "preview": "{\n  \"path\": \"../../../examples/06_pandas_tester.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/10_synthetic_experiment_full_pipeline_short.nblink",
    "chars": 83,
    "preview": "{\n  \"path\": \"../../../examples/10_synthetic_experiment_full_pipeline_short.ipynb\"\n}"
  },
  {
    "path": "docs/source/pandas_examples/11_cuped_example.nblink",
    "chars": 56,
    "preview": "{\n  \"path\": \"../../../examples/11_cuped_example.ipynb\"\n}"
  },
  {
    "path": "docs/source/security.rst",
    "chars": 31,
    "preview": ".. include:: ../../SECURITY.rst"
  },
  {
    "path": "docs/source/spark_examples/07_spark_designer.nblink",
    "chars": 57,
    "preview": "{\n  \"path\": \"../../../examples/07_spark_designer.ipynb\"\n}"
  },
  {
    "path": "docs/source/spark_examples/08_spark_splitter.nblink",
    "chars": 57,
    "preview": "{\n  \"path\": \"../../../examples/08_spark_splitter.ipynb\"\n}"
  },
  {
    "path": "docs/source/spark_examples/09_spark_tester.nblink",
    "chars": 55,
    "preview": "{\n  \"path\": \"../../../examples/09_spark_tester.ipynb\"\n}"
  },
  {
    "path": "docs/source/usage.rst",
    "chars": 378,
    "preview": ".. role:: bolditalic\n    :class: bolditalic\n\n.. include:: ../../README.rst\n    :start-after: usage\n    :end-before: deve"
  },
  {
    "path": "examples/00_preprocessing.ipynb",
    "chars": 92799,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"8f4933e1\",\n   \"metadata\": {},\n   \"source\": [\n    \"# *Ambrosia* d"
  },
  {
    "path": "examples/01_vr_transformations.ipynb",
    "chars": 32025,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"bdd654b0\",\n   \"metadata\": {},\n   \"source\": [\n    \"# *Ambrosia* a"
  },
  {
    "path": "examples/02_preprocessor.ipynb",
    "chars": 21664,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a4759f2c\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Chain ``Prep"
  },
  {
    "path": "examples/03_pandas_designer.ipynb",
    "chars": 59097,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"211f7b78\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Example of t"
  },
  {
    "path": "examples/04_binary_design.ipynb",
    "chars": 63719,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"247a2c2a\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Binary metri"
  },
  {
    "path": "examples/05_pandas_splitter.ipynb",
    "chars": 85737,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f9f85dba\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Example of t"
  },
  {
    "path": "examples/06_pandas_tester.ipynb",
    "chars": 33751,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"2e99ccd3\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Example of t"
  },
  {
    "path": "examples/07_spark_designer.ipynb",
    "chars": 23515,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"8dc91a90\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Overview of "
  },
  {
    "path": "examples/08_spark_splitter.ipynb",
    "chars": 21728,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1e2187de\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Overview of "
  },
  {
    "path": "examples/09_spark_tester.ipynb",
    "chars": 9345,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"565e7f2f\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Overview of "
  },
  {
    "path": "examples/10_synthetic_experiment_full_pipeline_short.ipynb",
    "chars": 28756,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"46382b8f\",\n   \"metadata\": {},\n   \"source\": [\n    \"# *Ambrosia* i"
  },
  {
    "path": "examples/11_cuped_example.ipynb",
    "chars": 20960,
    "preview": "{\n    \"cells\": [\n        {\n            \"cell_type\": \"markdown\",\n            \"id\": \"00946f46\",\n            \"metadata\": {}"
  },
  {
    "path": "examples/12_ratio_metrics_and_custom_functions.ipynb",
    "chars": 32357,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"j7gnymneo8t\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Ratio-мет"
  },
  {
    "path": "examples/_examples_configs/aggregator.json",
    "chars": 142,
    "preview": "{\"aggregation_params\": {\"watched\": \"sum\", \"sessions\": \"max\", \"gender\": \"simple\", \"platform\": \"mode\"}, \"groupby_columns\":"
  },
  {
    "path": "examples/_examples_configs/boxcox_tranformer.json",
    "chars": 62,
    "preview": "{\"column_names\": [\"watched\"], \"lambda_\": [0.4314844480895849]}"
  },
  {
    "path": "examples/_examples_configs/cuped_config.json",
    "chars": 152,
    "preview": "{\"target_column\": \"target\", \"transformed_name\": \"target_cuped\", \"covariate_column\": \"feature_2\", \"theta\": 3.085966714908"
  },
  {
    "path": "examples/_examples_configs/designer_config.yaml",
    "chars": 175,
    "preview": "!designer {effects: [1.05, 1.1, 1.2], first_type_errors: [0.01, 0.05], method: theory,\n  metrics: [sum_dur, ln_vod_cnt],"
  },
  {
    "path": "examples/_examples_configs/kion_cuped_params.json",
    "chars": 143,
    "preview": "{\"target_column\": \"ln_vod_cnt\", \"transformed_name\": null, \"covariate_column\": \"sum_dur\", \"theta\": 5.0821173038763154e-08"
  },
  {
    "path": "examples/_examples_configs/multicuped_coef.json",
    "chars": 198,
    "preview": "{\"target_column\": \"target\", \"transformed_name\": \"target_multicuped\", \"covariate_columns\": [\"feature_2\", \"feature_3\"], \"t"
  },
  {
    "path": "examples/_examples_configs/multicuped_config.json",
    "chars": 198,
    "preview": "{\"target_column\": \"target\", \"transformed_name\": \"target_multicuped\", \"covariate_columns\": [\"feature_2\", \"feature_3\"], \"t"
  },
  {
    "path": "examples/_examples_configs/params_cuped.json",
    "chars": 151,
    "preview": "{\"target_column\": \"watched\", \"transformed_name\": \"watched_cuped\", \"covariate_column\": \"audio\", \"theta\": 23.5638039428584"
  },
  {
    "path": "examples/_examples_configs/preprocessor.json",
    "chars": 478,
    "preview": "{\"AggregatePreprocessor_1\": {\"aggregation_params\": {\"watched\": \"sum\", \"audio\": \"sum\", \"gender\": \"simple\", \"platform\": \"m"
  },
  {
    "path": "examples/_examples_configs/robust.json",
    "chars": 100,
    "preview": "{\"tail\": \"right\", \"column_names\": [\"watched\"], \"alpha\": [0.01], \"quantiles\": [[1049.5734329308516]]}"
  },
  {
    "path": "examples/_examples_configs/splitter_config.yaml",
    "chars": 89,
    "preview": "!splitter\nfit_columns: null\ngroups_size: 322\nid_column: object_id\nstrat_columns:\n- l\n- e\n"
  },
  {
    "path": "examples/test_installation.ipynb",
    "chars": 4803,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Тест установки Ambrosia\\n\",\n    \""
  },
  {
    "path": "poetry.toml",
    "chars": 45,
    "preview": "[virtualenvs]\ncreate = true\nin-project = true"
  },
  {
    "path": "pyproject.toml",
    "chars": 2852,
    "preview": "[tool.poetry]\nname = \"Ambrosia\"\nversion = \"0.5.1\"\ndescription = \"A Python library for working with A/B tests.\"\nlicense ="
  },
  {
    "path": "setup.cfg",
    "chars": 1239,
    "preview": "[coverage:run]\n# the name of the data file to use for storing or reporting coverage.\ndata_file = reports/.coverage.data\n"
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/configs/designer_config.yaml",
    "chars": 85,
    "preview": "!designer\neffects:\n    - 1.1\nsizes:\n    - 500\n    - 1000\nmethod: theory\nmetrics: LTV\n"
  },
  {
    "path": "tests/conftest.py",
    "chars": 6868,
    "preview": "from typing import Dict, List\n\nimport numpy as np\nimport pandas as pd\nimport pyspark\nimport pytest\nimport scipy.stats as"
  },
  {
    "path": "tests/test_aggregate.py",
    "chars": 2363,
    "preview": "import os\n\nimport pytest\n\nfrom ambrosia.preprocessing import AggregatePreprocessor\n\n\n@pytest.mark.smoke()\ndef test_inst("
  },
  {
    "path": "tests/test_cuped.py",
    "chars": 3052,
    "preview": "import os\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom ambrosia.preprocessing import Cuped, MultiCuped\n\ns"
  },
  {
    "path": "tests/test_data/ltv_retention.csv",
    "chars": 223078,
    "preview": "LTV,retention\n38.00489086282513,0.0\n70.58806871196717,1.0\n13.585602026263787,1.0\n19.813550364522207,0.0\n207.213002906203"
  },
  {
    "path": "tests/test_data/nonlin_var_table.csv",
    "chars": 221843,
    "preview": "feature_1,feature_2,feature_3,target\n1.3335272566924914,34.18422541475197,24.730372402735654,1378.481590494959\n6.0976610"
  },
  {
    "path": "tests/test_data/pipeline_test.csv",
    "chars": 1919644,
    "preview": "id,gender,watched,audio,day,platform\n0,Male,7.912889151471635,2.2109727822571457,1,web\n1,Male,6.678690213549574,0.020715"
  },
  {
    "path": "tests/test_data/result_ltv_ret_conv.csv",
    "chars": 65756,
    "preview": ",retention,conversions,ltv,group\n0,1.0,0.0,23.762541950141355,A\n1,0.0,1.0,396.66279064238284,A\n2,0.0,0.0,400.77766939065"
  },
  {
    "path": "tests/test_data/robust_moments.csv",
    "chars": 820,
    "preview": "transf_name,mean,std\nlog_feature_1,0.6393200791700496,1.1097352611771127\nlog_feature_2,2.009838264143796,0.9143088107724"
  },
  {
    "path": "tests/test_data/splitter_dataframe.csv",
    "chars": 619356,
    "preview": ",index,m,a,b,l,e,sub_index\n0,0,0.0,1.7189115558748598,0.03349427912911918,1,1,6142\n1,1,0.0,-0.06702272173244096,0.171568"
  },
  {
    "path": "tests/test_data/stratification_data.csv",
    "chars": 34960,
    "preview": ",metric,gender,retention,id\n0,4.880947823414682,Male,1,1\n1,4.645748749578122,Male,0,8\n2,1.7508482879507357,Male,0,15\n3,3"
  },
  {
    "path": "tests/test_data/var_table.csv",
    "chars": 222831,
    "preview": "feature_1,feature_2,feature_3,target\n-2.4269160848601494,5.575498356074747,43.50532332352297,187.3854588509813\n-2.745188"
  },
  {
    "path": "tests/test_data/watch_result.csv",
    "chars": 342751,
    "preview": "id,watched,group,day\n1708,349.5811328983496,A,1\n24,124.22416918817862,A,1\n1692,14.812921522151456,A,1\n185,179.6072835075"
  },
  {
    "path": "tests/test_data/watch_result_agg.csv",
    "chars": 44875,
    "preview": "id,watched,group\n6,597.8333622096858,A\n11,549.3142339732931,A\n20,564.4019415744934,A\n21,248.73535770545743,A\n23,926.0489"
  },
  {
    "path": "tests/test_data/week_metrics.csv",
    "chars": 1348480,
    "preview": "id,gender,watched,sessions,day,platform\n0,Male,28.440846140406855,4,1,android\n1,Female,1.825271367392974,2,1,ios\n2,Femal"
  },
  {
    "path": "tests/test_designer.py",
    "chars": 11102,
    "preview": "import os\nfrom typing import Dict, List\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nimport yaml\nfrom pytest_la"
  },
  {
    "path": "tests/test_ml_variance_reducer.py",
    "chars": 1954,
    "preview": "import os\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom ambrosia.preprocessing import MLVarianceReducer\n\nS"
  },
  {
    "path": "tests/test_preprocessor.py",
    "chars": 6886,
    "preview": "import os\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom ambrosia.preprocessing import Preprocessor\n\nstore_"
  },
  {
    "path": "tests/test_robust.py",
    "chars": 4149,
    "preview": "import os\nfrom typing import List\n\nimport numpy as np\nimport pytest\n\nfrom ambrosia.preprocessing import IQRPreprocessor,"
  },
  {
    "path": "tests/test_splitter.py",
    "chars": 10855,
    "preview": "import os\nfrom typing import List\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nimport yaml\n\nfrom ambrosia.split"
  },
  {
    "path": "tests/test_stratification.py",
    "chars": 1970,
    "preview": "import numpy as np\nimport pytest\nfrom pytest_lazy_fixtures import lf\n\nfrom ambrosia.tools.ab_abstract_component import E"
  },
  {
    "path": "tests/test_tester.py",
    "chars": 17412,
    "preview": "from typing import List, Tuple\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom ambrosia.tester import Tester"
  },
  {
    "path": "tests/test_transformers.py",
    "chars": 5026,
    "preview": "import os\nfrom typing import List\n\nimport numpy as np\nimport pytest\n\nfrom ambrosia.preprocessing import BoxCoxTransforme"
  }
]

// ... and 1 more file (download for full content)

About this extraction

This page contains the full source code of the MobileTeleSystems/Ambrosia GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 154 files (17.8 MB), approximately 1.6M tokens, and a symbol index with 586 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
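The per-file index shown above can also be consumed programmatically. Below is a minimal sketch, assuming the index has been saved on its own as a JSON array of {"path", "chars", "preview"} objects in the shape shown above; the file name ambrosia_index.json is hypothetical, not something GitExtract itself produces.

    import json
    from pathlib import Path

    # Hypothetical file name: assumes the index array above was saved
    # separately as a JSON array of {"path", "chars", "preview"} objects.
    INDEX_FILE = Path("ambrosia_index.json")

    def load_index(index_file: Path) -> list[dict]:
        """Parse the extraction index into a list of file records."""
        with index_file.open(encoding="utf-8") as fh:
            return json.load(fh)

    def summarize(records: list[dict]) -> None:
        """Group character counts by top-level directory."""
        totals: dict[str, int] = {}
        for rec in records:
            top = rec["path"].split("/", 1)[0]
            totals[top] = totals.get(top, 0) + rec["chars"]
        for top, chars in sorted(totals.items(), key=lambda kv: -kv[1]):
            print(f"{top:30s} {chars:>12,} chars")

    if __name__ == "__main__":
        records = load_index(INDEX_FILE)
        # Example filter: only the runnable notebook examples.
        notebooks = [r for r in records if r["path"].endswith(".ipynb")]
        print(f"{len(records)} files indexed, {len(notebooks)} notebooks")
        summarize(records)

Grouping by top-level directory is just one convenient cut; the same records can be filtered by extension or size before feeding selected files to an LLM context window.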

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
