[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n*.pyc\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\noutput/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n\n# vs settings\n.vscode/\n\n# data\ndata/**\n!data/open_source_logs/\n!data/open_source_logs/**\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2019 Federico Zaiter\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "## LogClass\nThis repository provides an open-source toolkit for LogClass framework from W. Meng et al., \"[LogClass: Anomalous Log Identification and Classification with Partial Labels](https://ieeexplore.ieee.org/document/9339940),\" in IEEE Transactions on Network and Service Management, doi: 10.1109/TNSM.2021.3055425.\n\nLogClass automatically and accurately detects and classifies anomalous logs based on partial labels.\n\n### Table of Contents\n\n[LogClass](#logclass)\n\n- [Table of Contents](#table-of-contents)\n- [Requirements](#requirements)\n- [Quick Start](#quick-start)\n\t- [Run LogClass](#run-logclass)\n\t- [Arguments](#arguments)\n\t- [Directory Structure](#directory-structure)\n\t- [Datasets](#datasets)\n- [How to](#how-to)\n\t- [How to add a new dataset](#how-to-add-a-new-dataset)\n\t\t- [Preprocessed Logs Format](#preprocessed-logs-format)\n\t- [How to run a new experiment](#how-to-run-a-new-experiment)\n\t\t- [Custom experiment](#custom-experiment)\n\t- [How to add a new model](#how-to-add-a-new-model)\n\t- [How to extract a new feature](#how-to-extract-a-new-feature)\n- [Included Experiments](#included-experiments)\n\t- [Testing PULearning](#testing-pulearning)\n\t- [Testing Anomaly Classification](#testing-anomaly-classification)\n\t- [Global LogClass](#global-logclass)\n\t- [Binary training/inference](#binary-traininginference)\n- [Citing](#citing)\n\n​\t\t\n\n### Requirements\n\nRequirements are listed in `requirements.txt`. To install these, run:\n\n```\npip install -r requirements.txt\n```\n\n\n\n### Quick Start\n\n#### Run LogClass\n\nSeveral example experiments using LogClass are included in this repository. \n\nHere is an example to run one of them -  training of the global experiment doing anomaly detection and classification.  Run the following command in the home directory of this project: \n\n```\npython -m LogClass.logclass --train --kfold 3 --logs_type \"bgl\" --raw_logs \"./Data/RAS_LOGS\" --report macro\n```\n\n\n\n#### Arguments\n\n```\npython -m LogClass.logclass --help\nusage: logclass.py [-h] [--raw_logs raw_logs] [--base_dir base_dir]\n                   [--logs logs] [--models_dir models_dir]\n                   [--features_dir features_dir] [--logs_type logs_type]\n                   [--kfold kfold] [--healthy_label healthy_label]\n                   [--features features [features ...]]\n                   [--report report [report ...]]\n                   [--binary_classifier binary_classifier]\n                   [--multi_classifier multi_classifier] [--train] [--force]\n                   [--id id] [--swap]\n\nRuns binary classification with PULearning to detect anomalous logs.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --raw_logs raw_logs   input raw logs file path (default: None)\n  --base_dir base_dir   base output directory for pipeline output files\n                        (default: ['{your_logclass_dir}\\\\output'])\n  --logs logs           input logs file path and output for raw logs\n                        preprocessing (default: None)\n  --models_dir models_dir\n                        trained models input/output directory path (default:\n                        None)\n  --features_dir features_dir\n                        trained features_dir input/output directory path\n                        (default: None)\n  --logs_type logs_type\n                        Input type of logs. 
 (default: ['open_Apache'])\n  --kfold kfold         kfold cross validation (default: None)\n  --healthy_label healthy_label\n                        the labels of unlabeled logs (default: ['unlabeled'])\n  --features features [features ...]\n                        Features to be extracted from the log messages.\n                        (default: ['tfilf'])\n  --report report [report ...]\n                        Reports to be generated from the model and its\n                        predictions. (default: ['confusion_matrix'])\n  --binary_classifier binary_classifier\n                        Binary classifier to be used as anomaly detector.\n                        (default: ['pu_learning'])\n  --multi_classifier multi_classifier\n                        Multi-class classifier to classify anomalies. (default:\n                        ['svm'])\n  --train               If set, logclass will train on the given data.\n                        Otherwise it will run inference on it. (default: False)\n  --force               Force training overwriting previous output with same\n                        id. (default: False)\n  --id id               Experiment id. Automatically generated if not\n                        specified. (default: None)\n  --swap                Swap testing/training data in kfold cross validation.\n                        (default: False)\n```\n\n\n\n#### Directory Structure\n\n```\n.\n├── data\n│   └── open_source_logs\t\t# Included open-source log datasets\n│       ├── Apache\n│       ├── bgl\n│       ├── hadoop\n│       ├── hdfs\n│       ├── hpc\n│       ├── proxifier\n│       └── zookeeper\n├── output\t\t\t\t# Example output folder\n│   ├── preprocessed_logs\t\t# Saved preprocessed logs for reuse\n│   │   ├── open_Apache.txt\n│   │   └── open_bgl.txt\n│   └── train_multi_open_bgl_2283696426\t# Example experiment output\n│       ├── best_params.json\n│       ├── features\n│       │   ├── tfidf.pkl\n│       │   └── vocab.pkl\n│       ├── models\n│       │   └── multi.pkl\n│       └── results.csv\n├── feature_engineering\n│   ├── __init__.py\n│   ├── length.py\n│   ├── tf_idf.py\n│   ├── tf_ilf.py\n│   ├── tf.py\n│   ├── registry.py\n│   ├── vectorizer.py\t\t\t# Log message vectorizing utilities\n│   └── utils.py\n├── models\n│   ├── __init__.py\n│   ├── base_model.py\t\t\t# BaseModel class extended by all models\n│   ├── pu_learning.py\n│   ├── regular.py\n│   ├── svm.py\n│   ├── binary_registry.py\n│   └── multi_registry.py\n├── preprocess\n│   ├── __init__.py\n│   ├── bgl_preprocessor.py\n│   ├── open_source_logs.py\n│   ├── registry.py\n│   └── utils.py\n├── reporting\n│   ├── __init__.py\n│   ├── accuracy.py\n│   ├── confusion_matrix.py\n│   ├── macrof1.py\n│   ├── microf1.py\n│   ├── multi_class_acc.py\n│   ├── top_k_svm.py\n│   ├── bb_registry.py\n│   └── wb_registry.py\n├── puLearning\t\t\t\t# PULearning third party implementation\n│   ├── __init__.py\n│   └── puAdapter.py\n├── __init__.py\n├── LICENSE\n├── README.md\n├── requirements.txt\n├── init_params.py\t\t\t# Parses arguments, initializes global parameters\n├── logclass.py\t\t\t\t# Performs training and inference of LogClass\n├── test_pu.py\t\t\t\t# Compares robustness of LogClass\n├── train_multi.py\t\t\t# Trains LogClass for anomaly classification\n├── train_binary.py\t\t\t# Trains LogClass for log anomaly detection\n├── run_binary.py\t\t\t# Loads trained LogClass and detects anomalies\n├── decorators.py\n└── utils.py\n```\n\n#### Datasets\n\nIn this repository we include various
 [open-source logs datasets](https://github.com/logpai/loghub) in the `data` folder, as well as their corresponding preprocessing module in the `preprocess` package. Additionally, there is another preprocessor provided for [BGL logs data](https://www.usenix.org/cfdr-data#hpc4), which can be downloaded directly from [here](https://www.usenix.org/sites/default/files/4372-intrepid_ras_0901_0908_scrubbed.zip.tar).\n\n\n\n### How to\n\nThis section explains how to use and extend this toolkit.\n\n#### How to add a new dataset\n\nAdd a new preprocessor module in the `preprocess` package.\n\nThe module should implement a function that follows the `preprocess_datset(params)` function template included in all preprocessors. It should be decorated with `@register(f\"{dataset_name}\")`, e.g. `open_Apache`, and call the `process_logs(input_source, output, process_line)` function. The `process_line` function should be defined in the same module as well.\n\nWhen done, add the module name to the `__init__.py` list of modules in the `preprocess` package, and add the name used in the decorator to the argparse options as the logs type. For example, `--logs_type open_Apache`.\n\n##### Preprocessed Logs Format\n\nThis format is ensured by the `process_line` function, which is to be defined in each preprocessor.\n\n```python\ndef process_line(line):\n    \"\"\" \n    Processes a given line from the raw logs.\n\n    Parameter\n    ---------\n    line : str\n        One line from the raw logs.\n\n    Returns\n    -------\n    str\n        String with the format f\"{label} {msg}\" where the `label` indicates whether\n        the log is anomalous and if so, which anomaly category, and `msg` is the\n        filtered log message without parameters.\n\n    \"\"\"\n    # your code\n```\n\nTo filter the log message parameters, use the `remove_parameters(msg)` function from the `utils.py` module in the `preprocess` package.\n\n#### How to run a new experiment\n\nSeveral example experiments are included in the repository. The best way to start creating a new one is to follow the example of the others, especially the main function structure and its experiment function, be it training or testing focused.\n\nThe key things an experiment should include are the following:\n\n- **Args parsing**: create custom `init_args()` and `parse_args(args)` functions for your experiment that call `init_main_args()` from the `init_params.py` module.\n\n- **Output file handling**: use the `file_handling(params)` function (see `utils.py` in the main directory of the repo).\n\n- **Preprocessing raw logs**: if the `--raw_logs` argument is provided, get the preprocessing function using the `--logs_type` argument from the `preprocess` module registry by calling the `get_preprocessor(f'{logs_type}')` function.\n\n- **Load logs**: call the `load_logs(params, ...)` function to get the preprocessed logs from the directory specified in the `--logs` parameter. It returns a tuple of x, y, and the target label names.\n\n\n##### Custom experiment\n\nThese are the main functions to consider for a custom experiment, each usually in its own function.\n\n**Feature Engineering**\n\n- `extract_features(x, params)` from the `feature_engineering` package's `utils.py` module: Extracts all features specified in the `--features` parameter from the preprocessed logs (see the sketch after this list). See the function definition for further details.\n- `build_vocabulary(x)` from the `feature_engineering` package's `vectorizer.py` module: Divides logs into tokens and creates the vocabulary. See the function definition for further details.\n- `log_to_vector(x, vocabulary)` from the `feature_engineering` package's `vectorizer.py` module: Vectorizes each log message using a dict of words to index. See the function definition for further details.\n- `get_features_vector(x_vector, vocabulary, params)` from the `feature_engineering` package's `utils.py` module: Extracts all specified features from the vectorized logs. See the function definition for further details.\n
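\nA minimal sketch of how these fit together in a custom experiment, assuming `x_train` and `x_test` are lists of preprocessed log lines and `params` comes from your `parse_args(init_args())` (`TestingParameters` is the context manager the included experiments wrap test-feature extraction with, so the saved vocabulary and feature dicts are reused instead of refitted):\n\n```python\nfrom LogClass.feature_engineering.utils import extract_features\nfrom LogClass.utils import TestingParameters\n\n# Fit: builds and saves the vocabulary and feature dicts (params['train'] is set)\nx_train_feats, vocabulary = extract_features(x_train, params)\n# Test: reuse what was just saved instead of fitting again\nwith TestingParameters(params):\n    x_test_feats, _ = extract_features(x_test, params)\n```\n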
\n\n**Model training and inference**\n\nEach model extends the `BaseModel` class from the `base_model.py` module. See the class definition for further details.\n\nThere are two registries in the `models` package: one for binary models meant to be used for anomaly detection, and another one for multi-class models to classify the anomalies. Get the constructor for either using the specified `--binary_classifier` or `--multi_classifier` argument, e.g. `binary_classifier_registry.get_binary_model(params['binary_classifier'])`.\n\nBy extending `BaseModel`, the model is always saved when it fits the data. Load a model by calling its `load()` method. It will use the `params` attribute of the `BaseModel` class to get the experiment id and load its corresponding model.\n\nTo save the params of an experiment, call the `save_params(params)` function from the `utils.py` module in the main directory. Call `load_params(params)` when only using the module for inference.\n\n**Reporting**\n\nThere are two kinds of reports, black box and white box, with a registry for each in the `reporting` module.\n\nTo use them, call the corresponding registry and obtain the report wrapper, e.g. `black_box_report_registry.get_bb_report('acc')`.\n\nTo add new reports, see the analogous explanation for [models](#how-to-add-a-new-model) or [features](#how-to-extract-a-new-feature) below.\n\n**Saving results**\n\nAmong the provided experiments, `test_pu.py` and `train_multi.py` save their results by creating a dict of column names to lists of results. Then the `save_results(results, params)` function from the `utils.py` module is used to save them to a CSV file.\n\n\n\n#### How to add a new model\n\nTo add a new model, implement a class that extends the `BaseModel` class and include its module in the `models` package. See the class definition for further details.\n\nDecorate a function that calls its constructor and returns an instance of the model with the `@register(f\"{model_name}\")` decorator from either the `binary_registry.py` or the `multi_registry.py` module of the `models` package, depending on whether the model is for anomaly detection or anomaly classification.\n\nFinally, make sure you add the module's name to the `__init__.py` module of the `models` package, and the model option in the `init_params.py` module within the choices for either the `--binary_classifier` or `--multi_classifier` argument. This way the constructor can be obtained by doing `binary_classifier_registry.get_binary_model(params['binary_classifier'])`, for example.\n
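\nFor instance, a minimal sketch of a new binary model module, as a hypothetical `models/my_model.py` (the `LogisticRegression` estimator is just a stand-in; the structure mirrors `regular.py`):\n\n```python\nimport os\nimport pickle\n\nfrom sklearn.linear_model import LogisticRegression\n\nfrom .base_model import BaseModel\nfrom .binary_registry import register\n\n\nclass MyModelWrapper(BaseModel):\n    def save(self, **kwargs):\n        # Pickle the fitted model under the experiment's models_dir\n        path = os.path.join(self.params['models_dir'], \"my_model.pkl\")\n        with open(path, 'wb') as fp:\n            pickle.dump(self.model, fp)\n\n    def load(self, **kwargs):\n        path = os.path.join(self.params['models_dir'], \"my_model.pkl\")\n        with open(path, 'rb') as fp:\n            self.model = pickle.load(fp)\n\n\n@register(\"my_model\")\ndef instantiate_my_model(params, **kwargs):\n    return MyModelWrapper(LogisticRegression(**kwargs), params)\n```\n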
\n\n#### How to extract a new feature\n\nTo add a new feature extractor, create a module in the `feature_engineering` package that wraps your feature extractor function and returns the features. See the `length.py` module as an example for further details.\n\nAs in the other cases, decorate the wrapper function with `@register(f\"{feature_name}\")` and make sure you add the module name to the `__init__.py` of the `feature_engineering` package, and the feature as an option for the `--features` argument in the `init_params.py` module.\n
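\nA minimal sketch of such a module, as a hypothetical `feature_engineering/my_feature.py` modeled on `length.py` (the feature itself, the number of distinct tokens per message, is just an illustration):\n\n```python\nfrom .registry import register\nimport numpy as np\n\n\n@register(\"my_feature\")\ndef create_distinct_tokens_feature(params, input_vector, **kwargs):\n    # input_vector holds one list of word indexes per log message.\n    # Return a numpy array of shape (number_of_logs, N) so that\n    # get_features_vector can hstack it with the other features.\n    distinct_counts = np.array([len(set(line)) for line in input_vector])\n    return distinct_counts.reshape(-1, 1)\n```\n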
\n\n\n### Included Experiments\n\nA high-level overview of each of the experiments included in the repository.\n\n#### Testing PULearning\n\n`test_pu.py` is mainly focused on proving the robustness of LogClass for anomaly detection when only few logs are labeled as anomalous.\n\nIt compares PULearning+RandomForest with any other given anomaly detection algorithm; if none is specified, it defaults to a plain RandomForest. Using the given data, it starts with only healthy logs in the unlabeled data and gradually increases the share of anomalous logs in it up to 10%. To test PULearning, run the following command from the parent directory of this project:\n\n```\npython -m LogClass.test_pu --logs_type \"bgl\" --raw_logs \"./Data/RAS from Weibin/RAS_raw_label.dat\" --binary_classifier regular --ratio 8 --step 1 --top_percentage 11 --kfold 3\n```\n\nThis first preprocesses the logs. Then, for each kfold iteration, it performs feature extraction and forces a 1:8 ratio of anomalous:healthy logs. Finally, with a step of 1%, it goes from 0% to 10% anomalous logs in the unlabeled set and compares the accuracy of both anomaly detection algorithms.\n\n#### Testing Anomaly Classification\n\n`train_multi.py` is focused on showing the robustness of LogClass' TF-ILF feature extraction approach for multi-class anomaly classification. The main detail is that when using `--kfold N`, one can swap the training/testing data slices using the `--swap` flag. This way, for instance, it can train on 10% of the logs and test on the remaining 90% by pairing `--swap` with `--kfold 10`. To run such an experiment, use the following command from the parent directory of the project:\n\n```\npython -m LogClass.train_multi --logs_type \"open_Apache\" --raw_logs \"./Data/open_source_logs/\" --kfold 10 --swap\n```\n\n#### Global LogClass\n\n`logclass.py` is set up to do either training or inference with the learned models depending on the flags. For example, to preprocess and train, run the following command from the parent directory of this project:\n\n```\npython -m LogClass.logclass --train --kfold 3 --logs_type \"bgl\" --raw_logs \"./Data/RAS_LOGS\" \n```\n\nThis first preprocesses the raw BGL logs and extracts their TF-ILF features, then trains and saves both PULearning with a RandomForest for anomaly detection and an SVM for multi-class anomaly classification.\n\nFor running inference, pass the same `--logs_type` along with the id of the training experiment whose models you want to load:\n\n```\npython -m LogClass.logclass --logs_type \"bgl\" --id {experiment_id}\n```\n\nIn this case it loads the learned feature extraction, loads both trained models, and runs inference on the whole logs.\n\n#### Binary training/inference\n\n`train_binary.py` and `run_binary.py` simply separate the binary part of `logclass.py` into two modules: one that trains both the feature extraction and the models, and another that loads these and runs inference.\n\n\n\n### Citing\n\nIf you find LogClass useful for your research, please consider citing the paper:\n\n```\n@ARTICLE{9339940,\nauthor={Meng, Weibin and Liu, Ying and Zhang, Shenglin and Zaiter, Federico and Zhang, Yuzhe and Huang, Yuheng and Yu, Zhaoyang and Zhang, Yuzhi and Song, Lei and Zhang, Ming and Pei, Dan},\njournal={IEEE Transactions on Network and Service Management},\ntitle={LogClass: Anomalous Log Identification and Classification with Partial Labels},\nyear={2021},\ndoi={10.1109/TNSM.2021.3055425}\n}\n```\n\nThis code was completed by [@Weibin Meng](https://github.com/WeibinMeng) and [@Federico Zaiter](https://github.com/federicozaiter).\n"
  },
  {
    "path": "__init__.py",
    "content": "__all__ = [\"utils\", \"logclass\"]\n\nfrom .preprocess import *\nfrom .feature_engineering import *\nfrom .models import *\nfrom .reporting import *\n"
  },
  {
    "path": "compare_pu.py",
    "content": "from sklearn.model_selection import StratifiedKFold\nfrom .utils import (\n    file_handling,\n    TestingParameters,\n    print_params,\n    save_results,\n)\nfrom .preprocess import registry as preprocess_registry\nfrom .preprocess.utils import load_logs\nfrom .feature_engineering.utils import (\n    binary_train_gtruth,\n    extract_features,\n)\nfrom .models import binary_registry as binary_classifier_registry\nfrom .reporting import bb_registry as black_box_report_registry\nfrom .init_params import init_main_args, parse_main_args\nimport numpy as np\n\n\ndef init_args():\n    \"\"\"Init command line args used for configuration.\"\"\"\n\n    parser = init_main_args()\n    parser.add_argument(\n        \"--ratio\",\n        metavar=\"ratio\",\n        type=int,\n        nargs=1,\n        default=[8],\n        help=\"ratio\",\n    )\n    parser.add_argument(\n        \"--top_percentage\",\n        metavar=\"top_percentage\",\n        type=int,\n        nargs=1,\n        default=[11],\n        help=\"top_percentage\",\n    )\n    parser.add_argument(\n        \"--step\",\n        metavar=\"step\",\n        type=int,\n        nargs=1,\n        default=[2],\n        help=\"step\",\n    )\n    return parser.parse_args()\n\n\ndef parse_args(args):\n    \"\"\"Parse provided args for runtime configuration.\"\"\"\n    params = parse_main_args(args)\n    additional_params = {\n                            \"ratio\": args.ratio[0],\n                            \"top_percentage\": args.top_percentage[0],\n                            \"step\": args.step[0],\n                            \"train\": True,\n                        }\n    params.update(additional_params)\n    return params\n\n\ndef force_ratio(params, x_data, y_data):\n    \"\"\"Force a ratio between anomalous and healthy logs\"\"\"\n    ratio = params['ratio']\n    if ratio > 0:\n        anomalous = np.where(y_data == 1.0)[0]\n        healthy = np.where(y_data == -1.0)[0]\n        if len(anomalous) * ratio <= len(healthy):\n            keep_anomalous = len(anomalous)\n            keep_healthy = keep_anomalous * ratio\n        else:\n            keep_anomalous = len(healthy) // ratio\n            keep_healthy = len(healthy)\n        np.random.seed(10)\n        permut = np.random.permutation(len(healthy))\n        keep = permut[:keep_healthy]\n        healthy = healthy[keep]\n        permut = np.random.permutation(len(anomalous))\n        keep = permut[:keep_anomalous]\n        anomalous = anomalous[keep]\n        result = sorted(np.concatenate((anomalous, healthy)))\n        y_data = y_data[result]\n        x_data = x_data[result]\n        return x_data, y_data\n\n\ndef init_results(params):\n    results = {\n        'exp_name': [],\n        'logs_type': [],\n        'percentage': [],\n        'pu_f1': [],\n        f\"{params['binary_classifier']}_f1\": [],\n    }\n    return results\n\n\ndef add_result(results, params, percentage, pu_acc, b_clf_acc):\n    results['exp_name'].append(params['id'])\n    results['logs_type'].append(params['logs_type'])\n    results['percentage'].append(percentage)\n    results['pu_f1'].append(pu_acc)\n    results[f\"{params['binary_classifier']}_f1\"].append(b_clf_acc)\n\n\ndef run_test(params, x_data, y_data):\n    results = init_results(params)\n    # Binary training features\n    y_data = binary_train_gtruth(y_data)\n    x_data, y_data = force_ratio(params, x_data, y_data)\n    print(\"total logs\", len(y_data))\n    print(len(np.where(y_data == -1.0)[0]), \" are unlabeled\")\n    
    print(len(np.where(y_data == 1.0)[0]), \" are anomalous\")\n    # KFold Cross Validation\n    kfold = StratifiedKFold(n_splits=params['kfold']).split(x_data, y_data)\n    for train_index, test_index in kfold:\n        x_train, x_test = x_data[train_index], x_data[test_index]\n        y_train, y_test = y_data[train_index], y_data[test_index]\n        x_train, _ = extract_features(x_train, params)\n        with TestingParameters(params):\n            x_test, _ = extract_features(x_test, params)\n        np.random.seed(5)\n        permut = np.random.permutation(len(y_train))\n        x_train = x_train[permut]\n        y_train = y_train[permut]\n        top_percentage = params['top_percentage']\n        step = params['step']\n        # Relabeling anomalous logs as unlabeled to test PULearning robustness\n        for i in range(0, top_percentage, step):\n            y_train_pu = np.copy(y_train)\n            if i > 0:\n                n_unlabeled = len(np.where(y_train_pu == -1.0)[0])\n                # Relabel `sacrifice_size` anomalous logs so that they make up\n                # i% of the resulting unlabeled set:\n                # sacrifice_size / (n_unlabeled + sacrifice_size) == i / 100\n                sacrifice_size = (i*n_unlabeled)//(100 - i)\n                print(i, n_unlabeled, sacrifice_size)\n                pos = np.where(y_train == 1.0)[0]\n                np.random.shuffle(pos)\n                sacrifice = pos[: sacrifice_size]\n                y_train_pu[sacrifice] = -1.0\n\n            print(f\"{i}% of anomalous logs in unlabeled logs:\")\n            print(len(np.where(y_train_pu == -1.0)[0]), \" are unlabeled\")\n            print(len(np.where(y_train_pu == 1.0)[0]), \" are anomalous\")\n            # Binary PULearning with RF\n            pu_getter =\\\n                binary_classifier_registry.get_binary_model(\"pu_learning\")\n            binary_clf = pu_getter(params)\n            binary_clf.fit(x_train, y_train_pu)\n            y_pred_pu = binary_clf.predict(x_test)\n            get_accuracy = black_box_report_registry.get_bb_report('acc')\n            pu_acc = get_accuracy(y_test, y_pred_pu)\n            # Comparing given Binary Classifier with PU Learning\n            comparison_clf_getter =\\\n                binary_classifier_registry.get_binary_model(\n                    params['binary_classifier'])\n            binary_clf = comparison_clf_getter(params)\n            binary_clf.fit(x_train, y_train_pu)\n            y_pred = binary_clf.predict(x_test)\n            b_clf_acc = get_accuracy(y_test, y_pred)\n            print(f\"PU Acc: {pu_acc}\\n{params['binary_classifier']}\"\n                  + f\" Acc: {b_clf_acc}\")\n\n            add_result(\n                results,\n                params,\n                i,\n                pu_acc,\n                b_clf_acc\n            )\n\n    save_results(results, params)\n\n\ndef main():\n    # Init params\n    params = parse_args(init_args())\n    print_params(params)\n    file_handling(params)\n    # Preprocess raw logs if provided\n    if \"raw_logs\" in params:\n        preprocess = preprocess_registry.get_preprocessor(params['logs_type'])\n        preprocess(params)\n    # Load preprocessed logs from file\n    print('Loading logs')\n    x_data, y_data, _ = load_logs(params)\n    run_test(params, x_data, y_data)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "decorators.py",
    "content": "import functools\n\n\n# Borrowed from https://realpython.com/primer-on-python-decorators/\ndef debug(func):\n    \"\"\"Print the function signature and return value\"\"\"\n    @functools.wraps(func)\n    def wrapper_debug(*args, **kwargs):\n        args_repr = [repr(a) for a in args]                      # 1\n        kwargs_repr = [f\"{k}={v!r}\" for k, v in kwargs.items()]  # 2\n        signature = \", \".join(args_repr + kwargs_repr)           # 3\n        print(f\"Calling {func.__name__}({signature})\")\n        value = func(*args, **kwargs)\n        print(f\"{func.__name__!r} returned {value!r}\")           # 4\n        return value\n    return wrapper_debug\n\n\ndef print_step(func):\n    \"\"\"Print the function signature and return value\"\"\"\n    @functools.wraps(func)\n    def wrapper_print_name(*args, **kwargs):\n        print(f\"Calling {func.__qualname__}\")\n        value = func(*args, **kwargs)\n        return value\n    return wrapper_print_name\n"
  },
  {
    "path": "feature_engineering/__init__.py",
    "content": "__all__ = [\"length\", \"tf_idf\", \"tf_ilf\", \"tf\"]"
  },
  {
    "path": "feature_engineering/length.py",
    "content": "from .registry import register\nimport numpy as np\n\n\n@register(\"length\")\ndef create_length_feature(params, input_vector, **kwargs):\n    \"\"\"\n        Returns an array of lengths of each tokenized log message from the input.\n\n        Parameters\n        ----------\n        params : dict of experiment parameters.\n        input_vector : numpy Array vector of word indexes from each log message line.\n\n        Returns\n        -------\n        numpy array of lengths of each tokenized log message from the input\n        with shape (number_of_logs, N).\n    \"\"\"\n    length = np.vectorize(len)\n    length_feature = length(input_vector)\n    length_feature = length_feature.reshape(-1, 1)\n    return length_feature\n"
  },
  {
    "path": "feature_engineering/registry.py",
    "content": "\"\"\"Basic registry for logs vector feature engineering. These take the\nlog messages as input and extract and return a feature to be appended to\nthe feature vector.\"\"\"\n\n_FEATURE_EXTRACTORS = dict()\n\n\ndef register(name):\n    \"\"\"Registers a new log message feature extraction function under the\n    given name.\"\"\"\n\n    def add_to_dict(func):\n        _FEATURE_EXTRACTORS[name] = func\n        return func\n\n    return add_to_dict\n\n\ndef get_feature_extractor(feature):\n    \"\"\"Fetches the feature extraction function associated with the given\n    raw logs\"\"\"\n    return _FEATURE_EXTRACTORS[feature]\n"
  },
  {
    "path": "feature_engineering/tf.py",
    "content": "from .registry import register\nfrom .vectorizer import get_tf\nimport numpy as np\nfrom .utils import save_feature_dict, load_feature_dict\n\n\ndef create_tf_vector(input_vector, tf_dict, vocabulary):\n    tf_vector = []\n    # Creating the idf/ilf vector for each log message\n    for line in input_vector:\n        cur_tf_vector = np.zeros(len(vocabulary))\n        for token_index in line:\n            cur_tf_vector[token_index] = len(tf_dict[token_index])\n        tf_vector.append(cur_tf_vector)\n\n    tf_vector = np.array(tf_vector)\n    return tf_vector\n\n\n@register(\"tf\")\ndef create_term_count_feature(params, input_vector, **kwargs):\n    \"\"\"\n        Returns an array of the counts of each word per log message.\n    \"\"\"\n    if params['train']:\n        tf_dict = get_tf(input_vector)\n        save_feature_dict(params, tf_dict, \"tf\")\n    else:\n        tf_dict = load_feature_dict(params, \"tf\")\n\n    tf_features =\\\n        create_tf_vector(input_vector, tf_dict, kwargs['vocabulary'])\n\n    return tf_features\n"
  },
  {
    "path": "feature_engineering/tf_idf.py",
    "content": "from .registry import register\nfrom .vectorizer import (\n    get_tf,\n    calculate_idf,\n    calculate_tf_invf_train,\n    create_invf_vector,\n)\nfrom .utils import save_feature_dict, load_feature_dict\n\n\n@register(\"tfidf\")\ndef create_tfidf_feature(params, train_vector, **kwargs):\n    \"\"\"\n        Returns the tf-idf matrix of features.\n    \"\"\"\n    if params['train']:\n        invf_dict = calculate_tf_invf_train(\n            train_vector,\n            get_f=get_tf,\n            calc_invf=calculate_idf\n            )\n        save_feature_dict(params, invf_dict, \"tfidf\")\n    else:\n        invf_dict = load_feature_dict(params, \"tfidf\")\n\n    features = create_invf_vector(\n        train_vector, invf_dict, kwargs['vocabulary'])\n    return features\n"
  },
  {
    "path": "feature_engineering/tf_ilf.py",
    "content": "from .registry import register\nfrom .vectorizer import (\n    get_lf,\n    calculate_ilf,\n    calculate_tf_invf_train,\n    create_invf_vector,\n)\nfrom .utils import save_feature_dict, load_feature_dict\n\n\n@register(\"tfilf\")\ndef create_tfilf_feature(params, train_vector, **kwargs):\n    \"\"\"\n        Returns the tf-ilf matrix of features.\n    \"\"\"\n    if params['train']:\n        invf_dict = calculate_tf_invf_train(\n            train_vector,\n            get_f=get_lf,\n            calc_invf=calculate_ilf\n            )\n        save_feature_dict(params, invf_dict, \"tfilf\")\n    else:\n        invf_dict = load_feature_dict(params, \"tfilf\")\n\n    features = create_invf_vector(\n        train_vector, invf_dict, kwargs['vocabulary'])\n    return features\n"
  },
  {
    "path": "feature_engineering/utils.py",
    "content": "import os\nimport pickle\nimport numpy as np\nfrom .vectorizer import log_to_vector, build_vocabulary\nfrom . import registry as feature_registry\nfrom ..decorators import print_step\n\n\ndef load_feature_dict(params, name):\n    dict_file = os.path.join(params['features_dir'], f\"{name}.pkl\")\n    with open(dict_file, \"rb\") as fp:\n        feat_dict = pickle.load(fp)\n    return feat_dict\n\n\ndef save_feature_dict(params, feat_dict, name):\n    dict_file = os.path.join(params['features_dir'], f\"{name}.pkl\")\n    with open(dict_file, \"wb\") as fp:\n        pickle.dump(feat_dict, fp)\n\n\ndef binary_train_gtruth(y):\n    return np.where(y == -1.0, -1.0, 1.0)\n\n\ndef multi_features(x, y):\n    anomalous = (y != -1)\n    x_multi, y_multi = x[anomalous], y[anomalous]\n    return x_multi, y_multi\n\n\n@print_step\ndef get_features_vector(log_vector, vocabulary, params):\n    \"\"\" Extracts all specified features from the vectorized logs.\n\n    For each feature specified in params it gets the feature function from the\n    feature registry and applies to the data.\n    A numpy array vector of shape (number_of_logs, N) is expected for each to\n    be concatenated along the second axis.\n\n    Parameters\n    ----------\n    log_vector : numpy Array vector of word indexes from each log message line.\n    vocabulary : dict mapping a word to an index.\n    params : dict of experiment parameters.\n\n    Returns\n    -------\n    x_features : numpy ndArray of all specified features.\n\n    \"\"\"\n    feature_vectors = []\n    for feature in params['features']:\n        extract_feature = feature_registry.get_feature_extractor(feature)\n        feature_vector = extract_feature(\n            params, log_vector, vocabulary=vocabulary)\n        feature_vectors.append(feature_vector)\n    X = np.hstack(feature_vectors)\n    return X\n\n\n@print_step\ndef extract_features(x, params):\n    \"\"\" Gets vocabulary and specified features from the preprocessed logs.\n\n    Creates a vocabulary from the preprocessed logs to vectorize each message.\n    Extracts all specified features in params from the logs vector and\n    vocabulary, then returns them both.\n\n    Parameters\n    ----------\n    x : list of preprocessed logs. One log message per line.\n    params : dict of experiment parameters.\n\n    Returns\n    -------\n    x_features : numpy ndArray of all specified features.\n    vocabulary : dict mapping a word to an index.\n\n    \"\"\"\n    # Build Vocabulary\n    if params['train']:\n        vocabulary = build_vocabulary(x)\n        save_feature_dict(params, vocabulary, \"vocab\")\n    else:\n        vocabulary = load_feature_dict(params, \"vocab\")\n    # Feature Engineering\n    x_vector = log_to_vector(x, vocabulary)\n    x_features = get_features_vector(x_vector, vocabulary, params)\n    return x_features, vocabulary\n"
  },
  {
    "path": "feature_engineering/vectorizer.py",
    "content": "import numpy as np\nfrom ..decorators import print_step\nfrom collections import defaultdict, Counter\n\n\ndef get_ngrams(n, line):\n    line = line.strip().split()\n    cur_len = len(line)\n    ngrams_list = []\n    if cur_len == 0:\n        # Token list is empty\n        pass\n    elif cur_len < n:\n        # Token list fits in one ngram\n        ngrams_list.append(\" \".join(line))\n    else:\n        # Token list spans multiple ngrams\n        loop_num = cur_len - n + 1\n        for i in range(loop_num):\n            cur_gram = \" \".join(line[i: i + n])\n            ngrams_list.append(cur_gram)\n    return ngrams_list\n\n\ndef tokenize(line):\n    return line.strip().split()\n\n\n@print_step\ndef build_vocabulary(inputData):\n    \"\"\" Divides log into tokens and creates vocabulary.\n\n    Parameter\n    ---------\n    inputData: list of log message lines\n\n    Returns\n    -------\n    vocabulary : word to index dict\n\n    \"\"\"\n    vocabulary = {}\n    for line in inputData:\n        token_list = tokenize(line)\n        for token in token_list:\n            if token not in vocabulary:\n                vocabulary[token] = len(vocabulary)\n    return vocabulary\n\n\n@print_step\ndef log_to_vector(inputData, vocabulary):\n    \"\"\" Vectorizes each log message using a dict of words to index.\n\n    Parameter\n    ---------\n    inputData: list of log message lines.\n    vocabulary : word to index dict.\n\n    Returns\n    -------\n    numpy Array vector of word indexes from each log message line.\n\n    \"\"\"\n    result = []\n    for line in inputData:\n        temp = []\n        token_list = tokenize(line)\n        if token_list:\n            for token in token_list:\n                if token not in vocabulary:\n                    continue\n                else:\n                    temp.append(vocabulary[token])\n        result.append(temp)\n    return np.array(result)\n\n\ndef setTrainDataForILF(x, y):\n    x_res, indices = np.unique(x, return_index=True)\n    y_res = y[indices]\n    return x_res, y_res\n\n\ndef calculate_inv_freq(total, num):\n    return np.log(float(total) / float(num + 0.01))\n\n\ndef get_max_line(inputVector):\n    return len(max(inputVector, key=len))\n\n\ndef get_tf(inputVector):\n    token_index_dict = defaultdict(set)\n    # Counting the number of logs the word appears in\n    for index, line in enumerate(inputVector):\n        for token in line:\n            token_index_dict[token].add(index)\n    return token_index_dict\n\n\ndef get_lf(inputVector):\n    token_index_ilf_dict = defaultdict(set)\n    for line in inputVector:\n        for location, token in enumerate(line):\n            token_index_ilf_dict[token].add(location)\n    return token_index_ilf_dict\n\n\ndef calculate_idf(token_index_dict, inputVector):\n    idf_dict = {}\n    total_log_num = len(inputVector)\n    for token in token_index_dict:\n        idf_dict[token] = calculate_inv_freq(total_log_num,\n                                             len(token_index_dict[token]))\n    return idf_dict\n\n\ndef calculate_ilf(token_index_dict, inputVector):\n    ilf_dict = {}\n    max_length = get_max_line(inputVector)\n    # calculating ilf for each token\n    for token in token_index_dict:\n        ilf_dict[token] = calculate_inv_freq(max_length,\n                                             len(token_index_dict[token]))\n    return ilf_dict\n\n\ndef create_invf_vector(inputVector, invf_dict, vocabulary):\n    tfinvf = []\n    # Creating the idf/ilf vector for each log 
message\n    for line in inputVector:\n        cur_tfinvf = np.zeros(len(vocabulary))\n        count_dict = Counter(line)\n        for token_index in line:\n            cur_tfinvf[token_index] = (\n                float(count_dict[token_index]) * invf_dict[token_index]\n            )\n        tfinvf.append(cur_tfinvf)\n    tfinvf = np.array(tfinvf)\n    return tfinvf\n\n\ndef normalize_tfinvf(tfinvf):\n    return 2.*(tfinvf - np.min(tfinvf))/np.ptp(tfinvf)-1\n\n\ndef calculate_tf_invf_train(\n    inputVector, get_f=get_tf, calc_invf=calculate_idf\n):\n    token_index_dict = get_f(inputVector)\n    invf_dict = calc_invf(token_index_dict, inputVector)\n    return invf_dict\n"
  },
  {
    "path": "init_params.py",
    "content": "import argparse\nimport os\nimport sys\nimport warnings\nfrom uuid import uuid4\n\n\ndef init_main_args():\n    \"\"\"Init command line args used for configuration.\"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"Runs experiment using LogClass Framework\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--raw_logs\",\n        metavar=\"raw_logs\",\n        type=str,\n        nargs=1,\n        help=\"input raw logs file path\",\n    )\n    base_dir_default = os.path.join(\n        os.path.dirname(os.path.realpath(__file__)), \"output\"\n    )\n    parser.add_argument(\n        \"--base_dir\",\n        metavar=\"base_dir\",\n        type=str,\n        nargs=1,\n        default=[base_dir_default],\n        help=\"base output directory for pipeline output files\",\n    )\n    parser.add_argument(\n        \"--logs\",\n        metavar=\"logs\",\n        type=str,\n        nargs=1,\n        help=\"input logs file path and output for raw logs preprocessing\",\n    )\n    parser.add_argument(\n        \"--models_dir\",\n        metavar=\"models_dir\",\n        type=str,\n        nargs=1,\n        help=\"trained models input/output directory path\",\n    )\n    parser.add_argument(\n        \"--features_dir\",\n        metavar=\"features_dir\",\n        type=str,\n        nargs=1,\n        help=\"trained features_dir input/output directory path\",\n    )\n    parser.add_argument(\n        \"--logs_type\",\n        metavar=\"logs_type\",\n        type=str,\n        nargs=1,\n        default=[\"open_Apache\"],\n        choices=[\n            \"bgl\",\n            \"open_Apache\",\n            \"open_bgl\",\n            \"open_hadoop\",\n            \"open_hdfs\",\n            \"open_hpc\",\n            \"open_proxifier\",\n            \"open_zookeeper\",\n            ],\n        help=\"Input type of logs.\",\n    )\n    parser.add_argument(\n        \"--kfold\",\n        metavar=\"kfold\",\n        type=int,\n        nargs=1,\n        help=\"kfold crossvalidation\",\n    )\n    parser.add_argument(\n        \"--healthy_label\",\n        metavar='healthy_label',\n        type=str,\n        nargs=1,\n        default=[\"unlabeled\"],\n        help=\"the labels of unlabeled logs\",\n    )\n    parser.add_argument(\n        \"--features\",\n        metavar=\"features\",\n        type=str,\n        nargs='+',\n        default=[\"tfilf\"],\n        choices=[\"tfidf\", \"tfilf\", \"length\", \"tf\"],\n        help=\"Features to be extracted from the logs messages.\",\n    )\n    parser.add_argument(\n        \"--report\",\n        metavar=\"report\",\n        type=str,\n        nargs='+',\n        default=[\"confusion_matrix\"],\n        choices=[\"confusion_matrix\",\n                 \"acc\",\n                 \"multi_acc\",\n                 \"top_k_svm\",\n                 \"micro\",\n                 \"macro\"\n                 ],\n        help=\"Reports to be generated from the model and its predictions.\",\n    )\n    parser.add_argument(\n        \"--binary_classifier\",\n        metavar=\"binary_classifier\",\n        type=str,\n        nargs=1,\n        default=[\"pu_learning\"],\n        choices=[\"pu_learning\", \"regular\"],\n        help=\"Binary classifier to be used as anomaly detector.\",\n    )\n    parser.add_argument(\n        \"--multi_classifier\",\n        metavar=\"multi_classifier\",\n        type=str,\n        nargs=1,\n        default=[\"svm\"],\n        choices=[\"svm\"],\n        
help=\"Multi-clas classifier to classify anomalies.\",\n    )\n    parser.add_argument(\n        \"--train\",\n        action=\"store_true\",\n        default=False,\n        help=\"If set, logclass will train on the given data. Otherwise\"\n             + \"it will run inference on it.\",\n    )\n    parser.add_argument(\n        \"--force\",\n        action=\"store_true\",\n        default=False,\n        help=\"Force training overwriting previous output with same id.\",\n    )\n    parser.add_argument(\n        \"--id\",\n        metavar=\"id\",\n        type=str,\n        nargs=1,\n        help=\"Experiment id. Automatically generated if not specified.\",\n    )\n    parser.add_argument(\n        \"--swap\",\n        action=\"store_true\",\n        default=False,\n        help=\"Swap testing/training data in kfold cross validation.\",\n    )\n\n    return parser\n\n\ndef parse_main_args(args):\n    \"\"\"Parse provided args for runtime configuration.\"\"\"\n    params = {\n        \"report\": args.report,\n        \"train\": args.train,\n        \"force\": args.force,\n        \"base_dir\": args.base_dir[0],\n        \"logs_type\": args.logs_type[0],\n        \"healthy_label\": args.healthy_label[0],\n        \"features\": args.features,\n        \"binary_classifier\": args.binary_classifier[0],\n        \"multi_classifier\": args.multi_classifier[0],\n        \"swap\": args.swap,\n    }\n    if args.raw_logs:\n        params[\"raw_logs\"] = os.path.normpath(args.raw_logs[0])\n    if args.kfold:\n        params[\"kfold\"] = args.kfold[0]\n    if args.logs:\n        params['logs'] = os.path.normpath(args.logs[0])\n    else:\n        params['logs'] = os.path.join(\n            params['base_dir'],\n            \"preprocessed_logs\",\n            f\"{params['logs_type']}.txt\"\n        )\n    if args.id:\n        params['id'] = args.id[0]\n    else:\n        if not params[\"train\"]:\n            warnings.warn(\n                \"--id parameter is not set when running inference.\"\n                \"If --train is not set, you might want to provide the\"\n                \"experiment id of your best training experiment run,\"\n                \" E.g. `--id 2310136305`\"\n                )\n        params['id'] = str(uuid4().time_low)\n\n    print(f\"\\nExperiment ID: {params['id']}\")\n    # Creating experiments results folder with the format\n    # {experiment_module_name}_{logs_type}_{id}\n    experiment_name = os.path.basename(sys.argv[0]).split('.')[0]\n    params['id_dir'] = os.path.join(\n            params['base_dir'],\n            '_'.join((\n                experiment_name, params['logs_type'], params['id']\n                ))\n        )\n    if args.models_dir:\n        params['models_dir'] = os.path.normpath(args.models_dir[0])\n    else:\n        params['models_dir'] = os.path.join(\n            params['id_dir'],\n            \"models\",\n        )\n    if args.features_dir:\n        params['features_dir'] = os.path.normpath(args.features_dir[0])\n    else:\n        params['features_dir'] = os.path.join(\n            params['id_dir'],\n            \"features\",\n        )\n    params['results_dir'] = os.path.join(params['id_dir'], \"results\")\n\n    return params\n"
  },
  {
    "path": "logclass.py",
    "content": "from sklearn.model_selection import StratifiedKFold\nfrom .utils import (\n    save_params,\n    load_params,\n    file_handling,\n    TestingParameters,\n    print_params,\n)\nfrom .preprocess import registry as preprocess_registry\nfrom .preprocess.utils import load_logs\nfrom .feature_engineering.utils import (\n    binary_train_gtruth,\n    multi_features,\n    extract_features,\n)\nfrom tqdm import tqdm\nfrom .models import binary_registry as binary_classifier_registry\nfrom .models import multi_registry as multi_classifier_registry\nfrom .reporting import bb_registry as black_box_report_registry\nfrom .reporting import wb_registry as white_box_report_registry\nfrom .init_params import init_main_args, parse_main_args\n\n\ndef init_args():\n    \"\"\"Init command line args used for configuration.\"\"\"\n\n    parser = init_main_args()\n    return parser.parse_args()\n\n\ndef parse_args(args):\n    \"\"\"Parse provided args for runtime configuration.\"\"\"\n    params = parse_main_args(args)\n    return params\n\n\ndef inference(params, x_data, y_data, target_names):\n    # Inference\n    # Feature engineering\n    x_test, vocabulary = extract_features(x_data, params)\n    # Binary training features\n    y_test = binary_train_gtruth(y_data)\n    # Binary PU estimator with RF\n    # Load Trained PU Estimator\n    binary_clf_getter =\\\n        binary_classifier_registry.get_binary_model(\n            params['binary_classifier'])\n    binary_clf = binary_clf_getter(params)\n    binary_clf.load()\n    # Anomaly detection\n    y_pred_pu = binary_clf.predict(x_test)\n    get_accuracy = black_box_report_registry.get_bb_report('acc')\n    binary_acc = get_accuracy(y_test, y_pred_pu)\n    # MultiClass remove healthy logs\n    x_infer_multi, y_infer_multi = multi_features(x_test, y_data)\n    # Load MultiClass\n    multi_classifier_getter =\\\n        multi_classifier_registry.get_multi_model(params['multi_classifier'])\n    multi_classifier = multi_classifier_getter(params)\n    multi_classifier.load()\n    # Anomaly Classification\n    pred = multi_classifier.predict(x_infer_multi)\n    get_multi_acc = black_box_report_registry.get_bb_report('multi_acc')\n    score = get_multi_acc(y_infer_multi, pred)\n\n    print(binary_acc, score)\n    for report in params['report']:\n        try:\n            get_bb_report = black_box_report_registry.get_bb_report(report)\n            result = get_bb_report(y_test, y_pred_pu)\n        except Exception:\n            pass\n        else:\n            print(f'Binary classification {report} report:')\n            print(result)\n\n        try:\n            get_bb_report = black_box_report_registry.get_bb_report(report)\n            result = get_bb_report(y_infer_multi, pred)\n        except Exception:\n            pass\n        else:\n            print(f'Multi classification {report} report:')\n            print(result)\n\n        try:\n            get_wb_report = white_box_report_registry.get_wb_report(report)\n            result =\\\n                get_wb_report(params, binary_clf.model, vocabulary,\n                              target_names=target_names, top_features=5)\n        except Exception:\n            pass\n        else:\n            print(f'Multi classification {report} report:')\n            print(result)\n\n\ndef train(params, x_data, y_data, target_names):\n    # KFold Cross Validation\n    kfold = StratifiedKFold(n_splits=params['kfold']).split(x_data, y_data)\n    best_pu_fs = 0.\n    best_multi = 0.\n    for train_index, 
    for train_index, test_index in tqdm(kfold):\n        x_train, x_test = x_data[train_index], x_data[test_index]\n        y_train, y_test = y_data[train_index], y_data[test_index]\n        x_train, vocabulary = extract_features(x_train, params)\n        with TestingParameters(params):\n            x_test, _ = extract_features(x_test, params)\n        # Binary training features\n        y_test_pu = binary_train_gtruth(y_test)\n        y_train_pu = binary_train_gtruth(y_train)\n        # Binary PULearning with RF\n        binary_clf_getter =\\\n            binary_classifier_registry.get_binary_model(\n                params['binary_classifier'])\n        binary_clf = binary_clf_getter(params)\n        binary_clf.fit(x_train, y_train_pu)\n        y_pred_pu = binary_clf.predict(x_test)\n        get_accuracy = black_box_report_registry.get_bb_report('acc')\n        binary_acc = get_accuracy(y_test_pu, y_pred_pu)\n        # Multi-class training features\n        x_train_multi, y_train_multi =\\\n            multi_features(x_train, y_train)\n        x_test_multi, y_test_multi = multi_features(x_test, y_test)\n        # MultiClass\n        multi_classifier_getter =\\\n            multi_classifier_registry.get_multi_model(params['multi_classifier'])\n        multi_classifier = multi_classifier_getter(params)\n        multi_classifier.fit(x_train_multi, y_train_multi)\n        pred = multi_classifier.predict(x_test_multi)\n        get_multi_acc = black_box_report_registry.get_bb_report('multi_acc')\n        score = get_multi_acc(y_test_multi, pred)\n        better_results = (\n            binary_acc > best_pu_fs\n            or (binary_acc == best_pu_fs and score > best_multi)\n        )\n\n        if better_results:\n            if binary_acc > best_pu_fs:\n                best_pu_fs = binary_acc\n            save_params(params)\n            if score > best_multi:\n                best_multi = score\n            print(binary_acc, score)\n\n        # The try/except blocks are necessary since not every report\n        # applies to every model or prediction type\n        for report in params['report']:\n            try:\n                get_bb_report = black_box_report_registry.get_bb_report(report)\n                result = get_bb_report(y_test_pu, y_pred_pu)\n            except Exception:\n                pass\n            else:\n                print(f'Binary classification {report} report:')\n                print(result)\n\n            try:\n                get_bb_report = black_box_report_registry.get_bb_report(report)\n                result = get_bb_report(y_test_multi, pred)\n            except Exception:\n                pass\n            else:\n                print(f'Multi classification {report} report:')\n                print(result)\n\n            try:\n                get_wb_report = white_box_report_registry.get_wb_report(report)\n                result =\\\n                    get_wb_report(params, multi_classifier.model, vocabulary,\n                                  target_names=target_names, top_features=5)\n            except Exception:\n                pass\n            else:\n                print(f'Multi classification {report} report:')\n                print(result)\n\n\ndef main():\n    # Init params\n    params = parse_args(init_args())\n    if not params['train']:\n        load_params(params)\n    print_params(params)\n    file_handling(params)  # TODO: handle the case when the experiment ID already exists\n    # Preprocess raw logs if provided\n    if 'raw_logs' in params:\n        preprocess = preprocess_registry.get_preprocessor(params['logs_type'])\n        preprocess(params)\n    # Load preprocessed logs from file\n    x_data, y_data, target_names = load_logs(params)\n    if params['train']:\n        train(params, x_data, y_data, target_names)\n    else:\n        inference(params, x_data, y_data, target_names)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "models/__init__.py",
    "content": "__all__ = [\"regular\", \"svm\", \"pu_learning\"]"
  },
  {
    "path": "models/base_model.py",
    "content": "from abc import ABC, abstractmethod\nfrom time import time\nfrom ..decorators import print_step\n\n\nclass BaseModel(ABC):\n    \"\"\" Abstract class used to wrap models and add further functionality.\n\n        Attributes\n        ----------\n        model : model that implements fit and predict functions as sklearn\n        ML models do.\n        params : dict of experiment parameters.\n        name : str of the original model class name.\n        train_time : time it took to run fit in seconds.\n        run_time : time it took to run predict in seconds.\n\n        Methods\n        -------\n        save(self, **kwargs)\n            Abstract method for the subclass to implement how the model is\n            saved. Should use the experiment id as reference.\n        load(self, **kwargs)\n            Abstract method for the subclass to implement how it's meant to be\n            loaded. Should correspond to how the save method saves the model.\n        predict(self, X, **kwargs)\n            Wraps original model predict and times its running time.\n        fit(self, X, Y, **kwargs)\n            Wraps original model fit, times fit running time and saves the model.\n\n    \"\"\"\n    def __init__(self, model, params):\n        self.model = model\n        self.params = params\n        self.name = type(model).__name__\n        self.train_time = None\n        self.run_time = None\n\n    @abstractmethod\n    def save(self, **kwargs):\n        \"\"\"\n            Abstract method for the subclass to implement how the model is\n            saved. Should use the experiment id as reference.\n        \"\"\"\n        pass\n\n    @abstractmethod\n    def load(self, **kwargs):\n        \"\"\"\n            Abstract method for the subclass to implement how it's meant to be\n            loaded. Should correspond to how the save method saves the model.\n        \"\"\"\n        pass\n\n    @print_step\n    def predict(self, X, **kwargs):\n        \"\"\"\n            Wraps original model predict and times its running time.\n        \"\"\"\n        t0 = time()\n        pred = self.model.predict(X, **kwargs)\n        t1 = time()\n        lapse = t1 - t0\n        self.run_time = lapse\n        print(f\"{self.name} took {lapse}s to run inference.\")\n        return pred\n\n    @print_step\n    def fit(self, X, Y, **kwargs):\n        \"\"\"\n            Wraps original model fit, times fit running time and saves the model.\n        \"\"\"\n        t0 = time()\n        self.model.fit(X, Y, **kwargs)\n        t1 = time()\n        lapse = t1 - t0\n        self.train_time = lapse\n        print(f\"{self.name} took {lapse}s to train.\")\n        self.save()\n"
  },
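  {
    "path": "models/example_wrapper.py",
    "content": "# Illustrative sketch, not part of the original LogClass code: a minimal\n# BaseModel subclass showing the save/load contract that the shipped\n# wrappers (regular.py, svm.py, pu_learning.py) implement. The class name,\n# pickle file name and choice of MultinomialNB are assumptions.\nimport os\nimport pickle\n\nfrom sklearn.naive_bayes import MultinomialNB\n\nfrom .base_model import BaseModel\n\n\nclass NaiveBayesWrapper(BaseModel):\n    def save(self, **kwargs):\n        # Persist the fitted estimator under the experiment's models_dir,\n        # keyed by a fixed file name, as the other wrappers do.\n        path = os.path.join(self.params['models_dir'], \"naive_bayes.pkl\")\n        with open(path, 'wb') as fp:\n            pickle.dump(self.model, fp)\n\n    def load(self, **kwargs):\n        # Mirror save(): read the pickled estimator back into self.model.\n        path = os.path.join(self.params['models_dir'], \"naive_bayes.pkl\")\n        with open(path, 'rb') as fp:\n            self.model = pickle.load(fp)\n\n\n# Usage: NaiveBayesWrapper(MultinomialNB(), params), then fit()/predict();\n# BaseModel times both calls and saves the model automatically after fit().\n"
  },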
  {
    "path": "models/binary_registry.py",
    "content": "\"\"\"Registry for binary models to be used for anomaly detection.\"\"\"\n\n_BINARY_MODELS = dict()\n\n\ndef register(name):\n    \"\"\"Registers a new binary classification anomaly detection model.\"\"\"\n\n    def add_to_dict(func):\n        _BINARY_MODELS[name] = func\n        return func\n\n    return add_to_dict\n\n\ndef get_binary_model(model):\n    \"\"\"Fetches the binary classification anomaly detection model.\"\"\"\n    return _BINARY_MODELS[model]\n"
  },
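  {
    "path": "models/example_registration.py",
    "content": "# Illustrative sketch, not part of the original LogClass code: the registry\n# is a plain name -> factory mapping, so adding a binary model is one\n# decorated factory function. The name \"rf_large\" and this module are\n# hypothetical; RegularClassifierWrapper is reused from regular.py.\nfrom sklearn.ensemble import RandomForestClassifier\n\nfrom .binary_registry import register\nfrom .regular import RegularClassifierWrapper\n\n\n@register(\"rf_large\")\ndef instantiate_large_rf(params, **kwargs):\n    # Same shape as the built-in factories: take params, return a wrapper.\n    estimator = RandomForestClassifier(n_estimators=100, n_jobs=-1)\n    return RegularClassifierWrapper(estimator, params)\n\n\n# get_binary_model(\"rf_large\") would now return instantiate_large_rf, which\n# is exactly how logclass.py resolves params['binary_classifier'].\n"
  },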
  {
    "path": "models/multi_registry.py",
    "content": "\"\"\"Registry for multi-class models to be used for anomaly classification.\"\"\"\n\n_MULTI_MODELS = dict()\n\n\ndef register(name):\n    \"\"\"Registers a new multi-class anomaly classification model.\"\"\"\n\n    def add_to_dict(func):\n        _MULTI_MODELS[name] = func\n        return func\n\n    return add_to_dict\n\n\ndef get_multi_model(model):\n    \"\"\"Fetches the multi-class anomaly classification model.\"\"\"\n    return _MULTI_MODELS[model]\n"
  },
  {
    "path": "models/pu_learning.py",
    "content": "from .binary_registry import register\nfrom ..puLearning.puAdapter import PUAdapter\nfrom sklearn.ensemble import RandomForestClassifier\nfrom .base_model import BaseModel\nimport os\nimport pickle\n\n\nclass PUAdapterWrapper(BaseModel):\n    def __init__(self, model, params):\n        super().__init__(model, params)\n\n    def save(self, **kwargs):\n        pu_estimator_file = os.path.join(\n            self.params['models_dir'],\n            \"pu_estimator.pkl\"\n            )\n        pu_saver = {'estimator': self.model.estimator,\n                    'c': self.model.c}\n        with open(pu_estimator_file, 'wb') as pu_estimator_file:\n            pickle.dump(pu_saver, pu_estimator_file)\n\n    def load(self, **kwargs):\n        pu_estimator_file = os.path.join(\n            self.params['models_dir'],\n            \"pu_estimator.pkl\"\n            )\n        with open(pu_estimator_file, 'rb') as pu_estimator_file:\n            pu_saver = pickle.load(pu_estimator_file)\n            estimator = pu_saver['estimator']\n            pu_estimator = PUAdapter(estimator)\n            pu_estimator.c = pu_saver['c']\n            pu_estimator.estimator_fitted = True\n            self.model = pu_estimator\n\n\n@register(\"pu_learning\")\ndef instatiate_pu_adapter(params, **kwargs):\n    \"\"\"\n        Returns a RF adapted to do PU Learning wrapped by the PUAdapterWrapper.\n    \"\"\"\n    hparms = {\n        'n_estimators': 10,\n        'criterion': \"entropy\",\n        'bootstrap': True,\n        'n_jobs': -1,\n    }\n    hparms.update(kwargs)\n    estimator = RandomForestClassifier(**hparms)\n    wrapped_pu_estimator = PUAdapterWrapper(PUAdapter(estimator), params)\n    return wrapped_pu_estimator\n"
  },
  {
    "path": "models/regular.py",
    "content": "from .binary_registry import register\nfrom sklearn.ensemble import RandomForestClassifier\nfrom .base_model import BaseModel\nimport os\nimport pickle\n\n\nclass RegularClassifierWrapper(BaseModel):\n    def __init__(self, model, params):\n        super().__init__(model, params)\n\n    def save(self, **kwargs):\n        regular_file = os.path.join(\n            self.params['models_dir'],\n            \"regular.pkl\"\n            )\n        with open(regular_file, 'wb') as regular_clf_file:\n            pickle.dump(self.model, regular_clf_file)\n\n    def load(self, **kwargs):\n        regular_file = os.path.join(\n            self.params['models_dir'],\n            \"regular.pkl\"\n            )\n        with open(regular_file, 'rb') as regular_clf_file:\n            regular_classifier = pickle.load(regular_clf_file)\n            self.model = regular_classifier\n\n\n@register(\"regular\")\ndef instatiate_regular_classifier(params, **kwargs):\n    \"\"\"\n        Returns a RF wrapped by the PU Learning Adapter.\n    \"\"\"\n    hparms = {\n        'n_estimators': 10,\n        'bootstrap': True,\n        'n_jobs': -1,\n    }\n    hparms.update(kwargs)\n    wrapped_regular = RegularClassifierWrapper(\n        RandomForestClassifier(**hparms), params)\n    return wrapped_regular\n"
  },
  {
    "path": "models/svm.py",
    "content": "from .multi_registry import register\nfrom sklearn.svm import LinearSVC\nfrom .base_model import BaseModel\nimport os\nimport pickle\n\n\nclass SVMWrapper(BaseModel):\n    def __init__(self, model, params):\n        super().__init__(model, params)\n\n    def save(self, **kwargs):\n        multi_file = os.path.join(\n            self.params['models_dir'],\n            \"multi.pkl\"\n            )\n        with open(multi_file, 'wb') as multi_clf_file:\n            pickle.dump(self.model, multi_clf_file)\n\n    def load(self, **kwargs):\n        multi_file = os.path.join(\n            self.params['models_dir'],\n            \"multi.pkl\"\n            )\n        with open(multi_file, 'rb') as multi_clf_file:\n            multi_classifier = pickle.load(multi_clf_file)\n            self.model = multi_classifier\n\n\n@register(\"svm\")\ndef instatiate_svm(params, **kwargs):\n    \"\"\"\n        Returns a RF wrapped by the PU Learning Adapter.\n    \"\"\"\n    hparms = {\n        'penalty': \"l2\",\n        'dual': False,\n        'tol': 1e-1,\n    }\n    hparms.update(kwargs)\n    wrapped_svm = SVMWrapper(LinearSVC(**hparms), params)\n    return wrapped_svm\n"
  },
  {
    "path": "preprocess/__init__.py",
    "content": "__all__ = [\n    \"bgl_preprocessor\",\n    \"open_source_logs\",\n]\n\n"
  },
  {
    "path": "preprocess/bgl_preprocessor.py",
    "content": "from .registry import register\nfrom .utils import process_logs, remove_parameters\nimport re\n\n\nrecid_regx = re.compile(r\"^(\\d+)\")\nseparator = re.compile(r\"(?:-.{1,3}){2} (.+)$\")\nmsg_split_regx = re.compile(r\"x'.+'\")\nseverity = re.compile(r\"(\\w+)\\s+(INFO|WARN|ERROR|FATAL)\")\n\n\ndef process_line(line):\n    line = line.strip()\n    sep = separator.search(line)\n    if sep:\n        msg = sep.group(1).strip().split('   ')[-1].strip()\n        msg = msg_split_regx.split(msg)[-1].strip()\n        error_label = severity.search(line)\n        recid = recid_regx.search(line)\n        if recid and error_label and len(msg) > 20:\n            # recid = recid.group(1).strip() We may want to use it later\n            general_label = error_label.group(2)\n            label = error_label.group(1)\n            if general_label == 'WARN':\n                return ''\n            if general_label == 'INFO':  # or label == 'WARN':\n                label = 'unlabeled'\n            msg = remove_parameters(msg)\n            if msg:\n                msg = ' '.join((label, msg))\n                msg = ''.join((msg, '\\n'))\n                return msg\n    return ''\n\n\n@register(\"bgl\")\ndef preprocess_dataset(params):\n    \"\"\"\n    Runs BGL logs preprocessing executor.\n    \"\"\"\n    input_source = params['raw_logs']\n    output = params['logs']\n    params['healthy_label'] = 'unlabeled'\n    process_logs(input_source, output, process_line)\n"
  },
  {
    "path": "preprocess/open_source_logs.py",
    "content": "import os\nfrom multiprocessing import Pool\n\nfrom tqdm import tqdm\n\nfrom .registry import register\nfrom .utils import remove_parameters\n\n\ndef process_line(line):\n    label = line[0].strip()\n    msg = \" \".join(line[1].strip().split()[1:])\n    msg = remove_parameters(msg)\n    if msg:\n        msg = \" \".join((label, msg))\n        msg = \"\".join((msg, \"\\n\"))\n        return msg\n    return \"\"\n\n\ndef process_open_source(input_source, output):\n    with open(output, \"w\", encoding=\"latin-1\") as f:\n        gtruth = os.path.join(input_source, \"groundtruth.seq\")\n        rawlog = os.path.join(input_source, \"rawlog.log\")\n        with open(gtruth, \"r\", encoding=\"latin-1\") as IN:\n            line_count = sum(1 for line in IN)\n        with open(gtruth, \"r\", encoding=\"latin-1\") as in_gtruth:\n            with open(rawlog, \"r\", encoding=\"latin-1\") as in_log:\n                IN = zip(in_gtruth, in_log)\n                with Pool() as pool:\n                    results = pool.imap(process_line, IN, chunksize=10000)\n                    f.writelines(tqdm(results, total=line_count))\n\n\nopen_source_datasets = [\n    \"open_Apache\",\n    \"open_bgl\",\n    \"open_hadoop\",\n    \"open_hdfs\",\n    \"open_hpc\",\n    \"open_proxifier\",\n    \"open_zookeeper\",\n]\nfor dataset in open_source_datasets:\n\n    @register(dataset)\n    def preprocess_dataset(params):\n        \"\"\"\n        Runs open source logs preprocessing executor.\n        \"\"\"\n        input_source = params[\"raw_logs\"]\n        output = params[\"logs\"]\n        params[\"healthy_label\"] = \"NA\"\n        process_open_source(input_source, output)\n"
  },
  {
    "path": "preprocess/registry.py",
    "content": "\"\"\"Basic registry for logs preprocessors. These read the rawlog file and\noutputs filtered logs removing parameter words or tokens with non-letter\ncharacters keeping only text words.\"\"\"\n\n_PREPROCESSORS = dict()\n\n\ndef register(name):\n    \"\"\"Registers a new logs preprocessor function under the given name.\"\"\"\n\n    def add_to_dict(func):\n        _PREPROCESSORS[name] = func\n        return func\n\n    return add_to_dict\n\n\ndef get_preprocessor(data_src):\n    \"\"\"Fetches the logs preprocessor function associated with the given raw logs\"\"\"\n    return _PREPROCESSORS[data_src]\n"
  },
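  {
    "path": "preprocess/example_preprocessor.py",
    "content": "# Illustrative sketch, not part of the original LogClass code: registering a\n# preprocessor for a hypothetical dataset name \"my_logs\", following the\n# shape of bgl_preprocessor.py. The raw format assumed here\n# (\"<label> <message>\" per line) is an assumption for illustration only.\nfrom .registry import register\nfrom .utils import process_logs, remove_parameters\n\n\ndef process_line(line):\n    # Assumed raw format: a label token followed by the free-text message.\n    label, _, msg = line.strip().partition(' ')\n    msg = remove_parameters(msg)\n    if msg:\n        return ' '.join((label, msg)) + '\\n'\n    return ''\n\n\n@register(\"my_logs\")\ndef preprocess_dataset(params):\n    params['healthy_label'] = 'unlabeled'  # assumed healthy label\n    process_logs(params['raw_logs'], params['logs'], process_line)\n"
  },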
  {
    "path": "preprocess/utils.py",
    "content": "import re\nimport numpy as np\nfrom tqdm import tqdm\nfrom ..decorators import print_step\nfrom multiprocessing import Pool\n\n\n# Compiling for optimization\nre_sub_1 = re.compile(r\"(:(?=\\s))|((?<=\\s):)\")\nre_sub_2 = re.compile(r\"(\\d+\\.)+\\d+\")\nre_sub_3 = re.compile(r\"\\d{2}:\\d{2}:\\d{2}\")\nre_sub_4 = re.compile(r\"Mar|Apr|Dec|Jan|Feb|Nov|Oct|May|Jun|Jul|Aug|Sep\")\nre_sub_5 = re.compile(r\":?(\\w+:)+\")\nre_sub_6 = re.compile(r\"\\.|\\(|\\)|\\<|\\>|\\/|\\-|\\=|\\[|\\]\")\np = re.compile(r\"[^(A-Za-z)]\")\ndef remove_parameters(msg):\n    # Removing parameters with Regex\n    msg = re.sub(re_sub_1, \"\", msg)\n    msg = re.sub(re_sub_2, \"\", msg)\n    msg = re.sub(re_sub_3, \"\", msg)\n    msg = re.sub(re_sub_4, \"\", msg)\n    msg = re.sub(re_sub_5, \"\", msg)\n    msg = re.sub(re_sub_6, \" \", msg)\n    L = msg.split()\n    # Filtering strings that have non-letter tokens\n    new_msg = [k for k in L if not p.search(k)]\n    msg = \" \".join(new_msg)\n    return msg\n\n\ndef remove_parameters_slower(msg):\n    # Removing parameters with Regex\n    msg = re.sub(r\"(:(?=\\s))|((?<=\\s):)\", \"\", msg)\n    msg = re.sub(r\"(\\d+\\.)+\\d+\", \"\", msg)\n    msg = re.sub(r\"\\d{2}:\\d{2}:\\d{2}\", \"\", msg)\n    msg = re.sub(r\"Mar|Apr|Dec|Jan|Feb|Nov|Oct|May|Jun|Jul|Aug|Sep\", \"\", msg)\n    msg = re.sub(r\":?(\\w+:)+\", \"\", msg)\n    msg = re.sub(r\"\\.|\\(|\\)|\\<|\\>|\\/|\\-|\\=|\\[|\\]\", \" \", msg)\n    L = msg.split()\n    p = re.compile(\"[^(A-Za-z)]\")\n    # Filtering strings that have non-letter tokens\n    new_msg = [k for k in L if not p.search(k)]\n    msg = \" \".join(new_msg)\n    return msg\n\n\n@print_step\ndef process_logs(input_source, output, process_line=None):\n    with open(output, \"w\", encoding='latin-1') as f:\n        # counting first to show progress with tqdm\n        with open(input_source, 'r', encoding='latin-1') as IN:\n            line_count = sum(1 for line in IN)\n        with open(input_source, 'r', encoding='latin-1') as IN:\n            with Pool() as pool:\n                results = pool.imap(process_line, IN, chunksize=10000)\n                f.writelines(tqdm(results, total=line_count))\n\n\n@print_step\ndef load_logs(params, ignore_unlabeled=False):\n    log_path = params['logs']\n    unlabel_label = params['healthy_label']\n    x_data = []\n    y_data = []\n    label_dict = {}\n    target_names = []\n    with open(log_path, 'r', encoding='latin-1') as IN:\n        line_count = sum(1 for line in IN)\n    with open(log_path, 'r', encoding='latin-1') as IN:\n        for line in tqdm(IN, total=line_count):\n            L = line.strip().split()\n            label = L[0]\n            if label not in label_dict:\n                if ignore_unlabeled and label == unlabel_label:\n                    continue\n                if label == unlabel_label:\n                    label_dict[label] = -1.0\n                elif label not in label_dict:\n                    label_dict[label] = len(label_dict)\n                    target_names.append(label)\n            x_data.append(\" \".join(L[1:]))\n            y_data.append(label_dict[label])\n    x_data = np.array(x_data)\n    y_data = np.array(y_data)\n    return x_data, y_data, target_names\n"
  },
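  {
    "path": "examples/remove_parameters_demo.py",
    "content": "# Illustrative usage, not part of the original LogClass code: what\n# remove_parameters() does to a couple of synthetic log messages. Assumes\n# the repo is importable as the LogClass package (see the README's run\n# command).\nfrom LogClass.preprocess.utils import remove_parameters\n\n# Numeric parameters are stripped first, then any token still containing a\n# non-letter character is dropped, keeping only plain words.\nprint(remove_parameters(\"connection from 192.168.0.1 port 5050 failed\"))\n# -> \"connection from port failed\"\nprint(remove_parameters(\"Jan 01 12:00:00 kernel: error on node-3\"))\n# -> \"kernel error on node\"\n"
  },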
  {
    "path": "puLearning/__init__.py",
    "content": ""
  },
  {
    "path": "puLearning/puAdapter.py",
    "content": "#!/usr/bin/python\n# -*- coding: UTF-8 -*-\n\n# **********************************************************\n# * Author        : Weibin Meng\n# * Email         : m_weibin@163.com\n# * Create time   : 2019-07-24 15:10\n# * Last modified : 2019-07-24 15:10\n# * Filename      : puAdapter.py\n# * Description   : \n'''\n'''\n# **********************************************************\n#!/usr/bin/env python\n#-*- coding:utf-8 -*-\n\"\"\"\nCreated on Dec 21, 2012\n\n@author: Alexandre\n\"\"\"\nimport numpy as np\n\nclass PUAdapter(object):\n    \"\"\"\n    Adapts any probabilistic binary classifier to positive-unlabled learning using the PosOnly method proposed by\n    Elkan and Noto:\n\n    Elkan, Charles, and Keith Noto. \\\"Learning classifiers from only positive and unlabeled data.\\\"\n    Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008.\n    \"\"\"\n\n\n    def __init__(self, estimator, hold_out_ratio=0.1, precomputed_kernel=False):\n        \"\"\"\n        estimator -- An estimator of p(s=1|x) that must implement:\n                     * predict_proba(X): Takes X, which can be a list of feature vectors or a precomputed\n                                         kernel matrix and outputs p(s=1|x) for each example in X\n                     * fit(X,y): Takes X, which can be a list of feature vectors or a precomputed\n                                 kernel matrix and takes y, which are the labels associated to the\n                                 examples in X\n        hold_out_ratio -- The ratio of training examples that must be held out of the training set of examples\n                          to estimate p(s=1|y=1) after training the estimator\n        precomputed_kernel -- Specifies if the X matrix for predict_proba and fit is a precomputed kernel matrix\n        \"\"\"\n        self.estimator = estimator\n        self.c = 1.0\n        self.hold_out_ratio = hold_out_ratio\n\n        if precomputed_kernel:\n            self.fit = self.__fit_precomputed_kernel\n        else:\n            self.fit = self.__fit_no_precomputed_kernel\n\n        self.estimator_fitted = False\n\n    def __str__(self):\n        return 'Estimator:' + str(self.estimator) + '\\n' + 'p(s=1|y=1,x) ~= ' + str(self.c) + '\\n' + \\\n            'Fitted: ' + str(self.estimator_fitted)\n\n\n    def __fit_precomputed_kernel(self, X, y):\n        \"\"\"\n        Fits an estimator of p(s=1|x) and estimates the value of p(s=1|y=1) using a subset of the training examples\n\n        X -- Precomputed kernel matrix\n        y -- Labels associated to each example in X (Positive label: 1.0, Negative label: -1.0)\n        \"\"\"\n        positives = np.where(y == 1.)[0]\n        hold_out_size = np.ceil(len(positives) * self.hold_out_ratio)\n\n        if len(positives) <= hold_out_size:\n            raise('Not enough positive examples to estimate p(s=1|y=1,x). 
Need at least ' + str(hold_out_size + 1) + '.')\n\n        np.random.shuffle(positives)\n        hold_out = positives[:hold_out_size]\n\n        #Hold out test kernel matrix\n        X_test_hold_out = X[hold_out]\n        keep = list(set(np.arange(len(y))) - set(hold_out))\n        X_test_hold_out = X_test_hold_out[:,keep]\n\n        #New training kernel matrix\n        X = X[:, keep]\n        X = X[keep]\n\n        y = np.delete(y, hold_out)\n\n        self.estimator.fit(X, y)\n\n        hold_out_predictions = self.estimator.predict_proba(X_test_hold_out)\n\n        try:\n            hold_out_predictions = hold_out_predictions[:,1]\n        except:\n            pass\n\n        c = np.mean(hold_out_predictions)\n        self.c = c\n\n        self.estimator_fitted = True\n\n\n    def __fit_no_precomputed_kernel(self, X, y):\n        \"\"\"\n        Fits an estimator of p(s=1|x) and estimates the value of p(s=1|y=1,x)\n\n        X -- List of feature vectors\n        y -- Labels associated to each feature vector in X (Positive label: 1.0, Negative label: -1.0)\n        \"\"\"\n        positives = np.where(y == 1.)[0]\n        hold_out_size = np.ceil(len(positives) * self.hold_out_ratio)\n\n        if len(positives) <= hold_out_size:\n            raise('Not enough positive examples to estimate p(s=1|y=1,x). Need at least ' + str(hold_out_size + 1) + '.')\n\n        np.random.shuffle(positives)\n        #print hold_out_size\n        #print type(hold_out_size)\n        hold_out = positives[:int(hold_out_size)]\n        X_hold_out = X[hold_out]\n        X = np.delete(X, hold_out,0)\n        y = np.delete(y, hold_out)\n\n        self.estimator.fit(X, y)\n\n        hold_out_predictions = self.estimator.predict_proba(X_hold_out)\n\n        try:\n            hold_out_predictions = hold_out_predictions[:,1]\n        except:\n            pass\n\n        c = np.mean(hold_out_predictions)\n        self.c = c\n\n        self.estimator_fitted = True\n\n\n    def predict_proba(self, X):\n        \"\"\"\n        Predicts p(y=1|x) using the estimator and the value of p(s=1|y=1) estimated in fit(...)\n\n        X -- List of feature vectors or a precomputed kernel matrix\n        \"\"\"\n        if not self.estimator_fitted:\n            raise Exception('The estimator must be fitted before calling predict_proba(...).')\n\n        probabilistic_predictions = self.estimator.predict_proba(X)\n\n        try:\n            probabilistic_predictions = probabilistic_predictions[:,1]\n        except:\n            pass\n\n        return probabilistic_predictions / self.c\n\n\n    def predict(self, X, treshold=0.5):\n        \"\"\"\n        Assign labels to feature vectors based on the estimator's predictions\n\n        X -- List of feature vectors or a precomputed kernel matrix\n        treshold -- The decision treshold between the positive and the negative class\n        \"\"\"\n        if not self.estimator_fitted:\n            raise Exception('The estimator must be fitted before calling predict(...).')\n\n        return np.array([1. if p > treshold else -1. for p in self.predict_proba(X)])\n\n\n\n"
  },
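  {
    "path": "examples/pu_adapter_demo.py",
    "content": "# Illustrative usage, not part of the original LogClass code: PUAdapter on\n# synthetic positive-unlabeled data. Labels follow the convention in\n# puAdapter.py: 1.0 for known positives, -1.0 for unlabeled/negative.\n# Assumes the repo is importable as the LogClass package.\nimport numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\nfrom LogClass.puLearning.puAdapter import PUAdapter\n\nrng = np.random.RandomState(0)\nX = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(3, 1, (100, 5))])\ny = np.hstack([-np.ones(100), np.ones(100)])\n# Hide 60% of the true positives: they become unlabeled (-1.0).\ny[100:160] = -1.0\n\npu = PUAdapter(RandomForestClassifier(n_estimators=10), hold_out_ratio=0.2)\npu.fit(X, y)\n# predict() thresholds p(y=1|x) = p(s=1|x) / c at 0.5 by default.\nprint(pu.predict(X[-5:]))\n"
  },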
  {
    "path": "reporting/__init__.py",
    "content": "__all__ = [\n    \"accuracy\",\n    \"confusion_matrix\",\n    \"multi_class_acc\",\n    \"top_k_svm\",\n    \"microf1\",\n    \"macrof1\",\n]\n"
  },
  {
    "path": "reporting/accuracy.py",
    "content": "from .bb_registry import register\nfrom sklearn.metrics import f1_score\n\n\n@register('acc')\ndef model_accuracy(y, pred):\n    return f1_score(y, pred)\n"
  },
  {
    "path": "reporting/bb_registry.py",
    "content": "\"\"\"Registry for black box reports or metrics.\"\"\"\n\n_BB_REPORTS = dict()\n\n\ndef register(name):\n    \"\"\"Registers a new black box report or metric function.\"\"\"\n\n    def add_to_dict(func):\n        _BB_REPORTS[name] = func\n        return func\n\n    return add_to_dict\n\n\ndef get_bb_report(model):\n    \"\"\"Fetches the black box report or metric function.\"\"\"\n    return _BB_REPORTS[model]\n"
  },
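  {
    "path": "reporting/example_report.py",
    "content": "# Illustrative sketch, not part of the original LogClass code: registering a\n# new black box report. Anything registered here can be requested through\n# the --report argument; the name 'precision' and this module are\n# hypothetical.\nfrom .bb_registry import register\nfrom sklearn.metrics import precision_score\n\n\n@register('precision')\ndef report(y, pred):\n    # Same call signature as the shipped reports: ground truth, predictions.\n    return precision_score(y, pred)\n"
  },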
  {
    "path": "reporting/confusion_matrix.py",
    "content": "from .bb_registry import register\nfrom sklearn.metrics import confusion_matrix\n\n\n@register('confusion_matrix')\ndef report(y, pred):\n    return confusion_matrix(y, pred)\n"
  },
  {
    "path": "reporting/macrof1.py",
    "content": "from .bb_registry import register\nfrom sklearn.metrics import f1_score\n\n\n@register('macro')\ndef model_accuracy(y, pred):\n    return f1_score(y, pred, average='macro')\n"
  },
  {
    "path": "reporting/microf1.py",
    "content": "from .bb_registry import register\nfrom sklearn.metrics import f1_score\n\n\n@register('micro')\ndef model_accuracy(y, pred):\n    return f1_score(y, pred, average='micro')\n"
  },
  {
    "path": "reporting/multi_class_acc.py",
    "content": "from .bb_registry import register\nfrom sklearn.metrics import accuracy_score\n\n\n@register('multi_acc')\ndef model_accuracy(y, pred):\n    return accuracy_score(y, pred)\n"
  },
  {
    "path": "reporting/top_k_svm.py",
    "content": "from .wb_registry import register\nimport numpy as np\n\n\ndef get_feature_names(params, vocabulary, add_length=True):\n    feature_names = zip(vocabulary.keys(), vocabulary.values())\n    feature_names = sorted(feature_names, key=lambda x: x[1])\n    feature_names = [x[0] for x in feature_names]\n    if 'length' in params['features']:\n        feature_names.append('LENGTH')\n    return np.array(feature_names)\n\n\n@register('top_k_svm')\ndef get_top_k_SVM_features(params, model, vocabulary, **kwargs):\n    hparms = {\n        'target_names': [],\n        'top_features': 5,\n    }\n    hparms.update(kwargs)\n\n    top_k_label = {}\n    feature_names = get_feature_names(params, vocabulary)\n    for i, label in enumerate(hparms['target_names']):\n        if len(hparms['target_names']) < 3 and i == 1:\n            break  # coef is unidemensional when there's only two labels\n        coef = model.coef_[i]\n        top_coefficients = np.argsort(coef)[-hparms['top_features']:]\n        top_k_features = feature_names[top_coefficients]\n        top_k_label[label] = list(reversed(top_k_features))\n    return top_k_label\n"
  },
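  {
    "path": "examples/top_k_svm_demo.py",
    "content": "# Illustrative usage, not part of the original LogClass code: reading the\n# top-k features per class from a fitted LinearSVC. The toy vocabulary, data\n# and params values are assumptions; in the pipeline, vocabulary comes from\n# feature extraction and model from the trained multi-class classifier.\nimport numpy as np\nfrom sklearn.svm import LinearSVC\n\nfrom LogClass.reporting.top_k_svm import get_top_k_SVM_features\n\nvocabulary = {'disk': 0, 'error': 1, 'user': 2, 'login': 3}\nX = np.array([[1., 1., 0., 0.], [0., 0., 1., 1.], [1., 0., 1., 0.]])\ny = np.array([0, 1, 2])\nmodel = LinearSVC().fit(X, y)\n\nparams = {'features': ['tfidf']}  # no 'length' feature appended here\nprint(get_top_k_SVM_features(params, model, vocabulary,\n                             target_names=['disk_fault', 'auth', 'mixed'],\n                             top_features=2))\n"
  },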
  {
    "path": "reporting/wb_registry.py",
    "content": "\"\"\"Registry for white box reports or metrics.\"\"\"\n\n_WB_REPORTS = dict()\n\n\ndef register(name):\n    \"\"\"Registers a new white box report or metric function.\"\"\"\n\n    def add_to_dict(func):\n        _WB_REPORTS[name] = func\n        return func\n\n    return add_to_dict\n\n\ndef get_wb_report(model):\n    \"\"\"Fetches the white box report or metric function.\"\"\"\n    return _WB_REPORTS[model]\n"
  },
  {
    "path": "requirements.txt",
    "content": "certifi==2019.9.11\njoblib==0.14.0\nnumpy==1.17.4\npandas==0.25.3\npython-dateutil==2.8.1\npytz==2019.3\nscikit-learn==0.21.3\nscipy==1.3.3\nsix==1.13.0\nsklearn==0.0\ntqdm==4.39.0\nwincertstore==0.2\n"
  },
  {
    "path": "run_binary.py",
    "content": "from .utils import (\n    load_params,\n    file_handling,\n    print_params,\n)\nfrom .preprocess import registry as preprocess_registry\nfrom .preprocess.utils import load_logs\nfrom .feature_engineering.utils import (\n    binary_train_gtruth,\n    extract_features,\n)\nfrom .models import binary_registry as binary_classifier_registry\nfrom .reporting import bb_registry as black_box_report_registry\nfrom .init_params import init_main_args, parse_main_args\n\n\ndef init_args():\n    \"\"\"Init command line args used for configuration.\"\"\"\n\n    parser = init_main_args()\n    return parser.parse_args()\n\n\ndef parse_args(args):\n    \"\"\"Parse provided args for runtime configuration.\"\"\"\n    params = parse_main_args(args)\n    params.update({'train': False})\n    return params\n\n\ndef inference(params, x_data, y_data, target_names):\n    # Inference\n    # Feature engineering\n    x_test, _ = extract_features(x_data, params)\n    # Binary training features\n    y_test = binary_train_gtruth(y_data)\n    # Binary PU estimator with RF\n    # Load Trained PU Estimator\n    binary_clf_getter =\\\n        binary_classifier_registry.get_binary_model(\n            params['binary_classifier'])\n    binary_clf = binary_clf_getter(params)\n    binary_clf.load()\n    # Anomaly detection\n    y_pred_pu = binary_clf.predict(x_test)\n    get_accuracy = black_box_report_registry.get_bb_report('acc')\n    binary_acc = get_accuracy(y_test, y_pred_pu)\n\n    print(binary_acc)\n    for report in params['report']:\n        try:\n            get_bb_report = black_box_report_registry.get_bb_report(report)\n            result = get_bb_report(y_test, y_pred_pu)\n        except Exception:\n            pass\n        else:\n            print(f'Binary classification {report} report:')\n            print(result)\n\n\ndef main():\n    # Init params\n    params = parse_args(init_args())\n    load_params(params)\n    print_params(params)\n    file_handling(params)\n    # Filter params from raw logs\n    if \"raw_logs\" in params:\n        preprocess = preprocess_registry.get_preprocessor(params['logs_type'])\n        preprocess(params)\n    # Load filtered params from file\n    print('Loading logs')\n    x_data, y_data, target_names = load_logs(params)\n    inference(params, x_data, y_data, target_names)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "train_binary.py",
    "content": "from sklearn.model_selection import StratifiedKFold\nfrom .utils import (\n    save_params,\n    file_handling,\n    TestingParameters,\n    print_params,\n)\nfrom .preprocess import registry as preprocess_registry\nfrom .preprocess.utils import load_logs\nfrom .feature_engineering.utils import (\n    binary_train_gtruth,\n    extract_features,\n)\nfrom tqdm import tqdm\nfrom .models import binary_registry as binary_classifier_registry\nfrom .reporting import bb_registry as black_box_report_registry\nfrom .init_params import init_main_args, parse_main_args\n\n\ndef init_args():\n    \"\"\"Init command line args used for configuration.\"\"\"\n\n    parser = init_main_args()\n    return parser.parse_args()\n\n\ndef parse_args(args):\n    \"\"\"Parse provided args for runtime configuration.\"\"\"\n    params = parse_main_args(args)\n    params.update({'train': True})\n    return params\n\n\ndef train(params, x_data, y_data, target_names):\n    # KFold Cross Validation\n    kfold = StratifiedKFold(n_splits=params['kfold']).split(x_data, y_data)\n    best_pu_fs = 0.\n    for train_index, test_index in tqdm(kfold):\n        x_train, x_test = x_data[train_index], x_data[test_index]\n        y_train, y_test = y_data[train_index], y_data[test_index]\n        x_train, _ = extract_features(x_train, params)\n        with TestingParameters(params):\n            x_test, _ = extract_features(x_test, params)\n        # Binary training features\n        y_test_pu = binary_train_gtruth(y_test)\n        y_train_pu = binary_train_gtruth(y_train)\n        # Binary PULearning with RF\n        binary_clf_getter =\\\n            binary_classifier_registry.get_binary_model(\n                params['binary_classifier'])\n        binary_clf = binary_clf_getter(params)\n        binary_clf.fit(x_train, y_train_pu)\n        y_pred_pu = binary_clf.predict(x_test)\n        get_accuracy = black_box_report_registry.get_bb_report('acc')\n        binary_acc = get_accuracy(y_test_pu, y_pred_pu)\n        better_results = binary_acc > best_pu_fs\n        if better_results:\n            if binary_acc > best_pu_fs:\n                best_pu_fs = binary_acc\n            save_params(params)\n            binary_clf.save()\n            print(binary_acc)\n\n        for report in params['report']:\n            try:\n                get_bb_report = black_box_report_registry.get_bb_report(report)\n                result = get_bb_report(y_test_pu, y_pred_pu)\n            except Exception:\n                pass\n            else:\n                print(f'Binary classification {report} report:')\n                print(result)\n\n\ndef main():\n    # Init params\n    params = parse_args(init_args())\n    file_handling(params)\n    # Filter params from raw logs\n    if \"raw_logs\" in params:\n        preprocess = preprocess_registry.get_preprocessor(params['logs_type'])\n        preprocess(params)\n    # Load filtered params from file\n    print('Loading logs')\n    x_data, y_data, target_names = load_logs(params)\n    print_params(params)\n    train(params, x_data, y_data, target_names)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "train_multi.py",
    "content": "from sklearn.model_selection import StratifiedKFold\nfrom .utils import (\n    save_params,\n    file_handling,\n    TestingParameters,\n    print_params,\n    save_results,\n)\nfrom .preprocess import registry as preprocess_registry\nfrom .preprocess.utils import load_logs\nfrom .feature_engineering.utils import (\n    multi_features,\n    extract_features,\n)\nfrom tqdm import tqdm\nfrom .models import multi_registry as multi_classifier_registry\nfrom .reporting import bb_registry as black_box_report_registry\nfrom .init_params import init_main_args, parse_main_args\n\n\ndef init_args():\n    \"\"\"Init command line args used for configuration.\"\"\"\n\n    parser = init_main_args()\n    return parser.parse_args()\n\n\ndef parse_args(args):\n    \"\"\"Parse provided args for runtime configuration.\"\"\"\n    params = parse_main_args(args)\n    params.update({'train': True})\n    return params\n\n\ndef init_results():\n    results = {\n        'exp_name': [],\n        'logs_type': [],\n        'macro': [],\n        'micro': [],\n        'train_time': [],\n        'run_time': [],\n    }\n    return results\n\n\ndef add_result(results, params, macro, micro, train_time, run_time):\n    results['exp_name'].append(params['id'])\n    results['logs_type'].append(params['logs_type'])\n    results['macro'].append(macro)\n    results['micro'].append(micro)\n    results['train_time'].append(train_time)\n    results['run_time'].append(run_time)\n\n\ndef train(params, x_data, y_data, target_names):\n    results = init_results()\n    # KFold Cross Validation\n    kfold = StratifiedKFold(n_splits=params['kfold']).split(x_data, y_data)\n    best_multi = 0.\n    for train_index, test_index in tqdm(kfold):\n        # Test & Train are interchanged to enable testing with 10% of the data\n        if params['swap']:\n            x_test, x_train = x_data[train_index], x_data[test_index]\n            y_test, y_train = y_data[train_index], y_data[test_index]\n        else:\n            x_train, x_test = x_data[train_index], x_data[test_index]\n            y_train, y_test = y_data[train_index], y_data[test_index]\n        x_train, _ = extract_features(x_train, params)\n        print(y_train.shape, y_test.shape)\n        with TestingParameters(params):\n            x_test, _ = extract_features(x_test, params)\n        # Multi-class training features\n        x_train_multi, y_train_multi =\\\n            multi_features(x_train, y_train)\n        x_test_multi, y_test_multi = multi_features(x_test, y_test)\n        # MultiClass\n        multi_classifier_getter =\\\n            multi_classifier_registry.get_multi_model(params['multi_classifier'])\n        multi_classifier = multi_classifier_getter(params)\n        multi_classifier.fit(x_train_multi, y_train_multi)\n        pred = multi_classifier.predict(x_test_multi)\n        get_multi_acc = black_box_report_registry.get_bb_report('macro')\n        macro = get_multi_acc(y_test_multi, pred)\n        get_multi_acc = black_box_report_registry.get_bb_report('micro')\n        micro = get_multi_acc(y_test_multi, pred)\n        better_results = macro > best_multi\n        if better_results:\n            save_params(params)\n            best_multi = macro\n            print(macro)\n\n        add_result(\n            results,\n            params,\n            macro,\n            micro,\n            multi_classifier.train_time,\n            multi_classifier.run_time\n            )\n\n    save_results(results, params)\n\n\ndef main():\n    # Init params\n    
params = parse_args(init_args())\n    print_params(params)\n    file_handling(params)\n    # Filter params from raw logs\n    if \"raw_logs\" in params:\n        preprocess = preprocess_registry.get_preprocessor(params['logs_type'])\n        preprocess(params)\n    # Load filtered params from file\n    x_data, y_data, target_names = load_logs(params)\n    train(params, x_data, y_data, target_names)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "utils.py",
    "content": "import os\nimport json\nimport shutil\nimport pandas as pd\n\n\n# trim is only used when showing the top keywords for each class\ndef trim(s):\n    \"\"\"Trim string to fit on terminal (assuming 80-column display)\"\"\"\n    return s if len(s) <= 80 else s[:77] + \"...\"\n\n\nclass TestingParameters():\n    def __init__(self, params):\n        self.params = params\n        self.original_state = params['train']\n\n    def __enter__(self):\n        self.params['train'] = False\n\n    def __exit__(self, exc_type, exc_value, traceback):\n        self.params['train'] = self.original_state\n\n\ndef load_params(params):\n    params_file = os.path.join(\n        params['id_dir'], f\"best_params.json\")\n    with open(params_file, \"r\") as fp:\n        best_params = json.load(fp)\n    params.update(best_params)\n\n\ndef save_params(params):\n    params_file = os.path.join(\n        params['id_dir'], f\"best_params.json\")\n    with open(params_file, \"w\") as fp:\n        json.dump(params, fp)\n\n\ndef file_handling(params):\n    if \"raw_logs\" in params:\n        if not os.path.exists(params['raw_logs']):\n            raise FileNotFoundError(\n                f\"File {params['raw_logs']} doesn't exist. \"\n                + \"Please provide the raw logs path.\"\n            )\n        logs_directory = os.path.dirname(params['logs'])\n        if not os.path.exists(logs_directory):\n            os.makedirs(logs_directory)\n    else:\n        # Checks if preprocessed logs exist as input\n        if not os.path.exists(params['logs']):\n            raise FileNotFoundError(\n                f\"File {params['base_dir']} doesn't exist. \"\n                + \"Preprocess target logs first and provide their path.\"\n            )\n\n    if params['train']:\n        # Checks if the experiment id already exists\n        if os.path.exists(params[\"id_dir\"]) and not params[\"force\"]:\n            raise FileExistsError(\n                f\"directory '{params['id_dir']} already exists. \"\n                + \"Run with --force to overwrite.\"\n                + f\"If --force is used, you could lose your training results.\"\n            )\n        if os.path.exists(params[\"id_dir\"]):\n            shutil.rmtree(params[\"id_dir\"])\n        for target_dir in ['id_dir', 'models_dir', 'features_dir']:\n            os.makedirs(params[target_dir])\n    else:\n        # Checks if input models and features are provided\n        for concern in ['models_dir', 'features_dir']:\n            target_path = params[concern]\n            if not os.path.exists(target_path):\n                raise FileNotFoundError(\n                    \"directory '{} doesn't exist. \".format(target_path)\n                    + \"Run train first before running inference.\"\n                )\n\n\ndef print_params(params):\n    print(\"{:-^80}\".format(\"params\"))\n    print(\"Beginning experiment using the following configuration:\\n\")\n    for param, value in params.items():\n        print(\"\\t{:>13}: {}\".format(param, value))\n    print()\n    print(\"-\" * 80)\n\n\ndef save_results(results, params):\n    df = pd.DataFrame(results)\n    file_name = os.path.join(\n        params['id_dir'],\n        \"results.csv\",\n        )\n    df.to_csv(file_name, index=False)\n"
  }
]