Repository: amazon-science/fraud-dataset-benchmark
Branch: main
Commit: f100cb829599
Files: 39
Total size: 623.7 KB

Directory structure:
gitextract_sn16q5ml/

├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── scripts/
│   ├── examples/
│   │   └── Test_FDB_Loader.ipynb
│   └── reproducibility/
│       ├── afd/
│       │   ├── README.md
│       │   ├── configs/
│       │   │   ├── CreditCardFraudDetection.json
│       │   │   ├── FakeJobPostingPrediction.json
│       │   │   ├── Fraudecommerce.json
│       │   │   ├── IEEECISFraudDetection.json
│       │   │   ├── IPBlocklist.json
│       │   │   ├── MaliciousURL.json
│       │   │   ├── SimulatedCreditCardTransactionsSparkov.json
│       │   │   ├── TwitterBotAccounts.json
│       │   │   └── VehicleLoanDefaultPrediction.json
│       │   ├── create_afd_resources.py
│       │   └── score_afd_model.py
│       ├── autogluon/
│       │   ├── README.md
│       │   ├── benchmark_ag.py
│       │   └── example-ag-ieeecis.ipynb
│       ├── autosklearn/
│       │   ├── README.md
│       │   └── benchmark_autosklearn.py
│       ├── benchmark_utils.py
│       ├── h2o/
│       │   ├── README.md
│       │   ├── benchmark_h2o.py
│       │   └── example-h2o-ieeecis.ipynb
│       └── label-noise/
│           ├── benchmark_experiments.ipynb
│           ├── feature_dict.py
│           ├── load_fdb_datasets.py
│           └── micro_models.py
├── setup.py
└── src/
    ├── __init__.py
    └── fdb/
        ├── __init__.py
        ├── datasets.py
        ├── kaggle_configs.py
        ├── preprocessing.py
        ├── preprocessing_objects.py
        └── versioned_datasets/
            ├── __init__.py
            └── ipblock/
                └── __init__.py

================================================
FILE CONTENTS
================================================

================================================
FILE: CODE_OF_CONDUCT.md
================================================
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2021-2022 Prince Grover
Copyright (c) 2021-2022 Zheng Li
Copyright (c) 2022 Jianbo Liu
Copyright (c) 2022 Jakub Zablocki
Copyright (c) 2022 Jianbo Liu
Copyright (c) 2022 Hao Zhou
Copyright (c) 2022 Julia Xu
Copyright (c) 2022 Anqi Cheng

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# FDB: Fraud Dataset Benchmark

*By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin Tittelfitz](jtittelfitz), Anqi Cheng, [Jakub Zablocki](qbaza), Jianbo Liu, and [Hao Zhou](haozhouamzn)*


[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 


The **Fraud Dataset Benchmark (FDB)** is a compilation of publicly available datasets relevant to **fraud detection** ([arXiv Link](https://arxiv.org/abs/2208.14417)). The FDB aims to cover a wide variety of fraud detection tasks, including card-not-present transaction fraud, bot attacks, malicious traffic, loan risk and content moderation. The Python-based data loaders from FDB provide dataset loading, standardized train-test splits and performance evaluation metrics. The goal of our work is to provide researchers working in the field of fraud and abuse detection with a standardized set of benchmarking datasets and evaluation tools for their experiments. Using FDB tools, we demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment and semi-supervised learning. 


## Datasets used in FDB
A brief summary of the datasets used in FDB. Each dataset is described in detail in the [Data Sources](#data-sources) section.

| **#** | **Dataset name**                                           | **Dataset key** | **Fraud category**                  | **#Train** | **#Test** | **Class ratio (train)** | **#Feats** | **#Cat** | **#Num** | **#Text** | **#Enrichable** |
|-------|------------------------------------------------------------|-----------------|-------------------------------------|------------|-----------|-------------------------|------------|----------|----------|-----------|-----------------|
| 1     | IEEE-CIS Fraud Detection                                   | ieeecis         | Card Not Present Transactions Fraud | 561,013    | 28,527    | 3.50%                   | 67         | 6        | 61       | 0         | 0               |
| 2     | Credit Card Fraud Detection                                | ccfraud         | Card Not Present Transactions Fraud | 227,845    | 56,962    | 0.18%                   | 28         | 0        | 28       | 0         | 0               |
| 3     | Fraud ecommerce                                            | fraudecom       | Card Not Present Transactions Fraud | 120,889    | 30,223    | 10.60%                  | 6          | 2        | 3        | 0         | 1               |
| 4     | Simulated Credit Card Transactions generated using Sparkov | sparknov        | Card Not Present Transactions Fraud | 1,296,675  | 20,000    | 5.70%                   | 17         | 10       | 6        | 1         | 0               |
| 5     | Twitter Bots Accounts                                      | twitterbot      | Bot Attacks                         | 29,950     | 7,488     | 33.10%                  | 16         | 6        | 6        | 4         | 0               |
| 6     | Malicious URLs dataset                                     | malurl          | Malicious Traffic                  | 586,072   | 65,119    | 34.20%                  | 2          | 0        | 1        | 1         | 0               |
| 7     | Fake Job Posting Prediction                                | fakejob         | Content Moderation                  | 14,304     | 3,576     | 4.70%                   | 16         | 10       | 1        | 5         | 0               |
| 8     | Vehicle Loan Default Prediction                            | vehicleloan    | Credit Risk                         | 186,523    | 46,631    | 21.60%                  | 38         | 13       | 22       | 3         | 0               |
| 9     | IP Blocklist                                               | ipblock         | Malicious Traffic                   | 172,000    | 43,000    | 7%                      | 1          | 0        | 0        | 0         | 1               |


## Installation

### Requirements
- Kaggle account
    - **Important**: The `ieeecis` dataset requires you to [**join the IEEE-CIS competition**](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call the fdb API. Otherwise you will get <span style="color:red">ApiException: (403)</span>.
- AWS account
- Python 3.7+ 

- Python requirements
```
autogluon==0.4.2
h2o==3.36.1.2
boto3==1.20.21
click==8.0.3
click-plugins==1.1.1
Faker==4.14.2
joblib==1.0.0
kaggle==1.5.12
numpy==1.19.5
pandas==1.1.2
regex==2020.7.14
scikit-learn==0.22.1
scipy==1.5.4
auto-sklearn==0.14.7
dask==2022.8.1
```

### Step 1: Setup Kaggle CLI
The `FraudDatasetBenchmark` object is going to load datasets from the source (which in most of the cases is Kaggle), and then it will modify/standardize on the fly, and provide train-test splits. So, the first step is to setup Kaggle CLI in the machine being used to run Python.

Use the instructions from the [How to Use Kaggle](https://www.kaggle.com/docs/api) guide. In particular:

Remember to download the authentication token from "My Account" on Kaggle, and save the token at `~/.kaggle/kaggle.json` on Linux and OSX, and at `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised. Hence, once you’ve downloaded the token, you should move it from your Downloads folder to this folder.
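
Before calling the loader, it can help to confirm that the token is where the Kaggle CLI expects it. The following is a minimal sketch using only the Python standard library; the helper names `kaggle_token_path` and `has_kaggle_token` are hypothetical and are not part of FDB or of the Kaggle package.

```python
from pathlib import Path


def kaggle_token_path(home=None):
    """Return the expected kaggle.json location under the given home directory.

    `home` defaults to the current user's home directory.
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_token(home=None):
    """Return True if a kaggle.json file exists at the expected location."""
    return kaggle_token_path(home).is_file()


if __name__ == "__main__":
    path = kaggle_token_path()
    if has_kaggle_token():
        print(f"Found Kaggle token at {path}")
    else:
        print(f"No Kaggle token at {path}; download it and move it there.")
```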
  
    
#### Step 1.2. [Join the IEEE-CIS competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call `fdb.datasets` with `ieeecis`. Otherwise you will get <span style="color:red">ApiException: (403)</span>.
  
  
### Step 2: Clone Repo
Once the Kaggle CLI is installed and set up, clone the GitHub repo using `git clone https://github.com/amazon-science/fraud-dataset-benchmark.git` if using HTTPS, or `git clone git@github.com:amazon-science/fraud-dataset-benchmark.git` if using SSH. 

### Step 3: Install
Once the repo is cloned, `cd` to the repo in your terminal and run `pip install .`, which will install the required classes and methods.


## FraudDatasetBenchmark Usage
The usage is straightforward, where you create a `dataset` object of `FraudDatasetBenchmark` class, and extract useful goodies like train/test splits and eval_metrics.   

**Important note**: If you are running multiple experiments that reload the dataframes several times, the default behavior of downloading from Kaggle before every load can exceed the account-level API limits. In that case, persist the downloaded dataset and load from the persisted copy: during the first call of `FraudDatasetBenchmark()`, use `load_pre_downloaded=False, delete_downloaded=False`, and for subsequent calls use `load_pre_downloaded=True, delete_downloaded=False`. The default setting is 
`load_pre_downloaded=False, delete_downloaded=True`.
```
# `display` is available by default in Jupyter; the import makes it explicit elsewhere
from IPython.display import display

from fdb.datasets import FraudDatasetBenchmark

# all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'twitterbot', 'ipblock'] 
key = 'ipblock'

obj = FraudDatasetBenchmark(
    key=key,
    load_pre_downloaded=False,  # default
    delete_downloaded=True,  # default
    add_random_values_if_real_na = { 
        "EVENT_TIMESTAMP": True, 
        "LABEL_TIMESTAMP": True,
        "ENTITY_ID": True,
        "ENTITY_TYPE": True,
        "EVENT_ID": True
        } # default
    )
print(obj.key)

# Training split: features and EVENT_LABEL
print('Train set: ')
display(obj.train.head())
print(len(obj.train.columns))
print(obj.train.shape)

# Test split
print('Test set: ')
display(obj.test.head())
print(obj.test.shape)

# Test labels are exposed separately as obj.test_labels
print('Test labels: ')
display(obj.test_labels.head())
print(obj.test_labels['EVENT_LABEL'].value_counts())
print(obj.train['EVENT_LABEL'].value_counts(normalize=True))
print('=========')

``` 
Notebook template to load dataset using FDB data-loader is available at [scripts/examples/Test_FDB_Loader.ipynb](scripts/examples/Test_FDB_Loader.ipynb)
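
The persist-then-reuse pattern from the important note above can be wrapped in a small helper. This is a sketch only: `loader_kwargs` and `load_dataset` are hypothetical names, not part of FDB, while `key`, `load_pre_downloaded` and `delete_downloaded` are the constructor arguments shown above.

```python
def loader_kwargs(first_call):
    """Keyword arguments that keep the downloaded files on disk.

    first_call=True  -> download from the source and keep the files.
    first_call=False -> reuse the files kept by an earlier call.
    """
    return {
        "load_pre_downloaded": not first_call,
        "delete_downloaded": False,
    }


def load_dataset(key, first_call):
    """Build a FraudDatasetBenchmark object with the persistence settings above."""
    # Imported lazily so that loader_kwargs can be used without fdb installed.
    from fdb.datasets import FraudDatasetBenchmark

    return FraudDatasetBenchmark(key=key, **loader_kwargs(first_call))
```

In an experiment loop, the first iteration would call `load_dataset('ipblock', first_call=True)` and every later iteration `load_dataset('ipblock', first_call=False)`, so that Kaggle is contacted only once.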

## Reproducibility
Reproducibility scripts are available at [scripts/reproducibility/](scripts/reproducibility/) in respective folders for [afd](scripts/reproducibility/afd), [autogluon](scripts/reproducibility/autogluon), [autosklearn](scripts/reproducibility/autosklearn) and [h2o](scripts/reproducibility/h2o). Each folder also has a README with steps to reproduce.


## Benchmark Results

<!-- | **Dataset key** | **AUC-ROC** |             |               |                  |                  | **Recall at 1% FPR** |             |               |                  |                  |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|:--------------------:|:-----------:|:-------------:|:----------------:|:----------------:|
|                 | **AFD OFI** | **AFD TFI** | **AutoGluon** |      **H2O**     | **Auto-sklearn** |      **AFD OFI**     | **AFD TFI** | **AutoGluon** |      **H2O**     | **Auto-sklearn** |
|     ccfraud     |    0.985    |     0.99    |      0.99     |     **0.992**    |       0.988      |         0.88         |     0.88    |      0.88     |       0.853      |       0.88       |
|     fakejob     |    0.987    |      -      |   **0.998**   |       0.99       |       0.983      |         0.786        |      -      |     0.925     |       0.781      |       0.781      |
|    fraudecom    |    0.519    |  **0.636**  |     0.522     |       0.518      |       0.515      |         0.011        |    0.099    |     0.012     |       0.009      |       0.012      |
|     ieeecis     |    0.938    |   **0.94**  |     0.855     |       0.89       |       0.932      |         0.587        |     0.56    |     0.425     |       0.442      |       0.569      |
|      malurl     |    0.985    |      -      |   **0.998**   | Training failure |        0.5       |         0.868        |      -      |     0.976     | Training failure |       0.01       |
|     sparknov    |  **0.998**  |      -      |     0.997     |       0.997      |       0.995      |           1          |      -      |     0.927     |       0.896      |       0.868      |
|    twitterbot   |    0.934    |      -      |   **0.943**   |       0.938      |       0.936      |         0.518        |      -      |     0.419     |       0.382      |       0.369      |
|   vehicleloan   |  **0.673**  |      -      |     0.669     |       0.67       |       0.664      |         0.036        |      -      |      0.04     |       0.037      |       0.035      |
|     ipblock     |  **0.937**  |      -      |     0.804     | Training failure |        0.5       |         0.466        |      -      |      0.32     | Training failure |       0.01       | -->

| **Dataset key** | **AUC-ROC** |             |               |                  |                  |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|
|                 | **AFD OFI** | **AFD TFI** | **AutoGluon** |      **H2O**     | **Auto-sklearn** |
|     ccfraud     |    0.985    |     0.99    |      0.99     |     **0.992**    |       0.988      |
|     fakejob     |    0.987    |      -      |   **0.998**   |       0.99       |       0.983      |
|    fraudecom    |    0.519    |  **0.636**  |     0.522     |       0.518      |       0.515      |
|     ieeecis     |    0.938    |   **0.94**  |     0.855     |       0.89       |       0.932      |
|      malurl     |    0.985    |      -      |   **0.998**   | Training failure |        0.5       |
|     sparknov    |  **0.998**  |      -      |     0.997     |       0.997      |       0.995      |
|    twitterbot   |    0.934    |      -      |   **0.943**   |       0.938      |       0.936      |
|   vehicleloan   |  **0.673**  |      -      |     0.669     |       0.67       |       0.664      |
|     ipblock     |  **0.937**  |      -      |     0.804     | Training failure |        0.5       |
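
For readers unfamiliar with the metric in the table, AUC-ROC can be read as the probability that a randomly chosen fraudulent event receives a higher score than a randomly chosen legitimate one, so 0.5 corresponds to ranking no better than chance. The sketch below computes it directly from that definition in plain Python. It is an illustration only, not the evaluation code used for the benchmark, and the function name `auc_roc` is hypothetical. The pairwise loop is quadratic, so it is suited to small examples rather than full datasets.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 values (1 = fraud).
    scores: iterable of model scores, higher meaning more likely fraud.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Three of the four (positive, negative) pairs are ranked correctly.
    print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```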

### ROC Curves

The numbers in the legend represent AUC-ROC from different models from our baseline evaluations on AutoML.  
![roc curves](images/all_fdb.png)


## Data Sources


1. **IEEE-CIS Fraud Detection**
    - Source URL: https://www.kaggle.com/c/ieee-fraud-detection/overview
    - Source license: https://www.kaggle.com/competitions/ieee-fraud-detection/rules
    - Variables: Anonymized product, card, address, email domain, device, transaction date information. Numeric columns with name prefixes as V, C, D and M, and meaning hidden from public.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Vesta Corporation](https://www.vesta.io/)
    - Release date: 2019-10-03
    - Description: Prepared by the IEEE Computational Intelligence Society, this card-not-present transaction fraud dataset was launched during the IEEE-CIS Fraud Detection Kaggle competition, and was provided by Vesta Corporation. The original dataset contains 393 features, which are reduced to 67 features in the benchmark. Feature selection was performed based on highly voted Kaggle kernels. The fraud rate in the training segment of the source dataset is 3.5%. We only used the training files (train transaction and train identity), containing 590,540 transactions, in the benchmark, and split them into train (95%) and test (5%) segments based on time. Based on the insights from a Kaggle kernel written by the competition winner, we added a UUID (called ENTITY_ID) that represents a fingerprint and was created using card, address, time and D1 features.

2. **Credit Card Fraud Detection**
    - Source URL: https://www.kaggle.com/mlg-ulb/creditcardfraud/
    - Source license: https://opendatacommons.org/licenses/dbcl/1-0/
    - Variables: PCA transformed features, time, amount (highly imbalanced)
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Machine Learning Group - ULB](https://mlg.ulb.ac.be/)
    - Release date: 2018-03-23
    - Description: This dataset contains anonymized credit card transactions by European cardholders in September 2013. The dataset contains 492 frauds out of 284,807 transactions over 2 days. Data only contains numerical features that are the result of a PCA transformation, plus non transformed time and amount.

3. **Fraud ecommerce**
    - Source URL: https://www.kaggle.com/vbinh002/fraud-ecommerce
    - Source license: None
    - Variables: The features include sign up time, purchase time, purchase value, device id, user id, browser, and IP address. We added a new feature that measured the time difference between sign up and purchase, as the age of an account is often an important variable in fraud detection.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Binh Vu](https://www.kaggle.com/vbinh002) 
    - Release date: 2018-12-09
    - Description: This dataset contains ~150k e-commerce transactions.

4. **Simulated Credit Card Transactions generated using Sparkov**
    - Source URL: https://www.kaggle.com/kartik2112/fraud-detection
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Transaction date, credit card number, merchant, category, amount, name, street, gender. All variables are synthetically generated using the Sparkov tool.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Kartik Shenoy](https://www.kaggle.com/kartik2112)
    - Release date: 2020-08-05
    - Description: This is a simulated credit card transaction dataset. The dataset was generated using Sparkov Data Generation tool and we modified a version of dataset created for Kaggle. It covers transactions of 1000 customers with a pool of 800 merchants over 6 months. We used both train and test segments directly from the source and randomly down sampled test segment.

5. **Twitter Bots Accounts**
    - Source URL: https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data?select=twitter_human_bots_dataset.csv
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Features like account creation date, follower and following counts, profile description, account age, meta data about profile picture and account activity, and a label indicating whether the account is human or bot.
    - Fraud category: Bot Attacks
    - Provider: [David Martín Gutiérrez](https://www.kaggle.com/davidmartngutirrez)
    - Release date: 2020-08-20
    - Description: The dataset consists of 37,438 rows corresponding to different user accounts from Twitter.

6. **Malicious URLs dataset**
    - Source URL: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: The Kaggle dataset is curated using five different sources, and contains url and type. Even though original dataset has multiclass label (type), we converted it into binary label.
    - Fraud category: Malicious Traffic
    - Provider: [Manu Siddhartha](https://www.kaggle.com/sid321axn) 
    - Release date: 2021-07-23
    - Description: The Kaggle dataset is curated using five different sources, and contains url and type. Even though original dataset has multiclass label (type), we converted it into binary label. There is no timestamp information from the source. Therefore, we generate a dummy timestamp column for consistency.

7. **Real / Fake Job Posting Prediction**
    - Source URL: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Title, location, department, company, salary range, requirements, description, benefits, telecommuting. Most of the variables are categorical and free form text in nature.
    - Fraud category: Content Moderation
    - Provider: [Shivam Bansal](https://www.kaggle.com/shivamb) 
    - Release date: 2020-02-29
    - Description: This Kaggle dataset contains 18K job descriptions, out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The task is to train a classification model to detect which job posts are fraudulent.

8. **Vehicle Loan Default Prediction**
    - Source URL: https://www.kaggle.com/avikpaul4u/vehicle-loan-default-prediction
    - Source license: Unknown
    - Variables: Loanee information, loan information, credit bureau data, and history.
    - Fraud category: Credit Risk
    - Provider: [Avik Paul](https://www.kaggle.com/avikpaul4u) 
    - Release date: 2019-11-12
    - Description: The task in this dataset is to determine the probability of vehicle loan default, particularly the risk of default on the first monthly installments. It contains data for 233k loans with 21.7% default rate.
    
9. **IP Blocklist**
    - Source URL: http://cinsscore.com/list/ci-badguys.txt
    - Source license: Unknown
    - Variables: The dataset contains an IP address and a label indicating whether the address is malicious or fake (randomly generated). A dummy categorical variable that has no relation to the label is added.
    - Fraud category: Malicious Traffic
    - Provider: [CINSscore.com](http://cinsscore.com)
    - Release date: 2017-09-25
    - Description: This dataset is made up of malicious IP addresses from cinsscore.com. To the list of malicious IP addresses, we added randomly generated IP addresses, created using Faker and labeled as benign.
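
The time-based split described for `ieeecis` above (95% train, 5% test) can be sketched as follows. This is a minimal illustration in plain Python, not the FDB implementation. The function name `time_based_split` and the toy records are hypothetical, and `EVENT_TIMESTAMP` is assumed to hold ISO 8601 strings in one uniform format, which sort chronologically when compared as plain strings.

```python
def time_based_split(events, train_fraction=0.95, time_key="EVENT_TIMESTAMP"):
    """Split events so that every training event precedes every test event.

    events: list of dicts, each carrying a sortable timestamp under `time_key`.
    Returns (train, test), where train holds the earliest `train_fraction`
    of the events and test holds the remainder.
    """
    ordered = sorted(events, key=lambda event: event[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    # Toy records, deliberately out of chronological order.
    toy_events = [
        {"EVENT_ID": 3, "EVENT_TIMESTAMP": "2022-03-01T00:00:00Z"},
        {"EVENT_ID": 1, "EVENT_TIMESTAMP": "2022-01-01T00:00:00Z"},
        {"EVENT_ID": 4, "EVENT_TIMESTAMP": "2022-04-01T00:00:00Z"},
        {"EVENT_ID": 2, "EVENT_TIMESTAMP": "2022-02-01T00:00:00Z"},
    ]
    train, test = time_based_split(toy_events, train_fraction=0.75)
    print([e["EVENT_ID"] for e in train], [e["EVENT_ID"] for e in test])
```

Splitting on time rather than at random keeps future events out of the training segment, which is closer to how a fraud model is used after deployment.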
    

## Citation
```
@misc{grover2023fraud,
      title={Fraud Dataset Benchmark and Applications}, 
      author={Prince Grover and Julia Xu and Justin Tittelfitz and Anqi Cheng and Zheng Li and Jakub Zablocki and Jianbo Liu and Hao Zhou},
      year={2023},
      eprint={2208.14417},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file.


## Acknowledgement
We thank the creators of all datasets used in the benchmark and the organizations that have helped in hosting the datasets and making them widely available for research purposes. 


================================================
FILE: scripts/examples/Test_FDB_Loader.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('../../src/')\n",
    "from fdb.datasets import FraudDatasetBenchmark\n",
    "from fdb.kaggle_configs import KAGGLE_CONFIGS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container { width:90% }</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Notebook setups\n",
    "\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from io import StringIO\n",
    "\n",
    "from IPython.core.display import display, HTML\n",
    "from IPython.display import clear_output\n",
    "display(HTML(\"<style>.container { width:90% }</style>\"))\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_colwidth', 200)\n",
    "pd.set_option('display.max_rows', 500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "\n",
    "if os.path.exists('tmp'):\n",
    "    shutil.rmtree('tmp')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNCOMMENT IF YOU NEED TO UPLOAD DATA TO AN S3 BUCKET IN YOUR ACCOUNT\n",
    "\n",
    "# import boto3\n",
    "# BUCKET='<ADD S3 BUCKET NAME IF YOU WANT TO UPLOAD DATA TO YOUR ACCOUNT>'\n",
    "\n",
    "# def _s3_upload(df):\n",
    "#     csv_memory=StringIO()\n",
    "#     df.to_csv(csv_memory, index=False)\n",
    "#     content = csv_memory.getvalue()\n",
    "#     s3_client.put_object(\n",
    "#         Body=content,\n",
    "#         Bucket=BUCKET,\n",
    "#         Key=KEY,\n",
    "#        ACL='bucket-owner-full-control')\n",
    "# s3_client = boto3.client('s3')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# All options for keys\n",
    "all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud','fraudecom', 'twitterbot', 'ipblock']\n",
    "# all_keys = ['ipblock']\n",
    "# all_keys = ['twitterbot']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Default setting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default setting pulls data from the source to your system, modifies the data, and adds random values for columns that are missing if the corresponding add_random_values_if_real_na flags are True.\n",
    "\n",
    "Default parameters: \n",
    "- load_pre_downloaded: False\n",
    "- delete_downloaded: True\n",
    "- add_random_values_if_real_na = ```\n",
    "{\n",
    "\"EVENT_TIMESTAMP\": True,\n",
    "\"LABEL_TIMESTAMP\": True,\n",
    "\"ENTITY_ID\": True,\n",
    "\"ENTITY_TYPE\": True,\n",
    "\"EVENT_ID\": True\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "fakejob\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>title</th>\n",
       "      <th>location</th>\n",
       "      <th>department</th>\n",
       "      <th>salary_range</th>\n",
       "      <th>company_profile</th>\n",
       "      <th>description</th>\n",
       "      <th>requirements</th>\n",
       "      <th>benefits</th>\n",
       "      <th>telecommuting</th>\n",
       "      <th>has_company_logo</th>\n",
       "      <th>has_questions</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>required_experience</th>\n",
       "      <th>required_education</th>\n",
       "      <th>industry</th>\n",
       "      <th>function</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5736</th>\n",
       "      <td>5737</td>\n",
       "      <td>Jr. Business Analyst &amp; Quality Analyst (entry level)</td>\n",
       "      <td>US, NJ, PISCATAWAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial &amp;amp; Health care clients.Candidate should have knowledge or experience in ...</td>\n",
       "      <td>What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Entry level</td>\n",
       "      <td>Master's Degree</td>\n",
       "      <td>Financial Services</td>\n",
       "      <td>Finance</td>\n",
       "      <td>0</td>\n",
       "      <td>382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b</td>\n",
       "      <td>2022-12-13T13:05:21Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7106</th>\n",
       "      <td>7107</td>\n",
       "      <td>English Teacher Abroad</td>\n",
       "      <td>US, PA, Scranton</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>We help teachers get safe &amp;amp; secure jobs abroad :)</td>\n",
       "      <td>Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr...</td>\n",
       "      <td>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only</td>\n",
       "      <td>See job description</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Contract</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Education Management</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>deadb697-08d2-4dca-83ec-a15d5e501a5b</td>\n",
       "      <td>2022-07-26T01:40:53Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11978</th>\n",
       "      <td>11979</td>\n",
       "      <td>SQL Server Database Developer Job opportunity at Barrington, IL</td>\n",
       "      <td>US, IL, Barrington</td>\n",
       "      <td>NaN</td>\n",
       "      <td>90000-100000</td>\n",
       "      <td>We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc...</td>\n",
       "      <td>Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...</td>\n",
       "      <td>Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...</td>\n",
       "      <td>Benefits - FullBonus Eligible - Yes</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Mid-Senior level</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Information Technology and Services</td>\n",
       "      <td>Information Technology</td>\n",
       "      <td>0</td>\n",
       "      <td>f5fcea87-6798-4529-a6c7-205d893b9b24</td>\n",
       "      <td>2023-03-09T13:06:59Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9374</th>\n",
       "      <td>9375</td>\n",
       "      <td>Legal Analyst - 12 Month FTC</td>\n",
       "      <td>GB, LND, London</td>\n",
       "      <td>Legal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo...</td>\n",
       "      <td>DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. We are seeking to bring much-needed innovation to the banki...</td>\n",
       "      <td>Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col...</td>\n",
       "      <td>Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible.</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Associate</td>\n",
       "      <td>Professional</td>\n",
       "      <td>Financial Services</td>\n",
       "      <td>Legal</td>\n",
       "      <td>0</td>\n",
       "      <td>114fbd01-0573-42cf-9365-78729264e1aa</td>\n",
       "      <td>2022-12-09T08:17:07Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1300</th>\n",
       "      <td>1301</td>\n",
       "      <td>Part-Time Finance Assistant</td>\n",
       "      <td>GB, LND,</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a...</td>\n",
       "      <td>Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec...</td>\n",
       "      <td>Salary:£9 - £10 per hour</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Part-time</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>05a5dbdb-9778-4e4a-b967-7850dd483a54</td>\n",
       "      <td>2022-08-28T17:32:28Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      EVENT_ID  \\\n",
       "5736      5737   \n",
       "7106      7107   \n",
       "11978    11979   \n",
       "9374      9375   \n",
       "1300      1301   \n",
       "\n",
       "                                                                 title  \\\n",
       "5736              Jr. Business Analyst & Quality Analyst (entry level)   \n",
       "7106                                           English Teacher Abroad    \n",
       "11978  SQL Server Database Developer Job opportunity at Barrington, IL   \n",
       "9374                                      Legal Analyst - 12 Month FTC   \n",
       "1300                                       Part-Time Finance Assistant   \n",
       "\n",
       "                 location department  salary_range  \\\n",
       "5736   US, NJ, PISCATAWAY        NaN           NaN   \n",
       "7106    US, PA, Scranton         NaN           NaN   \n",
       "11978  US, IL, Barrington        NaN  90000-100000   \n",
       "9374      GB, LND, London      Legal           NaN   \n",
       "1300            GB, LND,         NaN           NaN   \n",
       "\n",
       "                                                                                                                                                                                               company_profile  \\\n",
       "5736                                                                                                                                                                                                       NaN   \n",
       "7106                                                                                                                                                     We help teachers get safe &amp; secure jobs abroad :)   \n",
       "11978  We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc...   \n",
       "9374   MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo...   \n",
       "1300                                                                                                                                                                                                       NaN   \n",
       "\n",
       "                                                                                                                                                                                                   description  \\\n",
       "5736   Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial &amp; Health care clients.Candidate should have knowledge or experience in ...   \n",
       "7106   Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr...   \n",
       "11978  Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...   \n",
       "9374   DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. We are seeking to bring much-needed innovation to the banki...   \n",
       "1300   Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a...   \n",
       "\n",
       "                                                                                                                                                                                                  requirements  \\\n",
       "5736   What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad...   \n",
       "7106                                                                        University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only   \n",
       "11978  Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...   \n",
       "9374   Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col...   \n",
       "1300   Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec...   \n",
       "\n",
       "                                                                                                                                                                                                benefits  \\\n",
       "5736                                                                                                                                                                                                 NaN   \n",
       "7106                                                                                                                                                                                 See job description   \n",
       "11978                                                                                                                                                                Benefits - FullBonus Eligible - Yes   \n",
       "9374   Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible.    \n",
       "1300                                                                                                                                                                           Salary:£9 - £10 per hour    \n",
       "\n",
       "      telecommuting has_company_logo has_questions employment_type  \\\n",
       "5736              0                0             0       Full-time   \n",
       "7106              0                1             1        Contract   \n",
       "11978             0                0             0       Full-time   \n",
       "9374              0                1             0       Full-time   \n",
       "1300              0                0             0       Part-time   \n",
       "\n",
       "      required_experience required_education  \\\n",
       "5736          Entry level    Master's Degree   \n",
       "7106                  NaN  Bachelor's Degree   \n",
       "11978    Mid-Senior level  Bachelor's Degree   \n",
       "9374            Associate       Professional   \n",
       "1300                  NaN                NaN   \n",
       "\n",
       "                                  industry                function  \\\n",
       "5736                    Financial Services                 Finance   \n",
       "7106                  Education Management                     NaN   \n",
       "11978  Information Technology and Services  Information Technology   \n",
       "9374                    Financial Services                   Legal   \n",
       "1300                            Accounting                     NaN   \n",
       "\n",
       "       EVENT_LABEL                             ENTITY_ID  \\\n",
       "5736             0  382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b   \n",
       "7106             0  deadb697-08d2-4dca-83ec-a15d5e501a5b   \n",
       "11978            0  f5fcea87-6798-4529-a6c7-205d893b9b24   \n",
       "9374             0  114fbd01-0573-42cf-9365-78729264e1aa   \n",
       "1300             0  05a5dbdb-9778-4e4a-b967-7850dd483a54   \n",
       "\n",
       "            EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "5736   2022-12-13T13:05:21Z  2023-05-05T08:46:09Z        user  \n",
       "7106   2022-07-26T01:40:53Z  2023-05-05T08:46:09Z        user  \n",
       "11978  2023-03-09T13:06:59Z  2023-05-05T08:46:09Z        user  \n",
       "9374   2022-12-09T08:17:07Z  2023-05-05T08:46:09Z        user  \n",
       "1300   2022-08-28T17:32:28Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22\n",
      "(14304, 22)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>title</th>\n",
       "      <th>location</th>\n",
       "      <th>department</th>\n",
       "      <th>salary_range</th>\n",
       "      <th>company_profile</th>\n",
       "      <th>description</th>\n",
       "      <th>requirements</th>\n",
       "      <th>benefits</th>\n",
       "      <th>telecommuting</th>\n",
       "      <th>has_company_logo</th>\n",
       "      <th>has_questions</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>required_experience</th>\n",
       "      <th>required_education</th>\n",
       "      <th>industry</th>\n",
       "      <th>function</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>Customer Service Associate - Part Time</td>\n",
       "      <td>US, AZ, Phoenix</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr...</td>\n",
       "      <td>The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai...</td>\n",
       "      <td>Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Part-time</td>\n",
       "      <td>Entry level</td>\n",
       "      <td>High School or equivalent</td>\n",
       "      <td>Financial Services</td>\n",
       "      <td>Customer Service</td>\n",
       "      <td>1743dd4b-f989-4227-8480-cbafa760b4de</td>\n",
       "      <td>2022-12-31T18:14:06Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>Account Executive - Sydney</td>\n",
       "      <td>AU, NSW, Sydney</td>\n",
       "      <td>Sales</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ...</td>\n",
       "      <td>Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th...</td>\n",
       "      <td>You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication...</td>\n",
       "      <td>In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Associate</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Internet</td>\n",
       "      <td>Sales</td>\n",
       "      <td>d5a82588-fcff-495b-aeda-20a8de0737d0</td>\n",
       "      <td>2022-06-20T15:25:47Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>16</td>\n",
       "      <td>VP of Sales - Vault Dragon</td>\n",
       "      <td>SG, 01, Singapore</td>\n",
       "      <td>Sales</td>\n",
       "      <td>120000-150000</td>\n",
       "      <td>Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ...</td>\n",
       "      <td>About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count...</td>\n",
       "      <td>Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ...</td>\n",
       "      <td>Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Executive</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Facilities Services</td>\n",
       "      <td>Sales</td>\n",
       "      <td>298d3508-76bb-4362-9ad4-f843fa3f99fa</td>\n",
       "      <td>2022-10-30T20:49:56Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>19</td>\n",
       "      <td>Visual Designer</td>\n",
       "      <td>US, NY, New York</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer...</td>\n",
       "      <td>Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>cad2f705-4b22-4110-bb06-b34a47c62a6d</td>\n",
       "      <td>2022-05-30T19:30:26Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>21</td>\n",
       "      <td>Marketing Assistant</td>\n",
       "      <td>US, TX, Austin</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ...</td>\n",
       "      <td>IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential...</td>\n",
       "      <td>Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Marketing</td>\n",
       "      <td>24c31ad9-95a9-479c-87c5-de6af06ddef6</td>\n",
       "      <td>2022-12-05T07:48:39Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  EVENT_ID                                    title           location  \\\n",
       "0       10  Customer Service Associate - Part Time     US, AZ, Phoenix   \n",
       "1       15               Account Executive - Sydney    AU, NSW, Sydney   \n",
       "2       16               VP of Sales - Vault Dragon  SG, 01, Singapore   \n",
       "3       19                          Visual Designer   US, NY, New York   \n",
       "4       21                      Marketing Assistant     US, TX, Austin   \n",
       "\n",
       "  department   salary_range  \\\n",
       "0        NaN            NaN   \n",
       "1      Sales            NaN   \n",
       "2      Sales  120000-150000   \n",
       "3        NaN            NaN   \n",
       "4        NaN            NaN   \n",
       "\n",
       "                                                                                                                                                                                           company_profile  \\\n",
       "0  Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr...   \n",
       "1  Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ...   \n",
       "2  Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ...   \n",
       "3  Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer...   \n",
       "4  IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ...   \n",
       "\n",
       "                                                                                                                                                                                               description  \\\n",
       "0  The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai...   \n",
       "1  Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th...   \n",
       "2  About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count...   \n",
       "3  Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ...   \n",
       "4  IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential...   \n",
       "\n",
       "                                                                                                                                                                                              requirements  \\\n",
       "0  Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre...   \n",
       "1  You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication...   \n",
       "2  Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ...   \n",
       "3                                                                                                                                                                                                      NaN   \n",
       "4  Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website...   \n",
       "\n",
       "                                                                                                                                                                                                  benefits  \\\n",
       "0                                                                                                                                                                                                      NaN   \n",
       "1  In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ...   \n",
       "2  Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa...   \n",
       "3                                                                                                                                                                                                      NaN   \n",
       "4                                                                                                                                                                                                      NaN   \n",
       "\n",
       "  telecommuting has_company_logo has_questions employment_type  \\\n",
       "0             0                1             0       Part-time   \n",
       "1             0                1             0       Full-time   \n",
       "2             0                1             1       Full-time   \n",
       "3             0                1             0             NaN   \n",
       "4             0                1             0             NaN   \n",
       "\n",
       "  required_experience         required_education             industry  \\\n",
       "0         Entry level  High School or equivalent   Financial Services   \n",
       "1           Associate          Bachelor's Degree             Internet   \n",
       "2           Executive          Bachelor's Degree  Facilities Services   \n",
       "3                 NaN                        NaN                  NaN   \n",
       "4                 NaN                        NaN                  NaN   \n",
       "\n",
       "           function                             ENTITY_ID  \\\n",
       "0  Customer Service  1743dd4b-f989-4227-8480-cbafa760b4de   \n",
       "1             Sales  d5a82588-fcff-495b-aeda-20a8de0737d0   \n",
       "2             Sales  298d3508-76bb-4362-9ad4-f843fa3f99fa   \n",
       "3               NaN  cad2f705-4b22-4110-bb06-b34a47c62a6d   \n",
       "4         Marketing  24c31ad9-95a9-479c-87c5-de6af06ddef6   \n",
       "\n",
       "        EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "0  2022-12-31T18:14:06Z        user  \n",
       "1  2022-06-20T15:25:47Z        user  \n",
       "2  2022-10-30T20:49:56Z        user  \n",
       "3  2022-05-30T19:30:26Z        user  \n",
       "4  2022-12-05T07:48:39Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3576, 20)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL\n",
       "0            0\n",
       "1            0\n",
       "2            0\n",
       "3            0\n",
       "4            0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    3389\n",
      "1     187\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.952531\n",
      "1    0.047469\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "vehicleloan\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>disbursed_amount</th>\n",
       "      <th>asset_cost</th>\n",
       "      <th>ltv</th>\n",
       "      <th>branch_id</th>\n",
       "      <th>supplier_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>current_pincode_id</th>\n",
       "      <th>date_of_birth</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>state_id</th>\n",
       "      <th>employee_code_id</th>\n",
       "      <th>mobileno_avl_flag</th>\n",
       "      <th>aadhar_flag</th>\n",
       "      <th>pan_flag</th>\n",
       "      <th>voterid_flag</th>\n",
       "      <th>driving_flag</th>\n",
       "      <th>passport_flag</th>\n",
       "      <th>perform_cns_score</th>\n",
       "      <th>perform_cns_score_description</th>\n",
       "      <th>pri_no_of_accts</th>\n",
       "      <th>pri_active_accts</th>\n",
       "      <th>pri_overdue_accts</th>\n",
       "      <th>pri_current_balance</th>\n",
       "      <th>pri_sanctioned_amount</th>\n",
       "      <th>pri_disbursed_amount</th>\n",
       "      <th>sec_no_of_accts</th>\n",
       "      <th>sec_active_accts</th>\n",
       "      <th>sec_overdue_accts</th>\n",
       "      <th>sec_current_balance</th>\n",
       "      <th>sec_sanctioned_amount</th>\n",
       "      <th>sec_disbursed_amount</th>\n",
       "      <th>primary_instal_amt</th>\n",
       "      <th>sec_instal_amt</th>\n",
       "      <th>new_accts_in_last_six_months</th>\n",
       "      <th>delinquent_accts_in_last_six_months</th>\n",
       "      <th>average_acct_age</th>\n",
       "      <th>credit_history_length</th>\n",
       "      <th>no_of_inquiries</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>8976</th>\n",
       "      <td>462711</td>\n",
       "      <td>33484</td>\n",
       "      <td>62644</td>\n",
       "      <td>55.23</td>\n",
       "      <td>67</td>\n",
       "      <td>22727</td>\n",
       "      <td>45</td>\n",
       "      <td>1511</td>\n",
       "      <td>16-06-1991</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1201</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>743</td>\n",
       "      <td>C-Very Low Risk</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>160423</td>\n",
       "      <td>230489</td>\n",
       "      <td>194538</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9149</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 7mon</td>\n",
       "      <td>1yrs 4mon</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>27b9d5e1-69de-47f2-a559-cfba34dffb5f</td>\n",
       "      <td>2022-09-20T06:58:09Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76007</th>\n",
       "      <td>558674</td>\n",
       "      <td>66882</td>\n",
       "      <td>81187</td>\n",
       "      <td>84.37</td>\n",
       "      <td>2</td>\n",
       "      <td>23508</td>\n",
       "      <td>86</td>\n",
       "      <td>1708</td>\n",
       "      <td>15-09-1994</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>4</td>\n",
       "      <td>1060</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1c58aced-df31-4170-8f85-e0dd95d1ff21</td>\n",
       "      <td>2022-08-25T18:27:59Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>77677</th>\n",
       "      <td>528251</td>\n",
       "      <td>59113</td>\n",
       "      <td>71757</td>\n",
       "      <td>84.87</td>\n",
       "      <td>48</td>\n",
       "      <td>21478</td>\n",
       "      <td>86</td>\n",
       "      <td>6322</td>\n",
       "      <td>01-01-1995</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>5</td>\n",
       "      <td>1189</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>738</td>\n",
       "      <td>C-Very Low Risk</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>45828</td>\n",
       "      <td>58582</td>\n",
       "      <td>58582</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4240</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 2mon</td>\n",
       "      <td>0yrs 4mon</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>fa383d19-de52-4a71-8222-77e328fcf387</td>\n",
       "      <td>2022-10-13T07:51:51Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>209438</th>\n",
       "      <td>633950</td>\n",
       "      <td>56059</td>\n",
       "      <td>71307</td>\n",
       "      <td>81.34</td>\n",
       "      <td>146</td>\n",
       "      <td>18317</td>\n",
       "      <td>86</td>\n",
       "      <td>2989</td>\n",
       "      <td>01-01-1971</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>14</td>\n",
       "      <td>2964</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37</td>\n",
       "      <td>2022-08-09T09:25:01Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143261</th>\n",
       "      <td>476747</td>\n",
       "      <td>56759</td>\n",
       "      <td>67100</td>\n",
       "      <td>85.69</td>\n",
       "      <td>136</td>\n",
       "      <td>17783</td>\n",
       "      <td>86</td>\n",
       "      <td>3793</td>\n",
       "      <td>03-12-1975</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>8</td>\n",
       "      <td>1295</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>e00bb721-ce37-4d32-99e8-84f8a46cf82f</td>\n",
       "      <td>2022-06-27T20:32:23Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       EVENT_ID disbursed_amount asset_cost    ltv branch_id supplier_id  \\\n",
       "8976     462711            33484      62644  55.23        67       22727   \n",
       "76007    558674            66882      81187  84.37         2       23508   \n",
       "77677    528251            59113      71757  84.87        48       21478   \n",
       "209438   633950            56059      71307  81.34       146       18317   \n",
       "143261   476747            56759      67100  85.69       136       17783   \n",
       "\n",
       "       manufacturer_id current_pincode_id date_of_birth employment_type  \\\n",
       "8976                45               1511    16-06-1991        Salaried   \n",
       "76007               86               1708    15-09-1994        Salaried   \n",
       "77677               86               6322    01-01-1995   Self employed   \n",
       "209438              86               2989    01-01-1971        Salaried   \n",
       "143261              86               3793    03-12-1975   Self employed   \n",
       "\n",
       "       state_id employee_code_id mobileno_avl_flag aadhar_flag pan_flag  \\\n",
       "8976          6             1201                 1           1        0   \n",
       "76007         4             1060                 1           1        0   \n",
       "77677         5             1189                 1           1        0   \n",
       "209438       14             2964                 1           1        0   \n",
       "143261        8             1295                 1           1        0   \n",
       "\n",
       "       voterid_flag driving_flag passport_flag perform_cns_score  \\\n",
       "8976              0            0             0               743   \n",
       "76007             0            0             0                 0   \n",
       "77677             0            0             0               738   \n",
       "209438            0            0             0                 0   \n",
       "143261            0            0             0                 0   \n",
       "\n",
       "       perform_cns_score_description pri_no_of_accts pri_active_accts  \\\n",
       "8976                 C-Very Low Risk               9                5   \n",
       "76007    No Bureau History Available               0                0   \n",
       "77677                C-Very Low Risk               3                3   \n",
       "209438   No Bureau History Available               0                0   \n",
       "143261   No Bureau History Available               0                0   \n",
       "\n",
       "       pri_overdue_accts pri_current_balance pri_sanctioned_amount  \\\n",
       "8976                   0              160423                230489   \n",
       "76007                  0                   0                     0   \n",
       "77677                  0               45828                 58582   \n",
       "209438                 0                   0                     0   \n",
       "143261                 0                   0                     0   \n",
       "\n",
       "       pri_disbursed_amount sec_no_of_accts sec_active_accts  \\\n",
       "8976                 194538               0                0   \n",
       "76007                     0               0                0   \n",
       "77677                 58582               0                0   \n",
       "209438                    0               0                0   \n",
       "143261                    0               0                0   \n",
       "\n",
       "       sec_overdue_accts sec_current_balance sec_sanctioned_amount  \\\n",
       "8976                   0                   0                     0   \n",
       "76007                  0                   0                     0   \n",
       "77677                  0                   0                     0   \n",
       "209438                 0                   0                     0   \n",
       "143261                 0                   0                     0   \n",
       "\n",
       "       sec_disbursed_amount primary_instal_amt sec_instal_amt  \\\n",
       "8976                      0               9149              0   \n",
       "76007                     0                  0              0   \n",
       "77677                     0               4240              0   \n",
       "209438                    0                  0              0   \n",
       "143261                    0                  0              0   \n",
       "\n",
       "       new_accts_in_last_six_months delinquent_accts_in_last_six_months  \\\n",
       "8976                              4                                   0   \n",
       "76007                             0                                   0   \n",
       "77677                             3                                   0   \n",
       "209438                            0                                   0   \n",
       "143261                            0                                   0   \n",
       "\n",
       "       average_acct_age credit_history_length no_of_inquiries  EVENT_LABEL  \\\n",
       "8976          0yrs 7mon             1yrs 4mon               1            0   \n",
       "76007         0yrs 0mon             0yrs 0mon               0            0   \n",
       "77677         0yrs 2mon             0yrs 4mon               0            1   \n",
       "209438        0yrs 0mon             0yrs 0mon               0            1   \n",
       "143261        0yrs 0mon             0yrs 0mon               0            0   \n",
       "\n",
       "                                   ENTITY_ID       EVENT_TIMESTAMP  \\\n",
       "8976    27b9d5e1-69de-47f2-a559-cfba34dffb5f  2022-09-20T06:58:09Z   \n",
       "76007   1c58aced-df31-4170-8f85-e0dd95d1ff21  2022-08-25T18:27:59Z   \n",
       "77677   fa383d19-de52-4a71-8222-77e328fcf387  2022-10-13T07:51:51Z   \n",
       "209438  6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37  2022-08-09T09:25:01Z   \n",
       "143261  e00bb721-ce37-4d32-99e8-84f8a46cf82f  2022-06-27T20:32:23Z   \n",
       "\n",
       "             LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "8976    2023-05-05T08:46:09Z        user  \n",
       "76007   2023-05-05T08:46:09Z        user  \n",
       "77677   2023-05-05T08:46:09Z        user  \n",
       "209438  2023-05-05T08:46:09Z        user  \n",
       "143261  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "44\n",
      "(186523, 44)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>disbursed_amount</th>\n",
       "      <th>asset_cost</th>\n",
       "      <th>ltv</th>\n",
       "      <th>branch_id</th>\n",
       "      <th>supplier_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>current_pincode_id</th>\n",
       "      <th>date_of_birth</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>state_id</th>\n",
       "      <th>employee_code_id</th>\n",
       "      <th>mobileno_avl_flag</th>\n",
       "      <th>aadhar_flag</th>\n",
       "      <th>pan_flag</th>\n",
       "      <th>voterid_flag</th>\n",
       "      <th>driving_flag</th>\n",
       "      <th>passport_flag</th>\n",
       "      <th>perform_cns_score</th>\n",
       "      <th>perform_cns_score_description</th>\n",
       "      <th>pri_no_of_accts</th>\n",
       "      <th>pri_active_accts</th>\n",
       "      <th>pri_overdue_accts</th>\n",
       "      <th>pri_current_balance</th>\n",
       "      <th>pri_sanctioned_amount</th>\n",
       "      <th>pri_disbursed_amount</th>\n",
       "      <th>sec_no_of_accts</th>\n",
       "      <th>sec_active_accts</th>\n",
       "      <th>sec_overdue_accts</th>\n",
       "      <th>sec_current_balance</th>\n",
       "      <th>sec_sanctioned_amount</th>\n",
       "      <th>sec_disbursed_amount</th>\n",
       "      <th>primary_instal_amt</th>\n",
       "      <th>sec_instal_amt</th>\n",
       "      <th>new_accts_in_last_six_months</th>\n",
       "      <th>delinquent_accts_in_last_six_months</th>\n",
       "      <th>average_acct_age</th>\n",
       "      <th>credit_history_length</th>\n",
       "      <th>no_of_inquiries</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>420825</td>\n",
       "      <td>50578</td>\n",
       "      <td>58400</td>\n",
       "      <td>89.55</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1441</td>\n",
       "      <td>01-01-1984</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>03cf53e2-5c0b-4809-8333-04560101987b</td>\n",
       "      <td>2022-12-29T10:25:40Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>518279</td>\n",
       "      <td>54513</td>\n",
       "      <td>61900</td>\n",
       "      <td>89.66</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1501</td>\n",
       "      <td>08-09-1990</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>825</td>\n",
       "      <td>A-Very Low Risk</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1347</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1yrs 9mon</td>\n",
       "      <td>2yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>03166b12-ee18-4144-aa73-10a3d2ac999a</td>\n",
       "      <td>2022-08-07T20:17:18Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>510278</td>\n",
       "      <td>43894</td>\n",
       "      <td>61900</td>\n",
       "      <td>71.89</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1501</td>\n",
       "      <td>04-10-1989</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "      <td>Not Scored: Not Enough Info available on the customer</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>72879</td>\n",
       "      <td>74500</td>\n",
       "      <td>74500</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 2mon</td>\n",
       "      <td>0yrs 2mon</td>\n",
       "      <td>0</td>\n",
       "      <td>ff0fc8f9-c524-45cc-99b4-139dd726d7cd</td>\n",
       "      <td>2022-11-03T09:35:54Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>510980</td>\n",
       "      <td>52603</td>\n",
       "      <td>61300</td>\n",
       "      <td>86.95</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1492</td>\n",
       "      <td>01-06-1968</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>818</td>\n",
       "      <td>A-Very Low Risk</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2608</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1yrs 7mon</td>\n",
       "      <td>1yrs 7mon</td>\n",
       "      <td>0</td>\n",
       "      <td>8955bac7-5812-4e5f-b3ae-22738ee5e701</td>\n",
       "      <td>2023-02-19T06:55:03Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>513916</td>\n",
       "      <td>57713</td>\n",
       "      <td>65750</td>\n",
       "      <td>89.28</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1440</td>\n",
       "      <td>01-06-1976</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>300</td>\n",
       "      <td>M-Very High Risk</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>29069</td>\n",
       "      <td>1067200</td>\n",
       "      <td>1067200</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>47100</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2yrs 6mon</td>\n",
       "      <td>5yrs 6mon</td>\n",
       "      <td>0</td>\n",
       "      <td>a8154baa-1407-493a-bbc2-4bc1fd30d1f9</td>\n",
       "      <td>2022-08-14T11:20:39Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  EVENT_ID disbursed_amount asset_cost    ltv branch_id supplier_id  \\\n",
       "0   420825            50578      58400  89.55        67       22807   \n",
       "1   518279            54513      61900  89.66        67       22807   \n",
       "2   510278            43894      61900  71.89        67       22807   \n",
       "3   510980            52603      61300  86.95        67       22807   \n",
       "4   513916            57713      65750  89.28        67       22807   \n",
       "\n",
       "  manufacturer_id current_pincode_id date_of_birth employment_type state_id  \\\n",
       "0              45               1441    01-01-1984        Salaried        6   \n",
       "1              45               1501    08-09-1990   Self employed        6   \n",
       "2              45               1501    04-10-1989        Salaried        6   \n",
       "3              45               1492    01-06-1968        Salaried        6   \n",
       "4              45               1440    01-06-1976   Self employed        6   \n",
       "\n",
       "  employee_code_id mobileno_avl_flag aadhar_flag pan_flag voterid_flag  \\\n",
       "0             1998                 1           1        0            0   \n",
       "1             1998                 1           1        0            0   \n",
       "2             1998                 1           1        0            0   \n",
       "3             1998                 1           0        0            1   \n",
       "4             1998                 1           1        0            0   \n",
       "\n",
       "  driving_flag passport_flag perform_cns_score  \\\n",
       "0            0             0                 0   \n",
       "1            0             0               825   \n",
       "2            0             0                17   \n",
       "3            0             0               818   \n",
       "4            0             0               300   \n",
       "\n",
       "                           perform_cns_score_description pri_no_of_accts  \\\n",
       "0                            No Bureau History Available               0   \n",
       "1                                        A-Very Low Risk               2   \n",
       "2  Not Scored: Not Enough Info available on the customer               1   \n",
       "3                                        A-Very Low Risk               1   \n",
       "4                                       M-Very High Risk               6   \n",
       "\n",
       "  pri_active_accts pri_overdue_accts pri_current_balance  \\\n",
       "0                0                 0                   0   \n",
       "1                0                 0                   0   \n",
       "2                1                 0               72879   \n",
       "3                0                 0                   0   \n",
       "4                4                 2               29069   \n",
       "\n",
       "  pri_sanctioned_amount pri_disbursed_amount sec_no_of_accts sec_active_accts  \\\n",
       "0                     0                    0               0                0   \n",
       "1                     0                    0               0                0   \n",
       "2                 74500                74500               0                0   \n",
       "3                     0                    0               0                0   \n",
       "4               1067200              1067200               0                0   \n",
       "\n",
       "  sec_overdue_accts sec_current_balance sec_sanctioned_amount  \\\n",
       "0                 0                   0                     0   \n",
       "1                 0                   0                     0   \n",
       "2                 0                   0                     0   \n",
       "3                 0                   0                     0   \n",
       "4                 0                   0                     0   \n",
       "\n",
       "  sec_disbursed_amount primary_instal_amt sec_instal_amt  \\\n",
       "0                    0                  0              0   \n",
       "1                    0               1347              0   \n",
       "2                    0                  0              0   \n",
       "3                    0               2608              0   \n",
       "4                    0              47100              0   \n",
       "\n",
       "  new_accts_in_last_six_months delinquent_accts_in_last_six_months  \\\n",
       "0                            0                                   0   \n",
       "1                            0                                   0   \n",
       "2                            0                                   0   \n",
       "3                            0                                   0   \n",
       "4                            1                                   1   \n",
       "\n",
       "  average_acct_age credit_history_length no_of_inquiries  \\\n",
       "0        0yrs 0mon             0yrs 0mon               0   \n",
       "1        1yrs 9mon             2yrs 0mon               0   \n",
       "2        0yrs 2mon             0yrs 2mon               0   \n",
       "3        1yrs 7mon             1yrs 7mon               0   \n",
       "4        2yrs 6mon             5yrs 6mon               0   \n",
       "\n",
       "                              ENTITY_ID       EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "0  03cf53e2-5c0b-4809-8333-04560101987b  2022-12-29T10:25:40Z        user  \n",
       "1  03166b12-ee18-4144-aa73-10a3d2ac999a  2022-08-07T20:17:18Z        user  \n",
       "2  ff0fc8f9-c524-45cc-99b4-139dd726d7cd  2022-11-03T09:35:54Z        user  \n",
       "3  8955bac7-5812-4e5f-b3ae-22738ee5e701  2023-02-19T06:55:03Z        user  \n",
       "4  a8154baa-1407-493a-bbc2-4bc1fd30d1f9  2022-08-14T11:20:39Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(46631, 42)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL\n",
       "0            0\n",
       "1            0\n",
       "2            0\n",
       "3            0\n",
       "4            0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    36323\n",
      "1    10308\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.783925\n",
      "1    0.216075\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "malurl\n",
      "Train set: \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>dummy_cat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>167113</th>\n",
       "      <td>apolloduck.co.za/</td>\n",
       "      <td>0</td>\n",
       "      <td>d16773dd-0077-4129-a39d-f935464bd07f</td>\n",
       "      <td>5e694594-fcfa-418e-8417-21c5e99b8d8a</td>\n",
       "      <td>2022-05-15T15:36:37Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>87edb1a6-7936-4afa-b7be-4c35b7f1a5c6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>387680</th>\n",
       "      <td>acronyms.thefreedictionary.com/WDOM</td>\n",
       "      <td>0</td>\n",
       "      <td>b40b1f9e-9218-4a65-8b8e-870d45feb368</td>\n",
       "      <td>8d1aea20-97bb-46c4-bf56-3dc935f5c116</td>\n",
       "      <td>2022-06-28T06:32:21Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>864a0704-ab05-49c3-8a0c-5b0b23b3eeef</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>528900</th>\n",
       "      <td>https://nepan.org.np/Alibaba/Alibaba.com/Login.htm</td>\n",
       "      <td>1</td>\n",
       "      <td>86c52fda-2f6f-41ee-aa15-a7b682138cc9</td>\n",
       "      <td>fce90a90-3ce2-475c-ac7d-a0d6c8fa784a</td>\n",
       "      <td>2022-06-11T21:40:20Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>7ef071fc-a143-4d52-bd88-2a21f2b16c56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>251286</th>\n",
       "      <td>soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes</td>\n",
       "      <td>0</td>\n",
       "      <td>447529b9-923c-43e0-afed-c570e037f1aa</td>\n",
       "      <td>c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546</td>\n",
       "      <td>2022-08-15T12:11:14Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>2709ea1a-f5a7-4ecc-8dbe-767910778226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>433650</th>\n",
       "      <td>ottawakiosk.com/hill_cam.html</td>\n",
       "      <td>0</td>\n",
       "      <td>976080b6-500f-4de3-95c4-a4c2679e672b</td>\n",
       "      <td>21497a05-52ce-4a25-a4d4-361b8298dbc1</td>\n",
       "      <td>2022-08-19T15:47:51Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>752bff63-ad3b-4845-b975-7f6f7302402c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                        url  \\\n",
       "167113                                                    apolloduck.co.za/   \n",
       "387680                                  acronyms.thefreedictionary.com/WDOM   \n",
       "528900                   https://nepan.org.np/Alibaba/Alibaba.com/Login.htm   \n",
       "251286  soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes   \n",
       "433650                                        ottawakiosk.com/hill_cam.html   \n",
       "\n",
       "        EVENT_LABEL                              EVENT_ID  \\\n",
       "167113            0  d16773dd-0077-4129-a39d-f935464bd07f   \n",
       "387680            0  b40b1f9e-9218-4a65-8b8e-870d45feb368   \n",
       "528900            1  86c52fda-2f6f-41ee-aa15-a7b682138cc9   \n",
       "251286            0  447529b9-923c-43e0-afed-c570e037f1aa   \n",
       "433650            0  976080b6-500f-4de3-95c4-a4c2679e672b   \n",
       "\n",
       "                                   ENTITY_ID       EVENT_TIMESTAMP  \\\n",
       "167113  5e694594-fcfa-418e-8417-21c5e99b8d8a  2022-05-15T15:36:37Z   \n",
       "387680  8d1aea20-97bb-46c4-bf56-3dc935f5c116  2022-06-28T06:32:21Z   \n",
       "528900  fce90a90-3ce2-475c-ac7d-a0d6c8fa784a  2022-06-11T21:40:20Z   \n",
       "251286  c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546  2022-08-15T12:11:14Z   \n",
       "433650  21497a05-52ce-4a25-a4d4-361b8298dbc1  2022-08-19T15:47:51Z   \n",
       "\n",
       "             LABEL_TIMESTAMP ENTITY_TYPE                             dummy_cat  \n",
       "167113  2023-05-05T08:46:09Z        user  87edb1a6-7936-4afa-b7be-4c35b7f1a5c6  \n",
       "387680  2023-05-05T08:46:09Z        user  864a0704-ab05-49c3-8a0c-5b0b23b3eeef  \n",
       "528900  2023-05-05T08:46:09Z        user  7ef071fc-a143-4d52-bd88-2a21f2b16c56  \n",
       "251286  2023-05-05T08:46:09Z        user  2709ea1a-f5a7-4ecc-8dbe-767910778226  \n",
       "433650  2023-05-05T08:46:09Z        user  752bff63-ad3b-4845-b975-7f6f7302402c  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8\n",
      "(586072, 8)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>dummy_cat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html</td>\n",
       "      <td>b4233390-3167-401d-a85f-27331078ff27</td>\n",
       "      <td>3fd82c9f-b26a-44dc-ac26-4a635690938c</td>\n",
       "      <td>2022-11-20T12:29:18Z</td>\n",
       "      <td>user</td>\n",
       "      <td>f45a2001-81b6-4b29-bba9-e376cc9a4ca9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>cyndislist.com/us/pa/counties</td>\n",
       "      <td>77d73435-251f-43fa-a82c-cc6ab4dbce6b</td>\n",
       "      <td>7ac20b7a-ee66-46ce-83da-703e095e9c87</td>\n",
       "      <td>2022-12-26T07:01:46Z</td>\n",
       "      <td>user</td>\n",
       "      <td>a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ</td>\n",
       "      <td>87a47093-0039-445f-8002-87b6af3e709d</td>\n",
       "      <td>eaea621e-895d-43cf-8bbb-93acac029c47</td>\n",
       "      <td>2022-06-25T00:29:41Z</td>\n",
       "      <td>user</td>\n",
       "      <td>20e00a79-d5fc-49d1-b563-173e69f09434</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery</td>\n",
       "      <td>3143022e-ce02-441b-8ad0-5ebbf3c1c829</td>\n",
       "      <td>ba97f126-6159-4655-9c11-807c99807059</td>\n",
       "      <td>2023-03-07T14:27:10Z</td>\n",
       "      <td>user</td>\n",
       "      <td>5398bd49-ce09-4438-bfc3-24fce419c612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>kitsapsun.com/photos/2011/feb/25/177999/</td>\n",
       "      <td>8885745c-4494-4f04-92a0-bb57006fe7aa</td>\n",
       "      <td>b51cdf46-1467-45f0-9c9c-62233be01d0e</td>\n",
       "      <td>2022-12-07T01:31:11Z</td>\n",
       "      <td>user</td>\n",
       "      <td>0ac04255-86df-47bc-8990-557f4c65fe0d</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                      url  \\\n",
       "0  http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html   \n",
       "1                                                                                           cyndislist.com/us/pa/counties   \n",
       "2                                 https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ   \n",
       "3                     articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery   \n",
       "4                                                                                kitsapsun.com/photos/2011/feb/25/177999/   \n",
       "\n",
       "                               EVENT_ID                             ENTITY_ID  \\\n",
       "0  b4233390-3167-401d-a85f-27331078ff27  3fd82c9f-b26a-44dc-ac26-4a635690938c   \n",
       "1  77d73435-251f-43fa-a82c-cc6ab4dbce6b  7ac20b7a-ee66-46ce-83da-703e095e9c87   \n",
       "2  87a47093-0039-445f-8002-87b6af3e709d  eaea621e-895d-43cf-8bbb-93acac029c47   \n",
       "3  3143022e-ce02-441b-8ad0-5ebbf3c1c829  ba97f126-6159-4655-9c11-807c99807059   \n",
       "4  8885745c-4494-4f04-92a0-bb57006fe7aa  b51cdf46-1467-45f0-9c9c-62233be01d0e   \n",
       "\n",
       "        EVENT_TIMESTAMP ENTITY_TYPE                             dummy_cat  \n",
       "0  2022-11-20T12:29:18Z        user  f45a2001-81b6-4b29-bba9-e376cc9a4ca9  \n",
       "1  2022-12-26T07:01:46Z        user  a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2  \n",
       "2  2022-06-25T00:29:41Z        user  20e00a79-d5fc-49d1-b563-173e69f09434  \n",
       "3  2023-03-07T14:27:10Z        user  5398bd49-ce09-4438-bfc3-24fce419c612  \n",
       "4  2022-12-07T01:31:11Z        user  0ac04255-86df-47bc-8990-557f4c65fe0d  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(65119, 6)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>b4233390-3167-401d-a85f-27331078ff27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>77d73435-251f-43fa-a82c-cc6ab4dbce6b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>87a47093-0039-445f-8002-87b6af3e709d</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>3143022e-ce02-441b-8ad0-5ebbf3c1c829</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>8885745c-4494-4f04-92a0-bb57006fe7aa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL                              EVENT_ID\n",
       "0            0  b4233390-3167-401d-a85f-27331078ff27\n",
       "1            0  77d73435-251f-43fa-a82c-cc6ab4dbce6b\n",
       "2            1  87a47093-0039-445f-8002-87b6af3e709d\n",
       "3            0  3143022e-ce02-441b-8ad0-5ebbf3c1c829\n",
       "4            0  8885745c-4494-4f04-92a0-bb57006fe7aa"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    42695\n",
      "1    22424\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.657612\n",
      "1    0.342388\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ieeecis\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>dist1</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>v62</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v82</th>\n",
       "      <th>v91</th>\n",
       "      <th>v127</th>\n",
       "      <th>v130</th>\n",
       "      <th>v139</th>\n",
       "      <th>v160</th>\n",
       "      <th>v165</th>\n",
       "      <th>v187</th>\n",
       "      <th>v203</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v221</th>\n",
       "      <th>v234</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v283</th>\n",
       "      <th>v285</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TransactionID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2987000.0</th>\n",
       "      <td>0</td>\n",
       "      <td>68.5</td>\n",
       "      <td>W</td>\n",
       "      <td>13926.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>150.0</td>\n",
       "      <td>142.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>315.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>117.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7</td>\n",
       "      <td>13926.0_315.0_-13.0</td>\n",
       "      <td>2021-01-02T00:00:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987001.0</th>\n",
       "      <td>0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>W</td>\n",
       "      <td>2755.0</td>\n",
       "      <td>404.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>325.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9aa1d670-7446-4979-8c09-87f02311d2ca</td>\n",
       "      <td>2755.0_325.0_1.0</td>\n",
       "      <td>2021-01-02T00:00:01Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987002.0</th>\n",
       "      <td>0</td>\n",
       "      <td>59.0</td>\n",
       "      <td>W</td>\n",
       "      <td>4663.0</td>\n",
       "      <td>490.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>debit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>287.0</td>\n",
       "      <td>outlook.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3</td>\n",
       "      <td>4663.0_330.0_1.0</td>\n",
       "      <td>2021-01-02T00:01:09Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987003.0</th>\n",
       "      <td>0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>W</td>\n",
       "      <td>18132.0</td>\n",
       "      <td>567.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>117.0</td>\n",
       "      <td>debit</td>\n",
       "      <td>476.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1758.0</td>\n",
       "      <td>354.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>d3e3803c-b1a3-4dfd-841d-30b8d2611364</td>\n",
       "      <td>18132.0_476.0_-111.0</td>\n",
       "      <td>2021-01-02T00:01:39Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987004.0</th>\n",
       "      <td>0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>H</td>\n",
       "      <td>4497.0</td>\n",
       "      <td>514.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>420.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>169690.796875</td>\n",
       "      <td>5155.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>70787.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>166.0</td>\n",
       "      <td>542.0</td>\n",
       "      <td>144.0</td>\n",
       "      <td>mobile</td>\n",
       "      <td>SAMSUNG SM-G892A Build/NRD90M</td>\n",
       "      <td>2c013afb-7779-45db-a330-a5808d531372</td>\n",
       "      <td>4497.0_420.0_1.0</td>\n",
       "      <td>2021-01-02T00:01:46Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               EVENT_LABEL  transactionamt productcd    card1  card2  card3  \\\n",
       "TransactionID                                                                 \n",
       "2987000.0                0            68.5         W  13926.0    NaN  150.0   \n",
       "2987001.0                0            29.0         W   2755.0  404.0  150.0   \n",
       "2987002.0                0            59.0         W   4663.0  490.0  150.0   \n",
       "2987003.0                0            50.0         W  18132.0  567.0  150.0   \n",
       "2987004.0                0            50.0         H   4497.0  514.0  150.0   \n",
       "\n",
       "               card5   card6  addr1  dist1 p_emaildomain r_emaildomain   c1  \\\n",
       "TransactionID                                                                 \n",
       "2987000.0      142.0  credit  315.0   19.0           NaN           NaN  1.0   \n",
       "2987001.0      102.0  credit  325.0    NaN     gmail.com           NaN  1.0   \n",
       "2987002.0      166.0   debit  330.0  287.0   outlook.com           NaN  1.0   \n",
       "2987003.0      117.0   debit  476.0    NaN     yahoo.com           NaN  2.0   \n",
       "2987004.0      102.0  credit  420.0    NaN     gmail.com           NaN  1.0   \n",
       "\n",
       "                c2   c4   c5   c6   c7   c8   c9  c10  c11  c12   c13  c14  \\\n",
       "TransactionID                                                                \n",
       "2987000.0      1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  2.0  0.0   1.0  1.0   \n",
       "2987001.0      1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0   1.0  1.0   \n",
       "2987002.0      1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0   1.0  1.0   \n",
       "2987003.0      5.0  0.0  0.0  4.0  0.0  0.0  1.0  0.0  1.0  0.0  25.0  1.0   \n",
       "2987004.0      1.0  0.0  0.0  1.0  0.0  1.0  0.0  1.0  1.0  0.0   1.0  1.0   \n",
       "\n",
       "               v62  v70  v76  v78  v82  v91    v127   v130  v139  \\\n",
       "TransactionID                                                      \n",
       "2987000.0      1.0  0.0  1.0  1.0  0.0  0.0   117.0    0.0   NaN   \n",
       "2987001.0      1.0  0.0  0.0  1.0  1.0  0.0     0.0    0.0   NaN   \n",
       "2987002.0      1.0  0.0  1.0  1.0  1.0  0.0     0.0    0.0   NaN   \n",
       "2987003.0      1.0  0.0  1.0  1.0  1.0  0.0  1758.0  354.0   NaN   \n",
       "2987004.0      NaN  NaN  NaN  NaN  NaN  NaN     0.0    0.0   0.0   \n",
       "\n",
       "                        v160    v165  v187  v203  v207  v209  v210  v221  \\\n",
       "TransactionID                                                              \n",
       "2987000.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987001.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987002.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987003.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987004.0      169690.796875  5155.0   1.0   0.0   0.0   0.0   0.0   1.0   \n",
       "\n",
       "               v234  v257  v258  v261  v264  v266  v267  v271  v274  v277  \\\n",
       "TransactionID                                                               \n",
       "2987000.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987001.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987002.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987003.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987004.0       0.0   1.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   \n",
       "\n",
       "               v283  v285  v289  v291  v294  id_01    id_02  id_05  id_06  \\\n",
       "TransactionID                                                               \n",
       "2987000.0       1.0   0.0   0.0   1.0   1.0    NaN      NaN    NaN    NaN   \n",
       "2987001.0       1.0   0.0   0.0   1.0   0.0    NaN      NaN    NaN    NaN   \n",
       "2987002.0       1.0   0.0   0.0   1.0   0.0    NaN      NaN    NaN    NaN   \n",
       "2987003.0       0.0  10.0   0.0   1.0  38.0    NaN      NaN    NaN    NaN   \n",
       "2987004.0       1.0   0.0   0.0   1.0   0.0    0.0  70787.0    NaN    NaN   \n",
       "\n",
       "               id_09  id_13  id_17  id_19  id_20 devicetype  \\\n",
       "TransactionID                                                 \n",
       "2987000.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987001.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987002.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987003.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987004.0        NaN    NaN  166.0  542.0  144.0     mobile   \n",
       "\n",
       "                                  deviceinfo  \\\n",
       "TransactionID                                  \n",
       "2987000.0                                NaN   \n",
       "2987001.0                                NaN   \n",
       "2987002.0                                NaN   \n",
       "2987003.0                                NaN   \n",
       "2987004.0      SAMSUNG SM-G892A Build/NRD90M   \n",
       "\n",
       "                                           EVENT_ID             ENTITY_ID  \\\n",
       "TransactionID                                                               \n",
       "2987000.0      c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7   13926.0_315.0_-13.0   \n",
       "2987001.0      9aa1d670-7446-4979-8c09-87f02311d2ca      2755.0_325.0_1.0   \n",
       "2987002.0      4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3      4663.0_330.0_1.0   \n",
       "2987003.0      d3e3803c-b1a3-4dfd-841d-30b8d2611364  18132.0_476.0_-111.0   \n",
       "2987004.0      2c013afb-7779-45db-a330-a5808d531372      4497.0_420.0_1.0   \n",
       "\n",
       "                    EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "TransactionID                                                          \n",
       "2987000.0      2021-01-02T00:00:00Z  2023-05-05T08:46:09Z        user  \n",
       "2987001.0      2021-01-02T00:00:01Z  2023-05-05T08:46:09Z        user  \n",
       "2987002.0      2021-01-02T00:01:09Z  2023-05-05T08:46:09Z        user  \n",
       "2987003.0      2021-01-02T00:01:39Z  2023-05-05T08:46:09Z        user  \n",
       "2987004.0      2021-01-02T00:01:46Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "73\n",
      "(561013, 73)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>dist1</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>v62</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v82</th>\n",
       "      <th>v91</th>\n",
       "      <th>v127</th>\n",
       "      <th>v130</th>\n",
       "      <th>v139</th>\n",
       "      <th>v160</th>\n",
       "      <th>v165</th>\n",
       "      <th>v187</th>\n",
       "      <th>v203</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v221</th>\n",
       "      <th>v234</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v283</th>\n",
       "      <th>v285</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TransactionID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3548013.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109411.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66104.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103183.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>926.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>1411.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>569c4257-3d62-466d-a806-e3b456b2b372</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:11:15Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548014.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109536.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66229.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103308.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>927.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>693.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>e951afe6-b895-42b8-adff-df0f812e9ee8</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:11:29Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548015.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109661.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66354.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103433.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>928.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>1116.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>cd69e301-8c15-42b3-9839-cc4c8b9d89db</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:11:45Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548016.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109786.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66479.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103558.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>929.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>1589.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>71431bc1-19ec-49b6-a00f-4e8c7d121b02</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:12:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548017.0</th>\n",
       "      <td>31.950001</td>\n",
       "      <td>W</td>\n",
       "      <td>9500.0</td>\n",
       "      <td>321.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>226.0</td>\n",
       "      <td>debit</td>\n",
       "      <td>204.0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>27.950001</td>\n",
       "      <td>27.950001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>de297b4c-d372-4fd3-8c66-ab6ff0c19e16</td>\n",
       "      <td>9500.0_204.0_150.0</td>\n",
       "      <td>2021-06-21T23:12:11Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               transactionamt productcd    card1  card2  card3  card5   card6  \\\n",
       "TransactionID                                                                   \n",
       "3548013.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548014.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548015.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548016.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548017.0           31.950001         W   9500.0  321.0  150.0  226.0   debit   \n",
       "\n",
       "               addr1  dist1 p_emaildomain r_emaildomain   c1   c2   c4   c5  \\\n",
       "TransactionID                                                                 \n",
       "3548013.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548014.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548015.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548016.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548017.0      204.0   74.0           NaN           NaN  3.0  3.0  0.0  1.0   \n",
       "\n",
       "                c6   c7   c8   c9  c10  c11  c12   c13  c14  v62  v70  v76  \\\n",
       "TransactionID                                                                \n",
       "3548013.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548014.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548015.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548016.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548017.0      1.0  0.0  0.0  1.0  0.0  1.0  0.0   6.0  3.0  1.0  1.0  1.0   \n",
       "\n",
       "               v78  v82  v91           v127         v130  v139    v160  \\\n",
       "TransactionID                                                            \n",
       "3548013.0      NaN  NaN  NaN  109411.000000  2301.000000   0.0  2401.0   \n",
       "3548014.0      NaN  NaN  NaN  109536.000000  2301.000000   0.0  2401.0   \n",
       "3548015.0      NaN  NaN  NaN  109661.000000  2301.000000   0.0  2401.0   \n",
       "3548016.0      NaN  NaN  NaN  109786.000000  2301.000000   0.0  2401.0   \n",
       "3548017.0      2.0  1.0  1.0      27.950001    27.950001   NaN     NaN   \n",
       "\n",
       "                  v165  v187      v203   v207    v209   v210  v221  v234  \\\n",
       "TransactionID                                                              \n",
       "3548013.0      66104.0   1.0  103183.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548014.0      66229.0   1.0  103308.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548015.0      66354.0   1.0  103433.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548016.0      66479.0   1.0  103558.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548017.0          NaN   NaN       NaN    NaN     NaN    NaN   NaN   NaN   \n",
       "\n",
       "               v257  v258  v261  v264  v266  v267  v271  v274  v277  v283  \\\n",
       "TransactionID                                                               \n",
       "3548013.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548014.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548015.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548016.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548017.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   1.0   \n",
       "\n",
       "               v285  v289  v291   v294  id_01   id_02  id_05  id_06  id_09  \\\n",
       "TransactionID                                                                \n",
       "3548013.0      26.0   1.0   2.0  926.0  -10.0  1411.0    6.0    0.0    0.0   \n",
       "3548014.0      26.0   1.0   2.0  927.0  -10.0   693.0    6.0    0.0    0.0   \n",
       "3548015.0      26.0   1.0   2.0  928.0  -10.0  1116.0    6.0    0.0    0.0   \n",
       "3548016.0      26.0   1.0   2.0  929.0  -10.0  1589.0    6.0    0.0    0.0   \n",
       "3548017.0       1.0   1.0   1.0    0.0    NaN     NaN    NaN    NaN    NaN   \n",
       "\n",
       "               id_13  id_17  id_19  id_20 devicetype deviceinfo  \\\n",
       "TransactionID                                                     \n",
       "3548013.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548014.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548015.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548016.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548017.0        NaN    NaN    NaN    NaN        NaN        NaN   \n",
       "\n",
       "                                           EVENT_ID            ENTITY_ID  \\\n",
       "TransactionID                                                              \n",
       "3548013.0      569c4257-3d62-466d-a806-e3b456b2b372  15775.0_330.0_129.0   \n",
       "3548014.0      e951afe6-b895-42b8-adff-df0f812e9ee8  15775.0_330.0_129.0   \n",
       "3548015.0      cd69e301-8c15-42b3-9839-cc4c8b9d89db  15775.0_330.0_129.0   \n",
       "3548016.0      71431bc1-19ec-49b6-a00f-4e8c7d121b02  15775.0_330.0_129.0   \n",
       "3548017.0      de297b4c-d372-4fd3-8c66-ab6ff0c19e16   9500.0_204.0_150.0   \n",
       "\n",
       "                    EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "TransactionID                                    \n",
       "3548013.0      2021-06-21T23:11:15Z        user  \n",
       "3548014.0      2021-06-21T23:11:29Z        user  \n",
       "3548015.0      2021-06-21T23:11:45Z        user  \n",
       "3548016.0      2021-06-21T23:12:00Z        user  \n",
       "3548017.0      2021-06-21T23:12:11Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(29527, 71)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TransactionID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3548013.0</th>\n",
       "      <td>0</td>\n",
       "      <td>569c4257-3d62-466d-a806-e3b456b2b372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548014.0</th>\n",
       "      <td>0</td>\n",
       "      <td>e951afe6-b895-42b8-adff-df0f812e9ee8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548015.0</th>\n",
       "      <td>0</td>\n",
       "      <td>cd69e301-8c15-42b3-9839-cc4c8b9d89db</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548016.0</th>\n",
       "      <td>0</td>\n",
       "      <td>71431bc1-19ec-49b6-a00f-4e8c7d121b02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548017.0</th>\n",
       "      <td>0</td>\n",
       "      <td>de297b4c-d372-4fd3-8c66-ab6ff0c19e16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               EVENT_LABEL                              EVENT_ID\n",
       "TransactionID                                                   \n",
       "3548013.0                0  569c4257-3d62-466d-a806-e3b456b2b372\n",
       "3548014.0                0  e951afe6-b895-42b8-adff-df0f812e9ee8\n",
       "3548015.0                0  cd69e301-8c15-42b3-9839-cc4c8b9d89db\n",
       "3548016.0                0  71431bc1-19ec-49b6-a00f-4e8c7d121b02\n",
       "3548017.0                0  de297b4c-d372-4fd3-8c66-ab6ff0c19e16"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    28358\n",
      "1     1169\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.965252\n",
      "1    0.034748\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ccfraud\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v1</th>\n",
       "      <th>v2</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v5</th>\n",
       "      <th>v6</th>\n",
       "      <th>v7</th>\n",
       "      <th>v8</th>\n",
       "      <th>v9</th>\n",
       "      <th>v10</th>\n",
       "      <th>v11</th>\n",
       "      <th>v12</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v15</th>\n",
       "      <th>v16</th>\n",
       "      <th>v17</th>\n",
       "      <th>v18</th>\n",
       "      <th>v19</th>\n",
       "      <th>v20</th>\n",
       "      <th>v21</th>\n",
       "      <th>v22</th>\n",
       "      <th>v23</th>\n",
       "      <th>v24</th>\n",
       "      <th>v25</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v28</th>\n",
       "      <th>amount</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-1.3598071336738</td>\n",
       "      <td>-0.0727811733098497</td>\n",
       "      <td>2.53634673796914</td>\n",
       "      <td>1.37815522427443</td>\n",
       "      <td>-0.338320769942518</td>\n",
       "      <td>0.462387777762292</td>\n",
       "      <td>0.239598554061257</td>\n",
       "      <td>0.0986979012610507</td>\n",
       "      <td>0.363786969611213</td>\n",
       "      <td>0.0907941719789316</td>\n",
       "      <td>-0.551599533260813</td>\n",
       "      <td>-0.617800855762348</td>\n",
       "      <td>-0.991389847235408</td>\n",
       "      <td>-0.311169353699879</td>\n",
       "      <td>1.46817697209427</td>\n",
       "      <td>-0.470400525259478</td>\n",
       "      <td>0.207971241929242</td>\n",
       "      <td>0.0257905801985591</td>\n",
       "      <td>0.403992960255733</td>\n",
       "      <td>0.251412098239705</td>\n",
       "      <td>-0.018306777944153</td>\n",
       "      <td>0.277837575558899</td>\n",
       "      <td>-0.110473910188767</td>\n",
       "      <td>0.0669280749146731</td>\n",
       "      <td>0.128539358273528</td>\n",
       "      <td>-0.189114843888824</td>\n",
       "      <td>0.133558376740387</td>\n",
       "      <td>-0.0210530534538215</td>\n",
       "      <td>149.62</td>\n",
       "      <td>0</td>\n",
       "      <td>f8e77dc0-44ef-490c-b0de-8b4054b5a031</td>\n",
       "      <td>266103ff-71f2-4057-981d-a54821367237</td>\n",
       "      <td>2021-09-01T00:00:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.19185711131486</td>\n",
       "      <td>0.26615071205963</td>\n",
       "      <td>0.16648011335321</td>\n",
       "      <td>0.448154078460911</td>\n",
       "      <td>0.0600176492822243</td>\n",
       "      <td>-0.0823608088155687</td>\n",
       "      <td>-0.0788029833323113</td>\n",
       "      <td>0.0851016549148104</td>\n",
       "      <td>-0.255425128109186</td>\n",
       "      <td>-0.166974414004614</td>\n",
       "      <td>1.61272666105479</td>\n",
       "      <td>1.06523531137287</td>\n",
       "      <td>0.48909501589608</td>\n",
       "      <td>-0.143772296441519</td>\n",
       "      <td>0.635558093258208</td>\n",
       "      <td>0.463917041022171</td>\n",
       "      <td>-0.114804663102346</td>\n",
       "      <td>-0.183361270123994</td>\n",
       "      <td>-0.145783041325259</td>\n",
       "      <td>-0.0690831352230203</td>\n",
       "      <td>-0.225775248033138</td>\n",
       "      <td>-0.638671952771851</td>\n",
       "      <td>0.101288021253234</td>\n",
       "      <td>-0.339846475529127</td>\n",
       "      <td>0.167170404418143</td>\n",
       "      <td>0.125894532368176</td>\n",
       "      <td>-0.00898309914322813</td>\n",
       "      <td>0.0147241691924927</td>\n",
       "      <td>2.69</td>\n",
       "      <td>0</td>\n",
       "      <td>b557449e-6b35-4be0-991e-337f764f5e21</td>\n",
       "      <td>f85083b2-d31f-4b9e-9d49-eb85c0476f6e</td>\n",
       "      <td>2021-09-01T00:00:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.35835406159823</td>\n",
       "      <td>-1.34016307473609</td>\n",
       "      <td>1.77320934263119</td>\n",
       "      <td>0.379779593034328</td>\n",
       "      <td>-0.503198133318193</td>\n",
       "      <td>1.80049938079263</td>\n",
       "      <td>0.791460956450422</td>\n",
       "      <td>0.247675786588991</td>\n",
       "      <td>-1.51465432260583</td>\n",
       "      <td>0.207642865216696</td>\n",
       "      <td>0.624501459424895</td>\n",
       "      <td>0.066083685268831</td>\n",
       "      <td>0.717292731410831</td>\n",
       "      <td>-0.165945922763554</td>\n",
       "      <td>2.34586494901581</td>\n",
       "      <td>-2.89008319444231</td>\n",
       "      <td>1.10996937869599</td>\n",
       "      <td>-0.121359313195888</td>\n",
       "      <td>-2.26185709530414</td>\n",
       "      <td>0.524979725224404</td>\n",
       "      <td>0.247998153469754</td>\n",
       "      <td>0.771679401917229</td>\n",
       "      <td>0.909412262347719</td>\n",
       "      <td>-0.689280956490685</td>\n",
       "      <td>-0.327641833735251</td>\n",
       "      <td>-0.139096571514147</td>\n",
       "      <td>-0.0553527940384261</td>\n",
       "      <td>-0.0597518405929204</td>\n",
       "      <td>378.66</td>\n",
       "      <td>0</td>\n",
       "      <td>d78d879c-eb7c-455d-8fde-6b1205080a4a</td>\n",
       "      <td>237ca488-c695-402c-b30f-0544554ea96c</td>\n",
       "      <td>2021-09-01T00:01:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.966271711572087</td>\n",
       "      <td>-0.185226008082898</td>\n",
       "      <td>1.79299333957872</td>\n",
       "      <td>-0.863291275036453</td>\n",
       "      <td>-0.0103088796030823</td>\n",
       "      <td>1.24720316752486</td>\n",
       "      <td>0.23760893977178</td>\n",
       "      <td>0.377435874652262</td>\n",
       "      <td>-1.38702406270197</td>\n",
       "      <td>-0.0549519224713749</td>\n",
       "      <td>-0.226487263835401</td>\n",
       "      <td>0.178228225877303</td>\n",
       "      <td>0.507756869957169</td>\n",
       "      <td>-0.28792374549456</td>\n",
       "      <td>-0.631418117709045</td>\n",
       "      <td>-1.0596472454325</td>\n",
       "      <td>-0.684092786345479</td>\n",
       "      <td>1.96577500349538</td>\n",
       "      <td>-1.2326219700892</td>\n",
       "      <td>-0.208037781160366</td>\n",
       "      <td>-0.108300452035545</td>\n",
       "      <td>0.00527359678253453</td>\n",
       "      <td>-0.190320518742841</td>\n",
       "      <td>-1.17557533186321</td>\n",
       "      <td>0.647376034602038</td>\n",
       "      <td>-0.221928844458407</td>\n",
       "      <td>0.0627228487293033</td>\n",
       "      <td>0.0614576285006353</td>\n",
       "      <td>123.5</td>\n",
       "      <td>0</td>\n",
       "      <td>ef448a36-2763-449c-a54a-a9e05af20967</td>\n",
       "      <td>9964b305-b591-4ed0-bff1-8adca81d0194</td>\n",
       "      <td>2021-09-01T00:01:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.15823309349523</td>\n",
       "      <td>0.877736754848451</td>\n",
       "      <td>1.548717846511</td>\n",
       "      <td>0.403033933955121</td>\n",
       "      <td>-0.407193377311653</td>\n",
       "      <td>0.0959214624684256</td>\n",
       "      <td>0.592940745385545</td>\n",
       "      <td>-0.270532677192282</td>\n",
       "      <td>0.817739308235294</td>\n",
       "      <td>0.753074431976354</td>\n",
       "      <td>-0.822842877946363</td>\n",
       "      <td>0.53819555014995</td>\n",
       "      <td>1.3458515932154</td>\n",
       "      <td>-1.11966983471731</td>\n",
       "      <td>0.175121130008994</td>\n",
       "      <td>-0.451449182813529</td>\n",
       "      <td>-0.237033239362776</td>\n",
       "      <td>-0.0381947870352842</td>\n",
       "      <td>0.803486924960175</td>\n",
       "      <td>0.408542360392758</td>\n",
       "      <td>-0.00943069713232919</td>\n",
       "      <td>0.79827849458971</td>\n",
       "      <td>-0.137458079619063</td>\n",
       "      <td>0.141266983824769</td>\n",
       "      <td>-0.206009587619756</td>\n",
       "      <td>0.502292224181569</td>\n",
       "      <td>0.219422229513348</td>\n",
       "      <td>0.215153147499206</td>\n",
       "      <td>69.99</td>\n",
       "      <td>0</td>\n",
       "      <td>e333b3c0-83ae-42dc-a865-178496653029</td>\n",
       "      <td>87b2fbf2-5b7d-479c-85f5-d989bd701f36</td>\n",
       "      <td>2021-09-01T00:02:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   v1                   v2                v3  \\\n",
       "0    -1.3598071336738  -0.0727811733098497  2.53634673796914   \n",
       "1    1.19185711131486     0.26615071205963  0.16648011335321   \n",
       "2   -1.35835406159823    -1.34016307473609  1.77320934263119   \n",
       "3  -0.966271711572087   -0.185226008082898  1.79299333957872   \n",
       "4   -1.15823309349523    0.877736754848451    1.548717846511   \n",
       "\n",
       "                   v4                   v5                   v6  \\\n",
       "0    1.37815522427443   -0.338320769942518    0.462387777762292   \n",
       "1   0.448154078460911   0.0600176492822243  -0.0823608088155687   \n",
       "2   0.379779593034328   -0.503198133318193     1.80049938079263   \n",
       "3  -0.863291275036453  -0.0103088796030823     1.24720316752486   \n",
       "4   0.403033933955121   -0.407193377311653   0.0959214624684256   \n",
       "\n",
       "                    v7                  v8                  v9  \\\n",
       "0    0.239598554061257  0.0986979012610507   0.363786969611213   \n",
       "1  -0.0788029833323113  0.0851016549148104  -0.255425128109186   \n",
       "2    0.791460956450422   0.247675786588991   -1.51465432260583   \n",
       "3     0.23760893977178   0.377435874652262   -1.38702406270197   \n",
       "4    0.592940745385545  -0.270532677192282   0.817739308235294   \n",
       "\n",
       "                   v10                 v11                 v12  \\\n",
       "0   0.0907941719789316  -0.551599533260813  -0.617800855762348   \n",
       "1   -0.166974414004614    1.61272666105479    1.06523531137287   \n",
       "2    0.207642865216696   0.624501459424895   0.066083685268831   \n",
       "3  -0.0549519224713749  -0.226487263835401   0.178228225877303   \n",
       "4    0.753074431976354  -0.822842877946363    0.53819555014995   \n",
       "\n",
       "                  v13                 v14                 v15  \\\n",
       "0  -0.991389847235408  -0.311169353699879    1.46817697209427   \n",
       "1    0.48909501589608  -0.143772296441519   0.635558093258208   \n",
       "2   0.717292731410831  -0.165945922763554    2.34586494901581   \n",
       "3   0.507756869957169   -0.28792374549456  -0.631418117709045   \n",
       "4     1.3458515932154   -1.11966983471731   0.175121130008994   \n",
       "\n",
       "                  v16                 v17                  v18  \\\n",
       "0  -0.470400525259478   0.207971241929242   0.0257905801985591   \n",
       "1   0.463917041022171  -0.114804663102346   -0.183361270123994   \n",
       "2   -2.89008319444231    1.10996937869599   -0.121359313195888   \n",
       "3    -1.0596472454325  -0.684092786345479     1.96577500349538   \n",
       "4  -0.451449182813529  -0.237033239362776  -0.0381947870352842   \n",
       "\n",
       "                  v19                  v20                   v21  \\\n",
       "0   0.403992960255733    0.251412098239705    -0.018306777944153   \n",
       "1  -0.145783041325259  -0.0690831352230203    -0.225775248033138   \n",
       "2   -2.26185709530414    0.524979725224404     0.247998153469754   \n",
       "3    -1.2326219700892   -0.208037781160366    -0.108300452035545   \n",
       "4   0.803486924960175    0.408542360392758  -0.00943069713232919   \n",
       "\n",
       "                   v22                 v23                 v24  \\\n",
       "0    0.277837575558899  -0.110473910188767  0.0669280749146731   \n",
       "1   -0.638671952771851   0.101288021253234  -0.339846475529127   \n",
       "2    0.771679401917229   0.909412262347719  -0.689280956490685   \n",
       "3  0.00527359678253453  -0.190320518742841   -1.17557533186321   \n",
       "4     0.79827849458971  -0.137458079619063   0.141266983824769   \n",
       "\n",
       "                  v25                 v26                   v27  \\\n",
       "0   0.128539358273528  -0.189114843888824     0.133558376740387   \n",
       "1   0.167170404418143   0.125894532368176  -0.00898309914322813   \n",
       "2  -0.327641833735251  -0.139096571514147   -0.0553527940384261   \n",
       "3   0.647376034602038  -0.221928844458407    0.0627228487293033   \n",
       "4  -0.206009587619756   0.502292224181569     0.219422229513348   \n",
       "\n",
       "                   v28  amount  EVENT_LABEL  \\\n",
       "0  -0.0210530534538215  149.62            0   \n",
       "1   0.0147241691924927    2.69            0   \n",
       "2  -0.0597518405929204  378.66            0   \n",
       "3   0.0614576285006353   123.5            0   \n",
       "4    0.215153147499206   69.99            0   \n",
       "\n",
       "                               EVENT_ID                             ENTITY_ID  \\\n",
       "0  f8e77dc0-44ef-490c-b0de-8b4054b5a031  266103ff-71f2-4057-981d-a54821367237   \n",
       "1  b557449e-6b35-4be0-991e-337f764f5e21  f85083b2-d31f-4b9e-9d49-eb85c0476f6e   \n",
       "2  d78d879c-eb7c-455d-8fde-6b1205080a4a  237ca488-c695-402c-b30f-0544554ea96c   \n",
       "3  ef448a36-2763-449c-a54a-a9e05af20967  9964b305-b591-4ed0-bff1-8adca81d0194   \n",
       "4  e333b3c0-83ae-42dc-a865-178496653029  87b2fbf2-5b7d-479c-85f5-d989bd701f36   \n",
       "\n",
       "        EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "0  2021-09-01T00:00:00Z  2023-05-05T08:46:09Z        user  \n",
       "1  2021-09-01T00:00:00Z  2023-05-05T08:46:09Z        user  \n",
       "2  2021-09-01T00:01:00Z  2023-05-05T08:46:09Z        user  \n",
       "3  2021-09-01T00:01:00Z  2023-05-05T08:46:09Z        user  \n",
       "4  2021-09-01T00:02:00Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "35\n",
      "(227845, 35)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v1</th>\n",
       "      <th>v2</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v5</th>\n",
       "      <th>v6</th>\n",
       "      <th>v7</th>\n",
       "      <th>v8</th>\n",
       "      <th>v9</th>\n",
       "      <th>v10</th>\n",
       "      <th>v11</th>\n",
       "      <th>v12</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v15</th>\n",
       "      <th>v16</th>\n",
       "      <th>v17</th>\n",
       "      <th>v18</th>\n",
       "      <th>v19</th>\n",
       "      <th>v20</th>\n",
       "      <th>v21</th>\n",
       "      <th>v22</th>\n",
       "      <th>v23</th>\n",
       "      <th>v24</th>\n",
       "      <th>v25</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v28</th>\n",
       "      <th>amount</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>227845</th>\n",
       "      <td>1.91402682161454</td>\n",
       "      <td>-0.490067987909997</td>\n",
       "      <td>-0.326111312515118</td>\n",
       "      <td>0.604710739174721</td>\n",
       "      <td>-0.8501359998436</td>\n",
       "      <td>-0.736318677031096</td>\n",
       "      <td>-0.524057962475328</td>\n",
       "      <td>-0.0886141066361987</td>\n",
       "      <td>1.09112510472248</td>\n",
       "      <td>0.093484357816225</td>\n",
       "      <td>-0.892304625856107</td>\n",
       "      <td>0.0272205159068718</td>\n",
       "      <td>-0.243790209618721</td>\n",
       "      <td>0.0317740067189187</td>\n",
       "      <td>0.900623897113791</td>\n",
       "      <td>0.536032161644219</td>\n",
       "      <td>-0.648408094097169</td>\n",
       "      <td>0.183072340001028</td>\n",
       "      <td>-0.48632249422331</td>\n",
       "      <td>-0.13957876335222</td>\n",
       "      <td>0.210958428878652</td>\n",
       "      <td>0.639337879054097</td>\n",
       "      <td>0.147522551988298</td>\n",
       "      <td>0.0736542664022496</td>\n",
       "      <td>-0.318378246601246</td>\n",
       "      <td>0.350612262707235</td>\n",
       "      <td>-0.0238434747433154</td>\n",
       "      <td>-0.0371393315055126</td>\n",
       "      <td>50</td>\n",
       "      <td>bd64c6f1-1c1d-49ea-8561-6cc56bd2a173</td>\n",
       "      <td>ee6232a9-6ba4-4654-b406-72e582f01031</td>\n",
       "      <td>2021-12-10T20:48:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227846</th>\n",
       "      <td>2.15269624649984</td>\n",
       "      <td>-0.036160786158066</td>\n",
       "      <td>-2.23181098049803</td>\n",
       "      <td>0.0917658435583919</td>\n",
       "      <td>0.537612206488446</td>\n",
       "      <td>-1.36810250972644</td>\n",
       "      <td>0.613326738349479</td>\n",
       "      <td>-0.455251954849699</td>\n",
       "      <td>0.29181359004335</td>\n",
       "      <td>0.253161344559488</td>\n",
       "      <td>-1.50188197076942</td>\n",
       "      <td>-0.870607641524177</td>\n",
       "      <td>-1.44173756499372</td>\n",
       "      <td>0.988756626201074</td>\n",
       "      <td>0.496349234837293</td>\n",
       "      <td>-0.0686989613348823</td>\n",
       "      <td>-0.454073497932566</td>\n",
       "      <td>-0.299095262736551</td>\n",
       "      <td>0.267443131415241</td>\n",
       "      <td>-0.275777914750361</td>\n",
       "      <td>0.0171533555339963</td>\n",
       "      <td>0.0632416225359206</td>\n",
       "      <td>-0.0345611249491173</td>\n",
       "      <td>-0.626866212626912</td>\n",
       "      <td>0.249213129413917</td>\n",
       "      <td>0.773930519516097</td>\n",
       "      <td>-0.137114784582898</td>\n",
       "      <td>-0.0906106088420727</td>\n",
       "      <td>14.95</td>\n",
       "      <td>6728a9b7-ab9c-404e-93a8-fcf76baf7e8e</td>\n",
       "      <td>3dc93b80-f110-4355-b516-5174a0cd214d</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227847</th>\n",
       "      <td>-4.03479516717275</td>\n",
       "      <td>2.30507905571504</td>\n",
       "      <td>-1.46169292457709</td>\n",
       "      <td>-0.729887055238227</td>\n",
       "      <td>-1.5287503399573</td>\n",
       "      <td>-1.22567909778369</td>\n",
       "      <td>-0.893353679497868</td>\n",
       "      <td>1.62252199369554</td>\n",
       "      <td>1.29199841774415</td>\n",
       "      <td>-0.0409558359937061</td>\n",
       "      <td>-0.971425287697512</td>\n",
       "      <td>0.574743695630458</td>\n",
       "      <td>0.155656078919204</td>\n",
       "      <td>-0.729054997889385</td>\n",
       "      <td>0.477438947999659</td>\n",
       "      <td>1.06171851569252</td>\n",
       "      <td>0.93469475367536</td>\n",
       "      <td>0.403768792198479</td>\n",
       "      <td>-0.494929851777981</td>\n",
       "      <td>-0.0810925858921718</td>\n",
       "      <td>-0.392556502541116</td>\n",
       "      <td>-0.78759906251576</td>\n",
       "      <td>0.343467795972994</td>\n",
       "      <td>-0.0903313999840935</td>\n",
       "      <td>0.248286972151669</td>\n",
       "      <td>-0.238523845342424</td>\n",
       "      <td>0.26648354183946</td>\n",
       "      <td>-0.0622361634691654</td>\n",
       "      <td>7.7</td>\n",
       "      <td>1f4a3cae-3a95-48b7-8cc9-dd2258689f37</td>\n",
       "      <td>58879cd9-4053-4e16-9144-3b04c276f74e</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227848</th>\n",
       "      <td>-1.66874106862583</td>\n",
       "      <td>1.16805471760364</td>\n",
       "      <td>0.249642461553748</td>\n",
       "      <td>-1.26849748925032</td>\n",
       "      <td>0.785922573014156</td>\n",
       "      <td>-0.663958562166729</td>\n",
       "      <td>0.859432973616895</td>\n",
       "      <td>0.0681106263347446</td>\n",
       "      <td>-0.144183044927318</td>\n",
       "      <td>0.0432880841287975</td>\n",
       "      <td>0.542013736060061</td>\n",
       "      <td>1.00202450469061</td>\n",
       "      <td>0.400759595743433</td>\n",
       "      <td>0.136412487776037</td>\n",
       "      <td>-1.28964902448879</td>\n",
       "      <td>0.276827961550432</td>\n",
       "      <td>-0.868491702025561</td>\n",
       "      <td>-0.366839507131127</td>\n",
       "      <td>-0.187391599008302</td>\n",
       "      <td>-0.0335233340620367</td>\n",
       "      <td>-0.247543775399679</td>\n",
       "      <td>-0.592536769878023</td>\n",
       "      <td>-0.286693549546811</td>\n",
       "      <td>-0.378855664973759</td>\n",
       "      <td>-0.0774289041638705</td>\n",
       "      <td>0.0676084004301294</td>\n",
       "      <td>-0.27896200360197</td>\n",
       "      <td>-0.0641926690992577</td>\n",
       "      <td>6.99</td>\n",
       "      <td>930cd5cb-b226-4af5-8dda-574340d05a12</td>\n",
       "      <td>bb616582-e509-4c77-9154-755ca81039c4</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227849</th>\n",
       "      <td>-0.550678353341949</td>\n",
       "      <td>-0.429004102182237</td>\n",
       "      <td>-1.29189255347072</td>\n",
       "      <td>-0.414409226593379</td>\n",
       "      <td>-0.292228538671312</td>\n",
       "      <td>0.071842939235058</td>\n",
       "      <td>2.42606795091335</td>\n",
       "      <td>-0.212729758223082</td>\n",
       "      <td>0.412374372851086</td>\n",
       "      <td>-1.93996940549555</td>\n",
       "      <td>-1.81011838293809</td>\n",
       "      <td>-1.22351031687552</td>\n",
       "      <td>-1.32491464932768</td>\n",
       "      <td>-1.46239178995552</td>\n",
       "      <td>-0.31164055759838</td>\n",
       "      <td>0.506707760378257</td>\n",
       "      <td>0.739932584638577</td>\n",
       "      <td>0.892422017204659</td>\n",
       "      <td>0.195042529037103</td>\n",
       "      <td>0.791126747715284</td>\n",
       "      <td>0.00303193944814891</td>\n",
       "      <td>-0.645782978858753</td>\n",
       "      <td>0.877016475964068</td>\n",
       "      <td>-1.22852893747944</td>\n",
       "      <td>-0.0362812174160739</td>\n",
       "      <td>-0.110609895882901</td>\n",
       "      <td>-0.0983803135271981</td>\n",
       "      <td>0.0959849443846813</td>\n",
       "      <td>460.71</td>\n",
       "      <td>2e909126-def3-4d82-9485-03798817c942</td>\n",
       "      <td>88ea4bc9-29fd-4302-913d-e6788cb7e6ab</td>\n",
       "      <td>2021-12-10T20:50:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        v1                  v2                  v3  \\\n",
       "227845    1.91402682161454  -0.490067987909997  -0.326111312515118   \n",
       "227846    2.15269624649984  -0.036160786158066   -2.23181098049803   \n",
       "227847   -4.03479516717275    2.30507905571504   -1.46169292457709   \n",
       "227848   -1.66874106862583    1.16805471760364   0.249642461553748   \n",
       "227849  -0.550678353341949  -0.429004102182237   -1.29189255347072   \n",
       "\n",
       "                        v4                  v5                  v6  \\\n",
       "227845   0.604710739174721    -0.8501359998436  -0.736318677031096   \n",
       "227846  0.0917658435583919   0.537612206488446   -1.36810250972644   \n",
       "227847  -0.729887055238227    -1.5287503399573   -1.22567909778369   \n",
       "227848   -1.26849748925032   0.785922573014156  -0.663958562166729   \n",
       "227849  -0.414409226593379  -0.292228538671312   0.071842939235058   \n",
       "\n",
       "                        v7                   v8                  v9  \\\n",
       "227845  -0.524057962475328  -0.0886141066361987    1.09112510472248   \n",
       "227846   0.613326738349479   -0.455251954849699    0.29181359004335   \n",
       "227847  -0.893353679497868     1.62252199369554    1.29199841774415   \n",
       "227848   0.859432973616895   0.0681106263347446  -0.144183044927318   \n",
       "227849    2.42606795091335   -0.212729758223082   0.412374372851086   \n",
       "\n",
       "                        v10                 v11                 v12  \\\n",
       "227845    0.093484357816225  -0.892304625856107  0.0272205159068718   \n",
       "227846    0.253161344559488   -1.50188197076942  -0.870607641524177   \n",
       "227847  -0.0409558359937061  -0.971425287697512   0.574743695630458   \n",
       "227848   0.0432880841287975   0.542013736060061    1.00202450469061   \n",
       "227849    -1.93996940549555   -1.81011838293809   -1.22351031687552   \n",
       "\n",
       "                       v13                 v14                v15  \\\n",
       "227845  -0.243790209618721  0.0317740067189187  0.900623897113791   \n",
       "227846   -1.44173756499372   0.988756626201074  0.496349234837293   \n",
       "227847   0.155656078919204  -0.729054997889385  0.477438947999659   \n",
       "227848   0.400759595743433   0.136412487776037  -1.28964902448879   \n",
       "227849   -1.32491464932768   -1.46239178995552  -0.31164055759838   \n",
       "\n",
       "                        v16                 v17                 v18  \\\n",
       "227845    0.536032161644219  -0.648408094097169   0.183072340001028   \n",
       "227846  -0.0686989613348823  -0.454073497932566  -0.299095262736551   \n",
       "227847     1.06171851569252    0.93469475367536   0.403768792198479   \n",
       "227848    0.276827961550432  -0.868491702025561  -0.366839507131127   \n",
       "227849    0.506707760378257   0.739932584638577   0.892422017204659   \n",
       "\n",
       "                       v19                  v20                  v21  \\\n",
       "227845   -0.48632249422331    -0.13957876335222    0.210958428878652   \n",
       "227846   0.267443131415241   -0.275777914750361   0.0171533555339963   \n",
       "227847  -0.494929851777981  -0.0810925858921718   -0.392556502541116   \n",
       "227848  -0.187391599008302  -0.0335233340620367   -0.247543775399679   \n",
       "227849   0.195042529037103    0.791126747715284  0.00303193944814891   \n",
       "\n",
       "                       v22                  v23                  v24  \\\n",
       "227845   0.639337879054097    0.147522551988298   0.0736542664022496   \n",
       "227846  0.0632416225359206  -0.0345611249491173   -0.626866212626912   \n",
       "227847   -0.78759906251576    0.343467795972994  -0.0903313999840935   \n",
       "227848  -0.592536769878023   -0.286693549546811   -0.378855664973759   \n",
       "227849  -0.645782978858753    0.877016475964068    -1.22852893747944   \n",
       "\n",
       "                        v25                 v26                  v27  \\\n",
       "227845   -0.318378246601246   0.350612262707235  -0.0238434747433154   \n",
       "227846    0.249213129413917   0.773930519516097   -0.137114784582898   \n",
       "227847    0.248286972151669  -0.238523845342424     0.26648354183946   \n",
       "227848  -0.0774289041638705  0.0676084004301294    -0.27896200360197   \n",
       "227849  -0.0362812174160739  -0.110609895882901  -0.0983803135271981   \n",
       "\n",
       "                        v28  amount                              EVENT_ID  \\\n",
       "227845  -0.0371393315055126      50  bd64c6f1-1c1d-49ea-8561-6cc56bd2a173   \n",
       "227846  -0.0906106088420727   14.95  6728a9b7-ab9c-404e-93a8-fcf76baf7e8e   \n",
       "227847  -0.0622361634691654     7.7  1f4a3cae-3a95-48b7-8cc9-dd2258689f37   \n",
       "227848  -0.0641926690992577    6.99  930cd5cb-b226-4af5-8dda-574340d05a12   \n",
       "227849   0.0959849443846813  460.71  2e909126-def3-4d82-9485-03798817c942   \n",
       "\n",
       "                                   ENTITY_ID       EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "227845  ee6232a9-6ba4-4654-b406-72e582f01031  2021-12-10T20:48:00Z        user  \n",
       "227846  3dc93b80-f110-4355-b516-5174a0cd214d  2021-12-10T20:49:00Z        user  \n",
       "227847  58879cd9-4053-4e16-9144-3b04c276f74e  2021-12-10T20:49:00Z        user  \n",
       "227848  bb616582-e509-4c77-9154-755ca81039c4  2021-12-10T20:49:00Z        user  \n",
       "227849  88ea4bc9-29fd-4302-913d-e6788cb7e6ab  2021-12-10T20:50:00Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(56962, 33)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>227845</th>\n",
       "      <td>0</td>\n",
       "      <td>bd64c6f1-1c1d-49ea-8561-6cc56bd2a173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227846</th>\n",
       "      <td>0</td>\n",
       "      <td>6728a9b7-ab9c-404e-93a8-fcf76baf7e8e</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227847</th>\n",
       "      <td>0</td>\n",
       "      <td>1f4a3cae-3a95-48b7-8cc9-dd2258689f37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227848</th>\n",
       "      <td>0</td>\n",
       "      <td>930cd5cb-b226-4af5-8dda-574340d05a12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227849</th>\n",
       "      <td>0</td>\n",
       "      <td>2e909126-def3-4d82-9485-03798817c942</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        EVENT_LABEL                              EVENT_ID\n",
       "227845            0  bd64c6f1-1c1d-49ea-8561-6cc56bd2a173\n",
       "227846            0  6728a9b7-ab9c-404e-93a8-fcf76baf7e8e\n",
       "227847            0  1f4a3cae-3a95-48b7-8cc9-dd2258689f37\n",
       "227848            0  930cd5cb-b226-4af5-8dda-574340d05a12\n",
       "227849            0  2e909126-def3-4d82-9485-03798817c942"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    56887\n",
      "1       75\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.99817\n",
      "1    0.00183\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "fraudecom\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>purchase_value</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>source</th>\n",
       "      <th>browser</th>\n",
       "      <th>age</th>\n",
       "      <th>ip_address</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>time_since_signup</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>115086</th>\n",
       "      <td>309557</td>\n",
       "      <td>14</td>\n",
       "      <td>BBPACGBUVJUXF</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>38</td>\n",
       "      <td>119.75.87.223</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-01-01T00:00:44Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41990</th>\n",
       "      <td>124539</td>\n",
       "      <td>14</td>\n",
       "      <td>BBPACGBUVJUXF</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>38</td>\n",
       "      <td>119.75.87.223</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-01-01T00:00:45Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>134836</th>\n",
       "      <td>161246</td>\n",
       "      <td>14</td>\n",
       "      <td>BBPACGBUVJUXF</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>38</td>\n",
       "      <td>119.75.87.223</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-01-01T00:00:46Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24572</th>\n",
       "      <td>356414</td>\n",
       "      <td>14</td>\n",
       "      <td>BBPACGBUVJUXF</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>38</td>\n",
       "      <td>119.75.87.223</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-01-01T00:00:47Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>106160</th>\n",
       "      <td>338656</td>\n",
       "      <td>14</td>\n",
       "      <td>BBPACGBUVJUXF</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>38</td>\n",
       "      <td>119.75.87.223</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-01-01T00:00:48Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       EVENT_ID purchase_value      ENTITY_ID source browser age  \\\n",
       "115086   309557             14  BBPACGBUVJUXF    Ads  Chrome  38   \n",
       "41990    124539             14  BBPACGBUVJUXF    Ads  Chrome  38   \n",
       "134836   161246             14  BBPACGBUVJUXF    Ads  Chrome  38   \n",
       "24572    356414             14  BBPACGBUVJUXF    Ads  Chrome  38   \n",
       "106160   338656             14  BBPACGBUVJUXF    Ads  Chrome  38   \n",
       "\n",
       "           ip_address  EVENT_LABEL  time_since_signup       EVENT_TIMESTAMP  \\\n",
       "115086  119.75.87.223            1                  1  2021-01-01T00:00:44Z   \n",
       "41990   119.75.87.223            1                  1  2021-01-01T00:00:45Z   \n",
       "134836  119.75.87.223            1                  1  2021-01-01T00:00:46Z   \n",
       "24572   119.75.87.223            1                  1  2021-01-01T00:00:47Z   \n",
       "106160  119.75.87.223            1                  1  2021-01-01T00:00:48Z   \n",
       "\n",
       "             LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "115086  2023-05-05T08:46:09Z        user  \n",
       "41990   2023-05-05T08:46:09Z        user  \n",
       "134836  2023-05-05T08:46:09Z        user  \n",
       "24572   2023-05-05T08:46:09Z        user  \n",
       "106160  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "12\n",
      "(120889, 12)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>purchase_value</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>source</th>\n",
       "      <th>browser</th>\n",
       "      <th>age</th>\n",
       "      <th>ip_address</th>\n",
       "      <th>time_since_signup</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>69628</th>\n",
       "      <td>304435</td>\n",
       "      <td>50</td>\n",
       "      <td>EFASVBVKDGQKI</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>40</td>\n",
       "      <td>202.165.191.211</td>\n",
       "      <td>35310</td>\n",
       "      <td>2021-08-30T15:18:56Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120573</th>\n",
       "      <td>222177</td>\n",
       "      <td>30</td>\n",
       "      <td>LUAQDRQGTDVHQ</td>\n",
       "      <td>SEO</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>39</td>\n",
       "      <td>2.82.213.23</td>\n",
       "      <td>52655</td>\n",
       "      <td>2021-08-30T15:20:03Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105050</th>\n",
       "      <td>308836</td>\n",
       "      <td>35</td>\n",
       "      <td>ODWUMTCAPBLXP</td>\n",
       "      <td>Ads</td>\n",
       "      <td>FireFox</td>\n",
       "      <td>20</td>\n",
       "      <td>73.185.82.155</td>\n",
       "      <td>35083</td>\n",
       "      <td>2021-08-30T15:20:35Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118037</th>\n",
       "      <td>202515</td>\n",
       "      <td>20</td>\n",
       "      <td>LTOEZIQLNHGAC</td>\n",
       "      <td>Ads</td>\n",
       "      <td>IE</td>\n",
       "      <td>37</td>\n",
       "      <td>108.236.13.248</td>\n",
       "      <td>4032</td>\n",
       "      <td>2021-08-30T15:27:14Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6094</th>\n",
       "      <td>260389</td>\n",
       "      <td>46</td>\n",
       "      <td>GMTRBZCZVBKQC</td>\n",
       "      <td>Ads</td>\n",
       "      <td>Chrome</td>\n",
       "      <td>34</td>\n",
       "      <td>129.163.194.162</td>\n",
       "      <td>19237</td>\n",
       "      <td>2021-08-30T15:28:27Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       EVENT_ID purchase_value      ENTITY_ID source  browser age  \\\n",
       "69628    304435             50  EFASVBVKDGQKI    Ads   Chrome  40   \n",
       "120573   222177             30  LUAQDRQGTDVHQ    SEO   Chrome  39   \n",
       "105050   308836             35  ODWUMTCAPBLXP    Ads  FireFox  20   \n",
       "118037   202515             20  LTOEZIQLNHGAC    Ads       IE  37   \n",
       "6094     260389             46  GMTRBZCZVBKQC    Ads   Chrome  34   \n",
       "\n",
       "             ip_address  time_since_signup       EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "69628   202.165.191.211              35310  2021-08-30T15:18:56Z        user  \n",
       "120573      2.82.213.23              52655  2021-08-30T15:20:03Z        user  \n",
       "105050    73.185.82.155              35083  2021-08-30T15:20:35Z        user  \n",
       "118037   108.236.13.248               4032  2021-08-30T15:27:14Z        user  \n",
       "6094    129.163.194.162              19237  2021-08-30T15:28:27Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(30223, 10)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>69628</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120573</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>105050</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118037</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6094</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        EVENT_LABEL\n",
       "69628             0\n",
       "120573            0\n",
       "105050            0\n",
       "118037            0\n",
       "6094              0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    28834\n",
      "1     1389\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.894432\n",
      "1    0.105568\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "twitterbot\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unnamed: 0</th>\n",
       "      <th>created_at</th>\n",
       "      <th>default_profile</th>\n",
       "      <th>default_profile_image</th>\n",
       "      <th>description</th>\n",
       "      <th>favourites_count</th>\n",
       "      <th>followers_count</th>\n",
       "      <th>friends_count</th>\n",
       "      <th>geo_enabled</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>lang</th>\n",
       "      <th>location</th>\n",
       "      <th>profile_background_image_url</th>\n",
       "      <th>profile_image_url</th>\n",
       "      <th>screen_name</th>\n",
       "      <th>statuses_count</th>\n",
       "      <th>verified</th>\n",
       "      <th>average_tweets_per_day</th>\n",
       "      <th>account_age_days</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20963</th>\n",
       "      <td>20963</td>\n",
       "      <td>2013-05-27 21:22:15</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.</td>\n",
       "      <td>32374</td>\n",
       "      <td>2395</td>\n",
       "      <td>2823</td>\n",
       "      <td>True</td>\n",
       "      <td>1463172686</td>\n",
       "      <td>en</td>\n",
       "      <td>Mount Morris, MI</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg</td>\n",
       "      <td>blerdwords</td>\n",
       "      <td>11448</td>\n",
       "      <td>False</td>\n",
       "      <td>4.336</td>\n",
       "      <td>2640</td>\n",
       "      <td>0</td>\n",
       "      <td>d300a2e5-86e1-488a-8ca6-6b49cc517164</td>\n",
       "      <td>2022-05-07T14:44:33Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6331</th>\n",
       "      <td>6331</td>\n",
       "      <td>2009-09-14 18:58:36</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo</td>\n",
       "      <td>68664</td>\n",
       "      <td>350789</td>\n",
       "      <td>1528</td>\n",
       "      <td>False</td>\n",
       "      <td>74231747</td>\n",
       "      <td>en</td>\n",
       "      <td>Los Angeles - Always a Texan</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme9/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg</td>\n",
       "      <td>JennyJohnsonHi5</td>\n",
       "      <td>18732</td>\n",
       "      <td>True</td>\n",
       "      <td>4.694</td>\n",
       "      <td>3991</td>\n",
       "      <td>0</td>\n",
       "      <td>c253258d-91c5-483c-ba5e-c357551adf16</td>\n",
       "      <td>2023-03-05T02:17:19Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17209</th>\n",
       "      <td>17209</td>\n",
       "      <td>2010-06-06 16:27:08</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>54</td>\n",
       "      <td>657</td>\n",
       "      <td>False</td>\n",
       "      <td>152688783</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Abu Dhabi</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png</td>\n",
       "      <td>AbnerJosh</td>\n",
       "      <td>161</td>\n",
       "      <td>False</td>\n",
       "      <td>0.043</td>\n",
       "      <td>3726</td>\n",
       "      <td>1</td>\n",
       "      <td>7af565e8-c19c-4132-b5f7-b017efad7951</td>\n",
       "      <td>2022-11-03T20:32:13Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23964</th>\n",
       "      <td>23964</td>\n",
       "      <td>2010-06-22 21:56:09</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.</td>\n",
       "      <td>1517</td>\n",
       "      <td>55881</td>\n",
       "      <td>991</td>\n",
       "      <td>True</td>\n",
       "      <td>158502985</td>\n",
       "      <td>en</td>\n",
       "      <td>Regina, SK Canada</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg</td>\n",
       "      <td>GlobalRegina</td>\n",
       "      <td>103379</td>\n",
       "      <td>True</td>\n",
       "      <td>27.865</td>\n",
       "      <td>3710</td>\n",
       "      <td>0</td>\n",
       "      <td>28e07864-c08a-43b8-9cc9-423c25254b0b</td>\n",
       "      <td>2022-06-30T20:31:18Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30569</th>\n",
       "      <td>30569</td>\n",
       "      <td>2009-03-10 02:26:45</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Detritus</td>\n",
       "      <td>2616</td>\n",
       "      <td>1118405</td>\n",
       "      <td>657</td>\n",
       "      <td>False</td>\n",
       "      <td>23544268</td>\n",
       "      <td>no</td>\n",
       "      <td>unknown</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme11/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg</td>\n",
       "      <td>OfficialKat</td>\n",
       "      <td>4980</td>\n",
       "      <td>True</td>\n",
       "      <td>1.191</td>\n",
       "      <td>4180</td>\n",
       "      <td>0</td>\n",
       "      <td>96724009-efa7-4558-8bad-3aeaa7bfdea5</td>\n",
       "      <td>2022-05-11T12:37:16Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      unnamed: 0           created_at default_profile default_profile_image  \\\n",
       "20963      20963  2013-05-27 21:22:15            True                 False   \n",
       "6331        6331  2009-09-14 18:58:36           False                 False   \n",
       "17209      17209  2010-06-06 16:27:08            True                 False   \n",
       "23964      23964  2010-06-22 21:56:09           False                 False   \n",
       "30569      30569  2009-03-10 02:26:45           False                 False   \n",
       "\n",
       "                                                                                                                                                           description  \\\n",
       "20963     WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.   \n",
       "6331   Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo   \n",
       "17209                                                                                                                                                              NaN   \n",
       "23964                    Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.   \n",
       "30569                                                                                                                                                         Detritus   \n",
       "\n",
       "      favourites_count followers_count friends_count geo_enabled    EVENT_ID  \\\n",
       "20963            32374            2395          2823        True  1463172686   \n",
       "6331             68664          350789          1528       False    74231747   \n",
       "17209               74              54           657       False   152688783   \n",
       "23964             1517           55881           991        True   158502985   \n",
       "30569             2616         1118405           657       False    23544268   \n",
       "\n",
       "      lang                      location  \\\n",
       "20963   en              Mount Morris, MI   \n",
       "6331    en  Los Angeles - Always a Texan   \n",
       "17209  NaN                     Abu Dhabi   \n",
       "23964   en             Regina, SK Canada   \n",
       "30569   no                       unknown   \n",
       "\n",
       "                            profile_background_image_url  \\\n",
       "20963   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "6331    http://abs.twimg.com/images/themes/theme9/bg.gif   \n",
       "17209   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "23964   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "30569  http://abs.twimg.com/images/themes/theme11/bg.gif   \n",
       "\n",
       "                                                                 profile_image_url  \\\n",
       "20963  http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg   \n",
       "6331    http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg   \n",
       "17209         http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png   \n",
       "23964   http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg   \n",
       "30569  http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg   \n",
       "\n",
       "           screen_name statuses_count verified average_tweets_per_day  \\\n",
       "20963       blerdwords          11448    False                  4.336   \n",
       "6331   JennyJohnsonHi5          18732     True                  4.694   \n",
       "17209        AbnerJosh            161    False                  0.043   \n",
       "23964     GlobalRegina         103379     True                 27.865   \n",
       "30569      OfficialKat           4980     True                  1.191   \n",
       "\n",
       "      account_age_days  EVENT_LABEL                             ENTITY_ID  \\\n",
       "20963             2640            0  d300a2e5-86e1-488a-8ca6-6b49cc517164   \n",
       "6331              3991            0  c253258d-91c5-483c-ba5e-c357551adf16   \n",
       "17209             3726            1  7af565e8-c19c-4132-b5f7-b017efad7951   \n",
       "23964             3710            0  28e07864-c08a-43b8-9cc9-423c25254b0b   \n",
       "30569             4180            0  96724009-efa7-4558-8bad-3aeaa7bfdea5   \n",
       "\n",
       "            EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "20963  2022-05-07T14:44:33Z  2023-05-05T08:46:09Z        user  \n",
       "6331   2023-03-05T02:17:19Z  2023-05-05T08:46:09Z        user  \n",
       "17209  2022-11-03T20:32:13Z  2023-05-05T08:46:09Z        user  \n",
       "23964  2022-06-30T20:31:18Z  2023-05-05T08:46:09Z        user  \n",
       "30569  2022-05-11T12:37:16Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "24\n",
      "(29950, 24)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unnamed: 0</th>\n",
       "      <th>created_at</th>\n",
       "      <th>default_profile</th>\n",
       "      <th>default_profile_image</th>\n",
       "      <th>description</th>\n",
       "      <th>favourites_count</th>\n",
       "      <th>followers_count</th>\n",
       "      <th>friends_count</th>\n",
       "      <th>geo_enabled</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>lang</th>\n",
       "      <th>location</th>\n",
       "      <th>profile_background_image_url</th>\n",
       "      <th>profile_image_url</th>\n",
       "      <th>screen_name</th>\n",
       "      <th>statuses_count</th>\n",
       "      <th>verified</th>\n",
       "      <th>average_tweets_per_day</th>\n",
       "      <th>account_age_days</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2016-11-09 05:01:30</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Photographing the American West since 1980. I specialize in location portraits &amp; events, both indoors &amp; outside, using natural light &amp; portable studio lighting.</td>\n",
       "      <td>536</td>\n",
       "      <td>860</td>\n",
       "      <td>880</td>\n",
       "      <td>False</td>\n",
       "      <td>796216118331310080</td>\n",
       "      <td>en</td>\n",
       "      <td>Estados Unidos</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/802329632838037504/CQN6gP7k_normal.jpg</td>\n",
       "      <td>CJRubinPhoto</td>\n",
       "      <td>252</td>\n",
       "      <td>False</td>\n",
       "      <td>0.183</td>\n",
       "      <td>1379</td>\n",
       "      <td>0a8d3859-dec4-4ba6-abae-74b5523b042f</td>\n",
       "      <td>2022-08-04T08:04:08Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>9</td>\n",
       "      <td>2012-02-14 15:33:48</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Man Utd fan. mostly here for football. Takes photos. Ex care worker, does stuff with computers often sarcastic. 🐝🇬🇧🇪🇺</td>\n",
       "      <td>36384</td>\n",
       "      <td>2130</td>\n",
       "      <td>3363</td>\n",
       "      <td>True</td>\n",
       "      <td>492306486</td>\n",
       "      <td>en</td>\n",
       "      <td>United Kingdom</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme14/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1211318786512609281/e6UqYEa4_normal.jpg</td>\n",
       "      <td>GhamGraham</td>\n",
       "      <td>63376</td>\n",
       "      <td>False</td>\n",
       "      <td>20.391</td>\n",
       "      <td>3108</td>\n",
       "      <td>38dd52b1-b065-4328-a620-7b0f549f501c</td>\n",
       "      <td>2023-04-06T07:47:08Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10</td>\n",
       "      <td>2011-12-09 14:11:56</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Stay hungry, Stay foolish.</td>\n",
       "      <td>127</td>\n",
       "      <td>32</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>432537664</td>\n",
       "      <td>en</td>\n",
       "      <td>in the clouds.</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/809682832645824512/WOJIsCmg_normal.jpg</td>\n",
       "      <td>jainabdulaziz</td>\n",
       "      <td>921</td>\n",
       "      <td>False</td>\n",
       "      <td>0.29</td>\n",
       "      <td>3175</td>\n",
       "      <td>233d17cc-42f6-42f6-910f-0d0117ae8456</td>\n",
       "      <td>2022-07-18T17:09:36Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>14</td>\n",
       "      <td>2010-11-03 15:40:20</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Femminista, animalista, antiproibizionista... E altre -ista che ora non ricordo. Ora dirigo il sito de #LeIene 👊🏿</td>\n",
       "      <td>4071</td>\n",
       "      <td>252142</td>\n",
       "      <td>562</td>\n",
       "      <td>True</td>\n",
       "      <td>211550281</td>\n",
       "      <td>it</td>\n",
       "      <td>Italy</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1046488383684521988/KrnhpZyJ_normal.jpg</td>\n",
       "      <td>giuliainnocenzi</td>\n",
       "      <td>6029</td>\n",
       "      <td>True</td>\n",
       "      <td>1.686</td>\n",
       "      <td>3576</td>\n",
       "      <td>7925f5e0-c8d8-41f0-ad77-eeb5511a30d7</td>\n",
       "      <td>2022-05-23T16:54:39Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>15</td>\n",
       "      <td>2012-08-20 11:58:04</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Mi viene da vomitare.  \\n\\n                                                                                \\nDove sono le mie Jordan?</td>\n",
       "      <td>81967</td>\n",
       "      <td>5281</td>\n",
       "      <td>581</td>\n",
       "      <td>True</td>\n",
       "      <td>769392715</td>\n",
       "      <td>it</td>\n",
       "      <td>Oblio</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1271034506041085953/xWUZXxMm_normal.jpg</td>\n",
       "      <td>RichiMasu</td>\n",
       "      <td>106263</td>\n",
       "      <td>False</td>\n",
       "      <td>36.391</td>\n",
       "      <td>2920</td>\n",
       "      <td>9457b540-f911-4869-84a2-a6b758a7739e</td>\n",
       "      <td>2022-08-26T22:34:20Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  unnamed: 0           created_at default_profile default_profile_image  \\\n",
       "0          1  2016-11-09 05:01:30           False                 False   \n",
       "1          9  2012-02-14 15:33:48           False                 False   \n",
       "2         10  2011-12-09 14:11:56           False                 False   \n",
       "3         14  2010-11-03 15:40:20           False                 False   \n",
       "4         15  2012-08-20 11:58:04           False                 False   \n",
       "\n",
       "                                                                                                                                                        description  \\\n",
       "0  Photographing the American West since 1980. I specialize in location portraits & events, both indoors & outside, using natural light & portable studio lighting.   \n",
       "1                                             Man Utd fan. mostly here for football. Takes photos. Ex care worker, does stuff with computers often sarcastic. 🐝🇬🇧🇪🇺   \n",
       "2                                                                                                                                        Stay hungry, Stay foolish.   \n",
       "3                                                 Femminista, animalista, antiproibizionista... E altre -ista che ora non ricordo. Ora dirigo il sito de #LeIene 👊🏿   \n",
       "4                             Mi viene da vomitare.  \\n\\n                                                                                \\nDove sono le mie Jordan?   \n",
       "\n",
       "  favourites_count followers_count friends_count geo_enabled  \\\n",
       "0              536             860           880       False   \n",
       "1            36384            2130          3363        True   \n",
       "2              127              32             0       False   \n",
       "3             4071          252142           562        True   \n",
       "4            81967            5281           581        True   \n",
       "\n",
       "             EVENT_ID lang        location  \\\n",
       "0  796216118331310080   en  Estados Unidos   \n",
       "1           492306486   en  United Kingdom   \n",
       "2           432537664   en  in the clouds.   \n",
       "3           211550281   it           Italy   \n",
       "4           769392715   it           Oblio   \n",
       "\n",
       "                        profile_background_image_url  \\\n",
       "0   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "1  http://abs.twimg.com/images/themes/theme14/bg.gif   \n",
       "2   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "3   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "4   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "\n",
       "                                                             profile_image_url  \\\n",
       "0   http://pbs.twimg.com/profile_images/802329632838037504/CQN6gP7k_normal.jpg   \n",
       "1  http://pbs.twimg.com/profile_images/1211318786512609281/e6UqYEa4_normal.jpg   \n",
       "2   http://pbs.twimg.com/profile_images/809682832645824512/WOJIsCmg_normal.jpg   \n",
       "3  http://pbs.twimg.com/profile_images/1046488383684521988/KrnhpZyJ_normal.jpg   \n",
       "4  http://pbs.twimg.com/profile_images/1271034506041085953/xWUZXxMm_normal.jpg   \n",
       "\n",
       "       screen_name statuses_count verified average_tweets_per_day  \\\n",
       "0     CJRubinPhoto            252    False                  0.183   \n",
       "1       GhamGraham          63376    False                 20.391   \n",
       "2    jainabdulaziz            921    False                   0.29   \n",
       "3  giuliainnocenzi           6029     True                  1.686   \n",
       "4        RichiMasu         106263    False                 36.391   \n",
       "\n",
       "  account_age_days                             ENTITY_ID  \\\n",
       "0             1379  0a8d3859-dec4-4ba6-abae-74b5523b042f   \n",
       "1             3108  38dd52b1-b065-4328-a620-7b0f549f501c   \n",
       "2             3175  233d17cc-42f6-42f6-910f-0d0117ae8456   \n",
       "3             3576  7925f5e0-c8d8-41f0-ad77-eeb5511a30d7   \n",
       "4             2920  9457b540-f911-4869-84a2-a6b758a7739e   \n",
       "\n",
       "        EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "0  2022-08-04T08:04:08Z        user  \n",
       "1  2023-04-06T07:47:08Z        user  \n",
       "2  2022-07-18T17:09:36Z        user  \n",
       "3  2022-05-23T16:54:39Z        user  \n",
       "4  2022-08-26T22:34:20Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(7488, 22)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL\n",
       "0            0\n",
       "1            0\n",
       "2            0\n",
       "3            0\n",
       "4            0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    4987\n",
      "1    2501\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.668648\n",
      "1    0.331352\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "ipblock\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ip</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>dummy_cat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>128.1.248.44</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-11-16T04:03:42Z</td>\n",
       "      <td>2022-06-01T20:30:04Z</td>\n",
       "      <td>user</td>\n",
       "      <td>27dd3612-b997-4e9a-9442-eb08e0f7f923</td>\n",
       "      <td>068b7a8c-8d4a-49a3-ab3a-e4d905ace4cc</td>\n",
       "      <td>1253a4fb-cbfe-4e43-bc4c-4ecbe1cf58da</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>119.46.34.11</td>\n",
       "      <td>0</td>\n",
       "      <td>2022-04-22T04:24:50Z</td>\n",
       "      <td>2022-06-01T20:30:04Z</td>\n",
       "      <td>user</td>\n",
       "      <td>19474b6d-0af8-4610-b80e-485a43276e8a</td>\n",
       "      <td>9e41adf9-fc4d-4078-a005-c9a85950c858</td>\n",
       "      <td>63c81521-a604-4923-9c82-0e82878042d7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>186.172.135.47</td>\n",
       "      <td>0</td>\n",
       "      <td>2022-05-04T19:47:16Z</td>\n",
       "      <td>2022-06-01T20:30:04Z</td>\n",
       "      <td>user</td>\n",
       "      <td>0db63c1e-dd12-4b2b-a39c-254af2176a83</td>\n",
       "      <td>fea535c1-1b52-411d-a38e-adcbec57a95d</td>\n",
       "      <td>9ded7f8a-d6fe-414a-a2da-c5f09588ebce</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>181.133.0.112</td>\n",
       "      <td>0</td>\n",
       "      <td>2022-02-25T04:37:01Z</td>\n",
       "      <td>2022-06-01T20:30:04Z</td>\n",
       "      <td>user</td>\n",
       "      <td>8be34510-e76d-4b78-bb4f-b8f721b8abe5</td>\n",
       "      <td>10d04fd3-db0b-4096-bfbc-4262b280ed69</td>\n",
       "      <td>0d8f329f-dda5-47f0-bd16-430f8745b4d5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>51.4.204.17</td>\n",
       "      <td>0</td>\n",
       "      <td>2022-06-01T06:11:56Z</td>\n",
       "      <td>2022-06-01T20:30:04Z</td>\n",
       "      <td>user</td>\n",
       "      <td>1ae9a3e9-b410-4f36-a3bf-8b466fea97c1</td>\n",
       "      <td>27f40e1b-8a49-43ab-9b68-5ec24469845f</td>\n",
       "      <td>df1cb347-9712-4743-9801-d11eb415d823</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               ip  EVENT_LABEL       EVENT_TIMESTAMP       LABEL_TIMESTAMP  \\\n",
       "0    128.1.248.44            1  2021-11-16T04:03:42Z  2022-06-01T20:30:04Z   \n",
       "1    119.46.34.11            0  2022-04-22T04:24:50Z  2022-06-01T20:30:04Z   \n",
       "2  186.172.135.47            0  2022-05-04T19:47:16Z  2022-06-01T20:30:04Z   \n",
       "3   181.133.0.112            0  2022-02-25T04:37:01Z  2022-06-01T20:30:04Z   \n",
       "4     51.4.204.17            0  2022-06-01T06:11:56Z  2022-06-01T20:30:04Z   \n",
       "\n",
       "  ENTITY_TYPE                              EVENT_ID  \\\n",
       "0        user  27dd3612-b997-4e9a-9442-eb08e0f7f923   \n",
       "1        user  19474b6d-0af8-4610-b80e-485a43276e8a   \n",
       "2        user  0db63c1e-dd12-4b2b-a39c-254af2176a83   \n",
       "3        user  8be34510-e76d-4b78-bb4f-b8f721b8abe5   \n",
       "4        user  1ae9a3e9-b410-4f36-a3bf-8b466fea97c1   \n",
       "\n",
       "                              ENTITY_ID                             dummy_cat  \n",
       "0  068b7a8c-8d4a-49a3-ab3a-e4d905ace4cc  1253a4fb-cbfe-4e43-bc4c-4ecbe1cf58da  \n",
       "1  9e41adf9-fc4d-4078-a005-c9a85950c858  63c81521-a604-4923-9c82-0e82878042d7  \n",
       "2  fea535c1-1b52-411d-a38e-adcbec57a95d  9ded7f8a-d6fe-414a-a2da-c5f09588ebce  \n",
       "3  10d04fd3-db0b-4096-bfbc-4262b280ed69  0d8f329f-dda5-47f0-bd16-430f8745b4d5  \n",
       "4  27f40e1b-8a49-43ab-9b68-5ec24469845f  df1cb347-9712-4743-9801-d11eb415d823  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8\n",
      "(172000, 8)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ip</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>dummy_cat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.10.226.56</td>\n",
       "      <td>2022-03-25T13:09:37Z</td>\n",
       "      <td>user</td>\n",
       "      <td>c6bcbff7-7c2e-4780-8007-ed29cea7535b</td>\n",
       "      <td>1f18a08e-c6ab-4d32-a210-3aed5533c272</td>\n",
       "      <td>3a966adb-01d2-487a-b6ac-c4075be9aff0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.116.89.251</td>\n",
       "      <td>2021-12-19T04:06:53Z</td>\n",
       "      <td>user</td>\n",
       "      <td>3a1e89c3-9bd1-4f32-8b7e-69c97e2fba92</td>\n",
       "      <td>24625027-898a-4ed0-8e95-485b3fd39663</td>\n",
       "      <td>53e2a419-4652-4011-a9fb-6de2a4a1cfcd</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.117.176.186</td>\n",
       "      <td>2021-10-02T02:10:34Z</td>\n",
       "      <td>user</td>\n",
       "      <td>1a634e22-b87c-4981-ad66-587a82bfa6e8</td>\n",
       "      <td>2767ff1e-a6d1-4f3e-8f99-dbd2d47152c5</td>\n",
       "      <td>d48e48ca-fcf2-4594-8d37-6511b8ff51c7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.117.207.86</td>\n",
       "      <td>2021-10-31T12:18:58Z</td>\n",
       "      <td>user</td>\n",
       "      <td>dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3</td>\n",
       "      <td>7cd5f610-28c1-451a-a86c-75ce1d6b0385</td>\n",
       "      <td>3ae02507-69ef-46ba-b435-2abced9560fa</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.13.17.184</td>\n",
       "      <td>2022-03-08T14:20:40Z</td>\n",
       "      <td>user</td>\n",
       "      <td>59012ee9-8f28-48f2-a133-10cd558ed319</td>\n",
       "      <td>98f6bf66-a87e-4f59-a2d3-e5f673023611</td>\n",
       "      <td>76ed0fbf-70be-4702-9c3e-5abfba2dd536</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              ip       EVENT_TIMESTAMP ENTITY_TYPE  \\\n",
       "0    1.10.226.56  2022-03-25T13:09:37Z        user   \n",
       "1   1.116.89.251  2021-12-19T04:06:53Z        user   \n",
       "2  1.117.176.186  2021-10-02T02:10:34Z        user   \n",
       "3   1.117.207.86  2021-10-31T12:18:58Z        user   \n",
       "4    1.13.17.184  2022-03-08T14:20:40Z        user   \n",
       "\n",
       "                               EVENT_ID                             ENTITY_ID  \\\n",
       "0  c6bcbff7-7c2e-4780-8007-ed29cea7535b  1f18a08e-c6ab-4d32-a210-3aed5533c272   \n",
       "1  3a1e89c3-9bd1-4f32-8b7e-69c97e2fba92  24625027-898a-4ed0-8e95-485b3fd39663   \n",
       "2  1a634e22-b87c-4981-ad66-587a82bfa6e8  2767ff1e-a6d1-4f3e-8f99-dbd2d47152c5   \n",
       "3  dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3  7cd5f610-28c1-451a-a86c-75ce1d6b0385   \n",
       "4  59012ee9-8f28-48f2-a133-10cd558ed319  98f6bf66-a87e-4f59-a2d3-e5f673023611   \n",
       "\n",
       "                              dummy_cat  \n",
       "0  3a966adb-01d2-487a-b6ac-c4075be9aff0  \n",
       "1  53e2a419-4652-4011-a9fb-6de2a4a1cfcd  \n",
       "2  d48e48ca-fcf2-4594-8d37-6511b8ff51c7  \n",
       "3  3ae02507-69ef-46ba-b435-2abced9560fa  \n",
       "4  76ed0fbf-70be-4702-9c3e-5abfba2dd536  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(43000, 6)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>c6bcbff7-7c2e-4780-8007-ed29cea7535b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>3a1e89c3-9bd1-4f32-8b7e-69c97e2fba92</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1a634e22-b87c-4981-ad66-587a82bfa6e8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>59012ee9-8f28-48f2-a133-10cd558ed319</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL                              EVENT_ID\n",
       "0            1  c6bcbff7-7c2e-4780-8007-ed29cea7535b\n",
       "1            1  3a1e89c3-9bd1-4f32-8b7e-69c97e2fba92\n",
       "2            1  1a634e22-b87c-4981-ad66-587a82bfa6e8\n",
       "3            1  dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3\n",
       "4            1  59012ee9-8f28-48f2-a133-10cd558ed319"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    40003\n",
      "1     2997\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.930215\n",
      "1    0.069785\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n"
     ]
    }
   ],
   "source": [
    "# for key, val in KAGGLE_CONFIGS.items():\n",
    "for key in all_keys:\n",
    "    # Load the dataset for this key and preview its train/test splits.\n",
    "    obj = FraudDatasetBenchmark(key=key)\n",
    "    print(obj.key)\n",
    "    print('Train set: ')\n",
    "    display(obj.train.head())\n",
    "    print(len(obj.train.columns))\n",
    "    print(obj.train.shape)\n",
    "    print('Test set: ')\n",
    "    display(obj.test.head())\n",
    "    print(obj.test.shape)\n",
    "    # Test labels are stored separately from the test features.\n",
    "    print('Test scores')\n",
    "    display(obj.test_labels.head())\n",
    "    # Class counts in the test labels, then class proportions in the train set.\n",
    "    print(obj.test_labels['EVENT_LABEL'].value_counts())\n",
    "    print(obj.train['EVENT_LABEL'].value_counts(normalize=True))\n",
    "    print('=========','\\n')\n",
    "\n",
    "#         KEY= f'public/official-dataset-names/{val[\"name\"]}/train.csv'\n",
    "#         _s3_upload(obj.train)\n",
    "\n",
    "\n",
    "#         KEY= f'public/official-dataset-names/{val[\"name\"]}/test.csv'\n",
    "#         _s3_upload(obj.test)\n",
    "\n",
    "\n",
    "#         KEY= f'public/official-dataset-names/{val[\"name\"]}/test_labels.csv'\n",
    "#         _s3_upload(obj.test_labels)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Without random values in missing columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parameter settings\n",
    "\n",
    "- load_pre_downloaded: False\n",
    "- delete_downloaded: True\n",
    "- add_random_values_if_real_na:\n",
    "\n",
    "```\n",
    "{\n",
    "\"EVENT_TIMESTAMP\": False,\n",
    "\"LABEL_TIMESTAMP\": False,\n",
    "\"ENTITY_ID\": False,\n",
    "\"ENTITY_TYPE\": False,\n",
    "\"EVENT_ID\": False\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_keys = ['ccfraud']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "ccfraud\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v1</th>\n",
       "      <th>v2</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v5</th>\n",
       "      <th>v6</th>\n",
       "      <th>v7</th>\n",
       "      <th>v8</th>\n",
       "      <th>v9</th>\n",
       "      <th>v10</th>\n",
       "      <th>v11</th>\n",
       "      <th>v12</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v15</th>\n",
       "      <th>v16</th>\n",
       "      <th>v17</th>\n",
       "      <th>v18</th>\n",
       "      <th>v19</th>\n",
       "      <th>v20</th>\n",
       "      <th>v21</th>\n",
       "      <th>v22</th>\n",
       "      <th>v23</th>\n",
       "      <th>v24</th>\n",
       "      <th>v25</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v28</th>\n",
       "      <th>amount</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-1.3598071336738</td>\n",
       "      <td>-0.0727811733098497</td>\n",
       "      <td>2.53634673796914</td>\n",
       "      <td>1.37815522427443</td>\n",
       "      <td>-0.338320769942518</td>\n",
       "      <td>0.462387777762292</td>\n",
       "      <td>0.239598554061257</td>\n",
       "      <td>0.0986979012610507</td>\n",
       "      <td>0.363786969611213</td>\n",
       "      <td>0.0907941719789316</td>\n",
       "      <td>-0.551599533260813</td>\n",
       "      <td>-0.617800855762348</td>\n",
       "      <td>-0.991389847235408</td>\n",
       "      <td>-0.311169353699879</td>\n",
       "      <td>1.46817697209427</td>\n",
       "      <td>-0.470400525259478</td>\n",
       "      <td>0.207971241929242</td>\n",
       "      <td>0.0257905801985591</td>\n",
       "      <td>0.403992960255733</td>\n",
       "      <td>0.251412098239705</td>\n",
       "      <td>-0.018306777944153</td>\n",
       "      <td>0.277837575558899</td>\n",
       "      <td>-0.110473910188767</td>\n",
       "      <td>0.0669280749146731</td>\n",
       "      <td>0.128539358273528</td>\n",
       "      <td>-0.189114843888824</td>\n",
       "      <td>0.133558376740387</td>\n",
       "      <td>-0.0210530534538215</td>\n",
       "      <td>149.62</td>\n",
       "      <td>0</td>\n",
       "      <td>2021-09-01T00:00:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.19185711131486</td>\n",
       "      <td>0.26615071205963</td>\n",
       "      <td>0.16648011335321</td>\n",
       "      <td>0.448154078460911</td>\n",
       "      <td>0.0600176492822243</td>\n",
       "      <td>-0.0823608088155687</td>\n",
       "      <td>-0.0788029833323113</td>\n",
       "      <td>0.0851016549148104</td>\n",
       "      <td>-0.255425128109186</td>\n",
       "      <td>-0.166974414004614</td>\n",
       "      <td>1.61272666105479</td>\n",
       "      <td>1.06523531137287</td>\n",
       "      <td>0.48909501589608</td>\n",
       "      <td>-0.143772296441519</td>\n",
       "      <td>0.635558093258208</td>\n",
       "      <td>0.463917041022171</td>\n",
       "      <td>-0.114804663102346</td>\n",
       "      <td>-0.183361270123994</td>\n",
       "      <td>-0.145783041325259</td>\n",
       "      <td>-0.0690831352230203</td>\n",
       "      <td>-0.225775248033138</td>\n",
       "      <td>-0.638671952771851</td>\n",
       "      <td>0.101288021253234</td>\n",
       "      <td>-0.339846475529127</td>\n",
       "      <td>0.167170404418143</td>\n",
       "      <td>0.125894532368176</td>\n",
       "      <td>-0.00898309914322813</td>\n",
       "      <td>0.0147241691924927</td>\n",
       "      <td>2.69</td>\n",
       "      <td>0</td>\n",
       "      <td>2021-09-01T00:00:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.35835406159823</td>\n",
       "      <td>-1.34016307473609</td>\n",
       "      <td>1.77320934263119</td>\n",
       "      <td>0.379779593034328</td>\n",
       "      <td>-0.503198133318193</td>\n",
       "      <td>1.80049938079263</td>\n",
       "      <td>0.791460956450422</td>\n",
       "      <td>0.247675786588991</td>\n",
       "      <td>-1.51465432260583</td>\n",
       "      <td>0.207642865216696</td>\n",
       "      <td>0.624501459424895</td>\n",
       "      <td>0.066083685268831</td>\n",
       "      <td>0.717292731410831</td>\n",
       "      <td>-0.165945922763554</td>\n",
       "      <td>2.34586494901581</td>\n",
       "      <td>-2.89008319444231</td>\n",
       "      <td>1.10996937869599</td>\n",
       "      <td>-0.121359313195888</td>\n",
       "      <td>-2.26185709530414</td>\n",
       "      <td>0.524979725224404</td>\n",
       "      <td>0.247998153469754</td>\n",
       "      <td>0.771679401917229</td>\n",
       "      <td>0.909412262347719</td>\n",
       "      <td>-0.689280956490685</td>\n",
       "      <td>-0.327641833735251</td>\n",
       "      <td>-0.139096571514147</td>\n",
       "      <td>-0.0553527940384261</td>\n",
       "      <td>-0.0597518405929204</td>\n",
       "      <td>378.66</td>\n",
       "      <td>0</td>\n",
       "      <td>2021-09-01T00:01:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.966271711572087</td>\n",
       "      <td>-0.185226008082898</td>\n",
       "      <td>1.79299333957872</td>\n",
       "      <td>-0.863291275036453</td>\n",
       "      <td>-0.0103088796030823</td>\n",
       "      <td>1.24720316752486</td>\n",
       "      <td>0.23760893977178</td>\n",
       "      <td>0.377435874652262</td>\n",
       "      <td>-1.38702406270197</td>\n",
       "      <td>-0.0549519224713749</td>\n",
       "      <td>-0.226487263835401</td>\n",
       "      <td>0.178228225877303</td>\n",
       "      <td>0.507756869957169</td>\n",
       "      <td>-0.28792374549456</td>\n",
       "      <td>-0.631418117709045</td>\n",
       "      <td>-1.0596472454325</td>\n",
       "      <td>-0.684092786345479</td>\n",
       "      <td>1.96577500349538</td>\n",
       "      <td>-1.2326219700892</td>\n",
       "      <td>-0.208037781160366</td>\n",
       "      <td>-0.108300452035545</td>\n",
       "      <td>0.00527359678253453</td>\n",
       "      <td>-0.190320518742841</td>\n",
       "      <td>-1.17557533186321</td>\n",
       "      <td>0.647376034602038</td>\n",
       "      <td>-0.221928844458407</td>\n",
       "      <td>0.0627228487293033</td>\n",
       "      <td>0.0614576285006353</td>\n",
       "      <td>123.5</td>\n",
       "      <td>0</td>\n",
       "      <td>2021-09-01T00:01:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.15823309349523</td>\n",
       "      <td>0.877736754848451</td>\n",
       "      <td>1.548717846511</td>\n",
       "      <td>0.403033933955121</td>\n",
       "      <td>-0.407193377311653</td>\n",
       "      <td>0.0959214624684256</td>\n",
       "      <td>0.592940745385545</td>\n",
       "      <td>-0.270532677192282</td>\n",
       "      <td>0.817739308235294</td>\n",
       "      <td>0.753074431976354</td>\n",
       "      <td>-0.822842877946363</td>\n",
       "      <td>0.53819555014995</td>\n",
       "      <td>1.3458515932154</td>\n",
       "      <td>-1.11966983471731</td>\n",
       "      <td>0.175121130008994</td>\n",
       "      <td>-0.451449182813529</td>\n",
       "      <td>-0.237033239362776</td>\n",
       "      <td>-0.0381947870352842</td>\n",
       "      <td>0.803486924960175</td>\n",
       "      <td>0.408542360392758</td>\n",
       "      <td>-0.00943069713232919</td>\n",
       "      <td>0.79827849458971</td>\n",
       "      <td>-0.137458079619063</td>\n",
       "      <td>0.141266983824769</td>\n",
       "      <td>-0.206009587619756</td>\n",
       "      <td>0.502292224181569</td>\n",
       "      <td>0.219422229513348</td>\n",
       "      <td>0.215153147499206</td>\n",
       "      <td>69.99</td>\n",
       "      <td>0</td>\n",
       "      <td>2021-09-01T00:02:00Z</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   v1                   v2                v3  \\\n",
       "0    -1.3598071336738  -0.0727811733098497  2.53634673796914   \n",
       "1    1.19185711131486     0.26615071205963  0.16648011335321   \n",
       "2   -1.35835406159823    -1.34016307473609  1.77320934263119   \n",
       "3  -0.966271711572087   -0.185226008082898  1.79299333957872   \n",
       "4   -1.15823309349523    0.877736754848451    1.548717846511   \n",
       "\n",
       "                   v4                   v5                   v6  \\\n",
       "0    1.37815522427443   -0.338320769942518    0.462387777762292   \n",
       "1   0.448154078460911   0.0600176492822243  -0.0823608088155687   \n",
       "2   0.379779593034328   -0.503198133318193     1.80049938079263   \n",
       "3  -0.863291275036453  -0.0103088796030823     1.24720316752486   \n",
       "4   0.403033933955121   -0.407193377311653   0.0959214624684256   \n",
       "\n",
       "                    v7                  v8                  v9  \\\n",
       "0    0.239598554061257  0.0986979012610507   0.363786969611213   \n",
       "1  -0.0788029833323113  0.0851016549148104  -0.255425128109186   \n",
       "2    0.791460956450422   0.247675786588991   -1.51465432260583   \n",
       "3     0.23760893977178   0.377435874652262   -1.38702406270197   \n",
       "4    0.592940745385545  -0.270532677192282   0.817739308235294   \n",
       "\n",
       "                   v10                 v11                 v12  \\\n",
       "0   0.0907941719789316  -0.551599533260813  -0.617800855762348   \n",
       "1   -0.166974414004614    1.61272666105479    1.06523531137287   \n",
       "2    0.207642865216696   0.624501459424895   0.066083685268831   \n",
       "3  -0.0549519224713749  -0.226487263835401   0.178228225877303   \n",
       "4    0.753074431976354  -0.822842877946363    0.53819555014995   \n",
       "\n",
       "                  v13                 v14                 v15  \\\n",
       "0  -0.991389847235408  -0.311169353699879    1.46817697209427   \n",
       "1    0.48909501589608  -0.143772296441519   0.635558093258208   \n",
       "2   0.717292731410831  -0.165945922763554    2.34586494901581   \n",
       "3   0.507756869957169   -0.28792374549456  -0.631418117709045   \n",
       "4     1.3458515932154   -1.11966983471731   0.175121130008994   \n",
       "\n",
       "                  v16                 v17                  v18  \\\n",
       "0  -0.470400525259478   0.207971241929242   0.0257905801985591   \n",
       "1   0.463917041022171  -0.114804663102346   -0.183361270123994   \n",
       "2   -2.89008319444231    1.10996937869599   -0.121359313195888   \n",
       "3    -1.0596472454325  -0.684092786345479     1.96577500349538   \n",
       "4  -0.451449182813529  -0.237033239362776  -0.0381947870352842   \n",
       "\n",
       "                  v19                  v20                   v21  \\\n",
       "0   0.403992960255733    0.251412098239705    -0.018306777944153   \n",
       "1  -0.145783041325259  -0.0690831352230203    -0.225775248033138   \n",
       "2   -2.26185709530414    0.524979725224404     0.247998153469754   \n",
       "3    -1.2326219700892   -0.208037781160366    -0.108300452035545   \n",
       "4   0.803486924960175    0.408542360392758  -0.00943069713232919   \n",
       "\n",
       "                   v22                 v23                 v24  \\\n",
       "0    0.277837575558899  -0.110473910188767  0.0669280749146731   \n",
       "1   -0.638671952771851   0.101288021253234  -0.339846475529127   \n",
       "2    0.771679401917229   0.909412262347719  -0.689280956490685   \n",
       "3  0.00527359678253453  -0.190320518742841   -1.17557533186321   \n",
       "4     0.79827849458971  -0.137458079619063   0.141266983824769   \n",
       "\n",
       "                  v25                 v26                   v27  \\\n",
       "0   0.128539358273528  -0.189114843888824     0.133558376740387   \n",
       "1   0.167170404418143   0.125894532368176  -0.00898309914322813   \n",
       "2  -0.327641833735251  -0.139096571514147   -0.0553527940384261   \n",
       "3   0.647376034602038  -0.221928844458407    0.0627228487293033   \n",
       "4  -0.206009587619756   0.502292224181569     0.219422229513348   \n",
       "\n",
       "                   v28  amount  EVENT_LABEL       EVENT_TIMESTAMP  \n",
       "0  -0.0210530534538215  149.62            0  2021-09-01T00:00:00Z  \n",
       "1   0.0147241691924927    2.69            0  2021-09-01T00:00:00Z  \n",
       "2  -0.0597518405929204  378.66            0  2021-09-01T00:01:00Z  \n",
       "3   0.0614576285006353   123.5            0  2021-09-01T00:01:00Z  \n",
       "4    0.215153147499206   69.99            0  2021-09-01T00:02:00Z  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "31\n",
      "(227845, 31)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v1</th>\n",
       "      <th>v2</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v5</th>\n",
       "      <th>v6</th>\n",
       "      <th>v7</th>\n",
       "      <th>v8</th>\n",
       "      <th>v9</th>\n",
       "      <th>v10</th>\n",
       "      <th>v11</th>\n",
       "      <th>v12</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v15</th>\n",
       "      <th>v16</th>\n",
       "      <th>v17</th>\n",
       "      <th>v18</th>\n",
       "      <th>v19</th>\n",
       "      <th>v20</th>\n",
       "      <th>v21</th>\n",
       "      <th>v22</th>\n",
       "      <th>v23</th>\n",
       "      <th>v24</th>\n",
       "      <th>v25</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v28</th>\n",
       "      <th>amount</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>227845</th>\n",
       "      <td>1.91402682161454</td>\n",
       "      <td>-0.490067987909997</td>\n",
       "      <td>-0.326111312515118</td>\n",
       "      <td>0.604710739174721</td>\n",
       "      <td>-0.8501359998436</td>\n",
       "      <td>-0.736318677031096</td>\n",
       "      <td>-0.524057962475328</td>\n",
       "      <td>-0.0886141066361987</td>\n",
       "      <td>1.09112510472248</td>\n",
       "      <td>0.093484357816225</td>\n",
       "      <td>-0.892304625856107</td>\n",
       "      <td>0.0272205159068718</td>\n",
       "      <td>-0.243790209618721</td>\n",
       "      <td>0.0317740067189187</td>\n",
       "      <td>0.900623897113791</td>\n",
       "      <td>0.536032161644219</td>\n",
       "      <td>-0.648408094097169</td>\n",
       "      <td>0.183072340001028</td>\n",
       "      <td>-0.48632249422331</td>\n",
       "      <td>-0.13957876335222</td>\n",
       "      <td>0.210958428878652</td>\n",
       "      <td>0.639337879054097</td>\n",
       "      <td>0.147522551988298</td>\n",
       "      <td>0.0736542664022496</td>\n",
       "      <td>-0.318378246601246</td>\n",
       "      <td>0.350612262707235</td>\n",
       "      <td>-0.0238434747433154</td>\n",
       "      <td>-0.0371393315055126</td>\n",
       "      <td>50</td>\n",
       "      <td>2021-12-10T20:48:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227846</th>\n",
       "      <td>2.15269624649984</td>\n",
       "      <td>-0.036160786158066</td>\n",
       "      <td>-2.23181098049803</td>\n",
       "      <td>0.0917658435583919</td>\n",
       "      <td>0.537612206488446</td>\n",
       "      <td>-1.36810250972644</td>\n",
       "      <td>0.613326738349479</td>\n",
       "      <td>-0.455251954849699</td>\n",
       "      <td>0.29181359004335</td>\n",
       "      <td>0.253161344559488</td>\n",
       "      <td>-1.50188197076942</td>\n",
       "      <td>-0.870607641524177</td>\n",
       "      <td>-1.44173756499372</td>\n",
       "      <td>0.988756626201074</td>\n",
       "      <td>0.496349234837293</td>\n",
       "      <td>-0.0686989613348823</td>\n",
       "      <td>-0.454073497932566</td>\n",
       "      <td>-0.299095262736551</td>\n",
       "      <td>0.267443131415241</td>\n",
       "      <td>-0.275777914750361</td>\n",
       "      <td>0.0171533555339963</td>\n",
       "      <td>0.0632416225359206</td>\n",
       "      <td>-0.0345611249491173</td>\n",
       "      <td>-0.626866212626912</td>\n",
       "      <td>0.249213129413917</td>\n",
       "      <td>0.773930519516097</td>\n",
       "      <td>-0.137114784582898</td>\n",
       "      <td>-0.0906106088420727</td>\n",
       "      <td>14.95</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227847</th>\n",
       "      <td>-4.03479516717275</td>\n",
       "      <td>2.30507905571504</td>\n",
       "      <td>-1.46169292457709</td>\n",
       "      <td>-0.729887055238227</td>\n",
       "      <td>-1.5287503399573</td>\n",
       "      <td>-1.22567909778369</td>\n",
       "      <td>-0.893353679497868</td>\n",
       "      <td>1.62252199369554</td>\n",
       "      <td>1.29199841774415</td>\n",
       "      <td>-0.0409558359937061</td>\n",
       "      <td>-0.971425287697512</td>\n",
       "      <td>0.574743695630458</td>\n",
       "      <td>0.155656078919204</td>\n",
       "      <td>-0.729054997889385</td>\n",
       "      <td>0.477438947999659</td>\n",
       "      <td>1.06171851569252</td>\n",
       "      <td>0.93469475367536</td>\n",
       "      <td>0.403768792198479</td>\n",
       "      <td>-0.494929851777981</td>\n",
       "      <td>-0.0810925858921718</td>\n",
       "      <td>-0.392556502541116</td>\n",
       "      <td>-0.78759906251576</td>\n",
       "      <td>0.343467795972994</td>\n",
       "      <td>-0.0903313999840935</td>\n",
       "      <td>0.248286972151669</td>\n",
       "      <td>-0.238523845342424</td>\n",
       "      <td>0.26648354183946</td>\n",
       "      <td>-0.0622361634691654</td>\n",
       "      <td>7.7</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227848</th>\n",
       "      <td>-1.66874106862583</td>\n",
       "      <td>1.16805471760364</td>\n",
       "      <td>0.249642461553748</td>\n",
       "      <td>-1.26849748925032</td>\n",
       "      <td>0.785922573014156</td>\n",
       "      <td>-0.663958562166729</td>\n",
       "      <td>0.859432973616895</td>\n",
       "      <td>0.0681106263347446</td>\n",
       "      <td>-0.144183044927318</td>\n",
       "      <td>0.0432880841287975</td>\n",
       "      <td>0.542013736060061</td>\n",
       "      <td>1.00202450469061</td>\n",
       "      <td>0.400759595743433</td>\n",
       "      <td>0.136412487776037</td>\n",
       "      <td>-1.28964902448879</td>\n",
       "      <td>0.276827961550432</td>\n",
       "      <td>-0.868491702025561</td>\n",
       "      <td>-0.366839507131127</td>\n",
       "      <td>-0.187391599008302</td>\n",
       "      <td>-0.0335233340620367</td>\n",
       "      <td>-0.247543775399679</td>\n",
       "      <td>-0.592536769878023</td>\n",
       "      <td>-0.286693549546811</td>\n",
       "      <td>-0.378855664973759</td>\n",
       "      <td>-0.0774289041638705</td>\n",
       "      <td>0.0676084004301294</td>\n",
       "      <td>-0.27896200360197</td>\n",
       "      <td>-0.0641926690992577</td>\n",
       "      <td>6.99</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227849</th>\n",
       "      <td>-0.550678353341949</td>\n",
       "      <td>-0.429004102182237</td>\n",
       "      <td>-1.29189255347072</td>\n",
       "      <td>-0.414409226593379</td>\n",
       "      <td>-0.292228538671312</td>\n",
       "      <td>0.071842939235058</td>\n",
       "      <td>2.42606795091335</td>\n",
       "      <td>-0.212729758223082</td>\n",
       "      <td>0.412374372851086</td>\n",
       "      <td>-1.93996940549555</td>\n",
       "      <td>-1.81011838293809</td>\n",
       "      <td>-1.22351031687552</td>\n",
       "      <td>-1.32491464932768</td>\n",
       "      <td>-1.46239178995552</td>\n",
       "      <td>-0.31164055759838</td>\n",
       "      <td>0.506707760378257</td>\n",
       "      <td>0.739932584638577</td>\n",
       "      <td>0.892422017204659</td>\n",
       "      <td>0.195042529037103</td>\n",
       "      <td>0.791126747715284</td>\n",
       "      <td>0.00303193944814891</td>\n",
       "      <td>-0.645782978858753</td>\n",
       "      <td>0.877016475964068</td>\n",
       "      <td>-1.22852893747944</td>\n",
       "      <td>-0.0362812174160739</td>\n",
       "      <td>-0.110609895882901</td>\n",
       "      <td>-0.0983803135271981</td>\n",
       "      <td>0.0959849443846813</td>\n",
       "      <td>460.71</td>\n",
       "      <td>2021-12-10T20:50:00Z</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        v1                  v2                  v3  \\\n",
       "227845    1.91402682161454  -0.490067987909997  -0.326111312515118   \n",
       "227846    2.15269624649984  -0.036160786158066   -2.23181098049803   \n",
       "227847   -4.03479516717275    2.30507905571504   -1.46169292457709   \n",
       "227848   -1.66874106862583    1.16805471760364   0.249642461553748   \n",
       "227849  -0.550678353341949  -0.429004102182237   -1.29189255347072   \n",
       "\n",
       "                        v4                  v5                  v6  \\\n",
       "227845   0.604710739174721    -0.8501359998436  -0.736318677031096   \n",
       "227846  0.0917658435583919   0.537612206488446   -1.36810250972644   \n",
       "227847  -0.729887055238227    -1.5287503399573   -1.22567909778369   \n",
       "227848   -1.26849748925032   0.785922573014156  -0.663958562166729   \n",
       "227849  -0.414409226593379  -0.292228538671312   0.071842939235058   \n",
       "\n",
       "                        v7                   v8                  v9  \\\n",
       "227845  -0.524057962475328  -0.0886141066361987    1.09112510472248   \n",
       "227846   0.613326738349479   -0.455251954849699    0.29181359004335   \n",
       "227847  -0.893353679497868     1.62252199369554    1.29199841774415   \n",
       "227848   0.859432973616895   0.0681106263347446  -0.144183044927318   \n",
       "227849    2.42606795091335   -0.212729758223082   0.412374372851086   \n",
       "\n",
       "                        v10                 v11                 v12  \\\n",
       "227845    0.093484357816225  -0.892304625856107  0.0272205159068718   \n",
       "227846    0.253161344559488   -1.50188197076942  -0.870607641524177   \n",
       "227847  -0.0409558359937061  -0.971425287697512   0.574743695630458   \n",
       "227848   0.0432880841287975   0.542013736060061    1.00202450469061   \n",
       "227849    -1.93996940549555   -1.81011838293809   -1.22351031687552   \n",
       "\n",
       "                       v13                 v14                v15  \\\n",
       "227845  -0.243790209618721  0.0317740067189187  0.900623897113791   \n",
       "227846   -1.44173756499372   0.988756626201074  0.496349234837293   \n",
       "227847   0.155656078919204  -0.729054997889385  0.477438947999659   \n",
       "227848   0.400759595743433   0.136412487776037  -1.28964902448879   \n",
       "227849   -1.32491464932768   -1.46239178995552  -0.31164055759838   \n",
       "\n",
       "                        v16                 v17                 v18  \\\n",
       "227845    0.536032161644219  -0.648408094097169   0.183072340001028   \n",
       "227846  -0.0686989613348823  -0.454073497932566  -0.299095262736551   \n",
       "227847     1.06171851569252    0.93469475367536   0.403768792198479   \n",
       "227848    0.276827961550432  -0.868491702025561  -0.366839507131127   \n",
       "227849    0.506707760378257   0.739932584638577   0.892422017204659   \n",
       "\n",
       "                       v19                  v20                  v21  \\\n",
       "227845   -0.48632249422331    -0.13957876335222    0.210958428878652   \n",
       "227846   0.267443131415241   -0.275777914750361   0.0171533555339963   \n",
       "227847  -0.494929851777981  -0.0810925858921718   -0.392556502541116   \n",
       "227848  -0.187391599008302  -0.0335233340620367   -0.247543775399679   \n",
       "227849   0.195042529037103    0.791126747715284  0.00303193944814891   \n",
       "\n",
       "                       v22                  v23                  v24  \\\n",
       "227845   0.639337879054097    0.147522551988298   0.0736542664022496   \n",
       "227846  0.0632416225359206  -0.0345611249491173   -0.626866212626912   \n",
       "227847   -0.78759906251576    0.343467795972994  -0.0903313999840935   \n",
       "227848  -0.592536769878023   -0.286693549546811   -0.378855664973759   \n",
       "227849  -0.645782978858753    0.877016475964068    -1.22852893747944   \n",
       "\n",
       "                        v25                 v26                  v27  \\\n",
       "227845   -0.318378246601246   0.350612262707235  -0.0238434747433154   \n",
       "227846    0.249213129413917   0.773930519516097   -0.137114784582898   \n",
       "227847    0.248286972151669  -0.238523845342424     0.26648354183946   \n",
       "227848  -0.0774289041638705  0.0676084004301294    -0.27896200360197   \n",
       "227849  -0.0362812174160739  -0.110609895882901  -0.0983803135271981   \n",
       "\n",
       "                        v28  amount       EVENT_TIMESTAMP  \n",
       "227845  -0.0371393315055126      50  2021-12-10T20:48:00Z  \n",
       "227846  -0.0906106088420727   14.95  2021-12-10T20:49:00Z  \n",
       "227847  -0.0622361634691654     7.7  2021-12-10T20:49:00Z  \n",
       "227848  -0.0641926690992577    6.99  2021-12-10T20:49:00Z  \n",
       "227849   0.0959849443846813  460.71  2021-12-10T20:50:00Z  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(56962, 30)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>227845</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227846</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227847</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227848</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227849</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        EVENT_LABEL\n",
       "227845            0\n",
       "227846            0\n",
       "227847            0\n",
       "227848            0\n",
       "227849            0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    56887\n",
      "1       75\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.99817\n",
      "1    0.00183\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for key in all_keys:\n",
    "    obj = FraudDatasetBenchmark(key=key,  \n",
    "                                add_random_values_if_real_na = { \"EVENT_TIMESTAMP\": False, \n",
    "                                                                \"LABEL_TIMESTAMP\": False,\n",
    "                                                                \"ENTITY_ID\": False,\n",
    "                                                                \"ENTITY_TYPE\": False,\n",
    "                                                                \"EVENT_ID\": False})\n",
    "    print(obj.key)\n",
    "    print('Train set: ')\n",
    "    display(obj.train.head())\n",
    "    print(len(obj.train.columns))\n",
    "    print(obj.train.shape)\n",
    "    print('Test set: ')\n",
    "    display(obj.test.head())\n",
    "    print(obj.test.shape)\n",
    "    print('Test scores')\n",
    "    display(obj.test_labels.head())\n",
    "    print(obj.test_labels['EVENT_LABEL'].value_counts())\n",
    "    print(obj.train['EVENT_LABEL'].value_counts(normalize=True))\n",
    "    print('=========','\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Persisting downloaded data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Important**: If you are running multiple experiments, download from Kaggle multiple times might exceed account level API call limits. So persisting the downloaded dataset is recommended in such scenarios"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### First download but not delete the data "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parameter settings\n",
    "\n",
    "- load_pre_downloaded: False\n",
    "- delete_downloaded: False\n",
    "- add_random_values_if_real_na = ```\n",
    "{\n",
    "\"EVENT_TIMESTAMP\": False,\n",
    "\"LABEL_TIMESTAMP\": False,\n",
    "\"ENTITY_ID\": False,\n",
    "\"ENTITY_TYPE\": True,\n",
    "\"EVENT_ID\": True\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_keys = ['twitterbot']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "twitterbot\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unnamed: 0</th>\n",
       "      <th>created_at</th>\n",
       "      <th>default_profile</th>\n",
       "      <th>default_profile_image</th>\n",
       "      <th>description</th>\n",
       "      <th>favourites_count</th>\n",
       "      <th>followers_count</th>\n",
       "      <th>friends_count</th>\n",
       "      <th>geo_enabled</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>lang</th>\n",
       "      <th>location</th>\n",
       "      <th>profile_background_image_url</th>\n",
       "      <th>profile_image_url</th>\n",
       "      <th>screen_name</th>\n",
       "      <th>statuses_count</th>\n",
       "      <th>verified</th>\n",
       "      <th>average_tweets_per_day</th>\n",
       "      <th>account_age_days</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20963</th>\n",
       "      <td>20963</td>\n",
       "      <td>2013-05-27 21:22:15</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.</td>\n",
       "      <td>32374</td>\n",
       "      <td>2395</td>\n",
       "      <td>2823</td>\n",
       "      <td>True</td>\n",
       "      <td>1463172686</td>\n",
       "      <td>en</td>\n",
       "      <td>Mount Morris, MI</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg</td>\n",
       "      <td>blerdwords</td>\n",
       "      <td>11448</td>\n",
       "      <td>False</td>\n",
       "      <td>4.336</td>\n",
       "      <td>2640</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6331</th>\n",
       "      <td>6331</td>\n",
       "      <td>2009-09-14 18:58:36</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo</td>\n",
       "      <td>68664</td>\n",
       "      <td>350789</td>\n",
       "      <td>1528</td>\n",
       "      <td>False</td>\n",
       "      <td>74231747</td>\n",
       "      <td>en</td>\n",
       "      <td>Los Angeles - Always a Texan</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme9/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg</td>\n",
       "      <td>JennyJohnsonHi5</td>\n",
       "      <td>18732</td>\n",
       "      <td>True</td>\n",
       "      <td>4.694</td>\n",
       "      <td>3991</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17209</th>\n",
       "      <td>17209</td>\n",
       "      <td>2010-06-06 16:27:08</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>54</td>\n",
       "      <td>657</td>\n",
       "      <td>False</td>\n",
       "      <td>152688783</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Abu Dhabi</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png</td>\n",
       "      <td>AbnerJosh</td>\n",
       "      <td>161</td>\n",
       "      <td>False</td>\n",
       "      <td>0.043</td>\n",
       "      <td>3726</td>\n",
       "      <td>1</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23964</th>\n",
       "      <td>23964</td>\n",
       "      <td>2010-06-22 21:56:09</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.</td>\n",
       "      <td>1517</td>\n",
       "      <td>55881</td>\n",
       "      <td>991</td>\n",
       "      <td>True</td>\n",
       "      <td>158502985</td>\n",
       "      <td>en</td>\n",
       "      <td>Regina, SK Canada</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg</td>\n",
       "      <td>GlobalRegina</td>\n",
       "      <td>103379</td>\n",
       "      <td>True</td>\n",
       "      <td>27.865</td>\n",
       "      <td>3710</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30569</th>\n",
       "      <td>30569</td>\n",
       "      <td>2009-03-10 02:26:45</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Detritus</td>\n",
       "      <td>2616</td>\n",
       "      <td>1118405</td>\n",
       "      <td>657</td>\n",
       "      <td>False</td>\n",
       "      <td>23544268</td>\n",
       "      <td>no</td>\n",
       "      <td>unknown</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme11/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg</td>\n",
       "      <td>OfficialKat</td>\n",
       "      <td>4980</td>\n",
       "      <td>True</td>\n",
       "      <td>1.191</td>\n",
       "      <td>4180</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      unnamed: 0           created_at default_profile default_profile_image  \\\n",
       "20963      20963  2013-05-27 21:22:15            True                 False   \n",
       "6331        6331  2009-09-14 18:58:36           False                 False   \n",
       "17209      17209  2010-06-06 16:27:08            True                 False   \n",
       "23964      23964  2010-06-22 21:56:09           False                 False   \n",
       "30569      30569  2009-03-10 02:26:45           False                 False   \n",
       "\n",
       "                                                                                                                                                           description  \\\n",
       "20963     WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.   \n",
       "6331   Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo   \n",
       "17209                                                                                                                                                              NaN   \n",
       "23964                    Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.   \n",
       "30569                                                                                                                                                         Detritus   \n",
       "\n",
       "      favourites_count followers_count friends_count geo_enabled    EVENT_ID  \\\n",
       "20963            32374            2395          2823        True  1463172686   \n",
       "6331             68664          350789          1528       False    74231747   \n",
       "17209               74              54           657       False   152688783   \n",
       "23964             1517           55881           991        True   158502985   \n",
       "30569             2616         1118405           657       False    23544268   \n",
       "\n",
       "      lang                      location  \\\n",
       "20963   en              Mount Morris, MI   \n",
       "6331    en  Los Angeles - Always a Texan   \n",
       "17209  NaN                     Abu Dhabi   \n",
       "23964   en             Regina, SK Canada   \n",
       "30569   no                       unknown   \n",
       "\n",
       "                            profile_background_image_url  \\\n",
       "20963   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "6331    http://abs.twimg.com/images/themes/theme9/bg.gif   \n",
       "17209   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "23964   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "30569  http://abs.twimg.com/images/themes/theme11/bg.gif   \n",
       "\n",
       "                                                                 profile_image_url  \\\n",
       "20963  http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg   \n",
       "6331    http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg   \n",
       "17209         http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png   \n",
       "23964   http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg   \n",
       "30569  http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg   \n",
       "\n",
       "           screen_name statuses_count verified average_tweets_per_day  \\\n",
       "20963       blerdwords          11448    False                  4.336   \n",
       "6331   JennyJohnsonHi5          18732     True                  4.694   \n",
       "17209        AbnerJosh            161    False                  0.043   \n",
       "23964     GlobalRegina         103379     True                 27.865   \n",
       "30569      OfficialKat           4980     True                  1.191   \n",
       "\n",
       "      account_age_days  EVENT_LABEL ENTITY_TYPE  \n",
       "20963             2640            0        user  \n",
       "6331              3991            0        user  \n",
       "17209             3726            1        user  \n",
       "23964             3710            0        user  \n",
       "30569             4180            0        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21\n",
      "(29950, 21)\n",
      "========= \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for key in all_keys:\n",
    "    obj = FraudDatasetBenchmark(key=key,  \n",
    "                                delete_downloaded=False,\n",
    "                                add_random_values_if_real_na = { \"EVENT_TIMESTAMP\": False, \"LABEL_TIMESTAMP\": False, \"ENTITY_ID\": False, \"ENTITY_TYPE\": True })\n",
    "    print(obj.key)\n",
    "    print('Train set: ')\n",
    "    display(obj.train.head())\n",
    "    print(len(obj.train.columns))\n",
    "    print(obj.train.shape)\n",
    "    print('=========','\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Now load from previously downloaded data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parameter settings\n",
    "\n",
    "- load_pre_downloaded: True\n",
    "- delete_downloaded: False\n",
    "- add_random_values_if_real_na = ```\n",
    "{\n",
    "\"EVENT_TIMESTAMP\": False,\n",
    "\"LABEL_TIMESTAMP\": False,\n",
    "\"ENTITY_ID\": False,\n",
    "\"ENTITY_TYPE\": True,\n",
    "\"EVENT_ID\": True\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_keys = ['twitterbot']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "twitterbot\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>unnamed: 0</th>\n",
       "      <th>created_at</th>\n",
       "      <th>default_profile</th>\n",
       "      <th>default_profile_image</th>\n",
       "      <th>description</th>\n",
       "      <th>favourites_count</th>\n",
       "      <th>followers_count</th>\n",
       "      <th>friends_count</th>\n",
       "      <th>geo_enabled</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>lang</th>\n",
       "      <th>location</th>\n",
       "      <th>profile_background_image_url</th>\n",
       "      <th>profile_image_url</th>\n",
       "      <th>screen_name</th>\n",
       "      <th>statuses_count</th>\n",
       "      <th>verified</th>\n",
       "      <th>average_tweets_per_day</th>\n",
       "      <th>account_age_days</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20963</th>\n",
       "      <td>20963</td>\n",
       "      <td>2013-05-27 21:22:15</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.</td>\n",
       "      <td>32374</td>\n",
       "      <td>2395</td>\n",
       "      <td>2823</td>\n",
       "      <td>True</td>\n",
       "      <td>1463172686</td>\n",
       "      <td>en</td>\n",
       "      <td>Mount Morris, MI</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg</td>\n",
       "      <td>blerdwords</td>\n",
       "      <td>11448</td>\n",
       "      <td>False</td>\n",
       "      <td>4.336</td>\n",
       "      <td>2640</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6331</th>\n",
       "      <td>6331</td>\n",
       "      <td>2009-09-14 18:58:36</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo</td>\n",
       "      <td>68664</td>\n",
       "      <td>350789</td>\n",
       "      <td>1528</td>\n",
       "      <td>False</td>\n",
       "      <td>74231747</td>\n",
       "      <td>en</td>\n",
       "      <td>Los Angeles - Always a Texan</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme9/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg</td>\n",
       "      <td>JennyJohnsonHi5</td>\n",
       "      <td>18732</td>\n",
       "      <td>True</td>\n",
       "      <td>4.694</td>\n",
       "      <td>3991</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17209</th>\n",
       "      <td>17209</td>\n",
       "      <td>2010-06-06 16:27:08</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>74</td>\n",
       "      <td>54</td>\n",
       "      <td>657</td>\n",
       "      <td>False</td>\n",
       "      <td>152688783</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Abu Dhabi</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png</td>\n",
       "      <td>AbnerJosh</td>\n",
       "      <td>161</td>\n",
       "      <td>False</td>\n",
       "      <td>0.043</td>\n",
       "      <td>3726</td>\n",
       "      <td>1</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23964</th>\n",
       "      <td>23964</td>\n",
       "      <td>2010-06-22 21:56:09</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.</td>\n",
       "      <td>1517</td>\n",
       "      <td>55881</td>\n",
       "      <td>991</td>\n",
       "      <td>True</td>\n",
       "      <td>158502985</td>\n",
       "      <td>en</td>\n",
       "      <td>Regina, SK Canada</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme1/bg.png</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg</td>\n",
       "      <td>GlobalRegina</td>\n",
       "      <td>103379</td>\n",
       "      <td>True</td>\n",
       "      <td>27.865</td>\n",
       "      <td>3710</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30569</th>\n",
       "      <td>30569</td>\n",
       "      <td>2009-03-10 02:26:45</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>Detritus</td>\n",
       "      <td>2616</td>\n",
       "      <td>1118405</td>\n",
       "      <td>657</td>\n",
       "      <td>False</td>\n",
       "      <td>23544268</td>\n",
       "      <td>no</td>\n",
       "      <td>unknown</td>\n",
       "      <td>http://abs.twimg.com/images/themes/theme11/bg.gif</td>\n",
       "      <td>http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg</td>\n",
       "      <td>OfficialKat</td>\n",
       "      <td>4980</td>\n",
       "      <td>True</td>\n",
       "      <td>1.191</td>\n",
       "      <td>4180</td>\n",
       "      <td>0</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      unnamed: 0           created_at default_profile default_profile_image  \\\n",
       "20963      20963  2013-05-27 21:22:15            True                 False   \n",
       "6331        6331  2009-09-14 18:58:36           False                 False   \n",
       "17209      17209  2010-06-06 16:27:08            True                 False   \n",
       "23964      23964  2010-06-22 21:56:09           False                 False   \n",
       "30569      30569  2009-03-10 02:26:45           False                 False   \n",
       "\n",
       "                                                                                                                                                           description  \\\n",
       "20963     WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.   \n",
       "6331   Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo   \n",
       "17209                                                                                                                                                              NaN   \n",
       "23964                    Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.   \n",
       "30569                                                                                                                                                         Detritus   \n",
       "\n",
       "      favourites_count followers_count friends_count geo_enabled    EVENT_ID  \\\n",
       "20963            32374            2395          2823        True  1463172686   \n",
       "6331             68664          350789          1528       False    74231747   \n",
       "17209               74              54           657       False   152688783   \n",
       "23964             1517           55881           991        True   158502985   \n",
       "30569             2616         1118405           657       False    23544268   \n",
       "\n",
       "      lang                      location  \\\n",
       "20963   en              Mount Morris, MI   \n",
       "6331    en  Los Angeles - Always a Texan   \n",
       "17209  NaN                     Abu Dhabi   \n",
       "23964   en             Regina, SK Canada   \n",
       "30569   no                       unknown   \n",
       "\n",
       "                            profile_background_image_url  \\\n",
       "20963   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "6331    http://abs.twimg.com/images/themes/theme9/bg.gif   \n",
       "17209   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "23964   http://abs.twimg.com/images/themes/theme1/bg.png   \n",
       "30569  http://abs.twimg.com/images/themes/theme11/bg.gif   \n",
       "\n",
       "                                                                 profile_image_url  \\\n",
       "20963  http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg   \n",
       "6331    http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg   \n",
       "17209         http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png   \n",
       "23964   http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg   \n",
       "30569  http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg   \n",
       "\n",
       "           screen_name statuses_count verified average_tweets_per_day  \\\n",
       "20963       blerdwords          11448    False                  4.336   \n",
       "6331   JennyJohnsonHi5          18732     True                  4.694   \n",
       "17209        AbnerJosh            161    False                  0.043   \n",
       "23964     GlobalRegina         103379     True                 27.865   \n",
       "30569      OfficialKat           4980     True                  1.191   \n",
       "\n",
       "      account_age_days  EVENT_LABEL ENTITY_TYPE  \n",
       "20963             2640            0        user  \n",
       "6331              3991            0        user  \n",
       "17209             3726            1        user  \n",
       "23964             3710            0        user  \n",
       "30569             4180            0        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21\n",
      "(29950, 21)\n",
      "========= \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for key in all_keys:\n",
    "    obj = FraudDatasetBenchmark(key=key,  \n",
    "                                load_pre_downloaded=True,\n",
    "                                delete_downloaded=False,\n",
    "                                add_random_values_if_real_na = { \"EVENT_TIMESTAMP\": False, \n",
    "                                                                \"LABEL_TIMESTAMP\": False,\n",
    "                                                                \"ENTITY_ID\": False,\n",
    "                                                                \"ENTITY_TYPE\": True,\n",
    "                                                                \"EVENT_ID\": True})\n",
    "    print(obj.key)\n",
    "    print('Train set: ')\n",
    "    display(obj.train.head())\n",
    "    print(len(obj.train.columns))\n",
    "    print(obj.train.shape)\n",
    "    print('=========','\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# End"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: scripts/reproducibility/afd/README.md
================================================
## Steps to reproduce AFD models
Amazon Fraud Detector (AFD) models can be run either via the AWS Console or through API calls. In this folder, we provide scripts that make API calls to create the model artifacts and then score the model on test data.

The high-level steps to train and deploy a model are:

![afd steps](../../../images/afd_steps.png)

You can use the provided scripts to replicate the performance shown in the benchmark.

1. Set up AWS credentials in your terminal for the AWS account where you want to run AFD and store the data. You can use environment variables as described [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html).


2. Use the [template data-loader notebook](../../examples/Test_FDB_Loader.ipynb) to upload the benchmark data to S3. (AFD requires the data to be stored in S3 and referenced by an S3 path.)


3. Create the AFD resources, including entities, event types, and the model. Update the values of `IAM_ROLE`, `BUCKET`, `KEY` and `MODEL_NAME` in `create_afd_resources.py`, then run the following.

```
python create_afd_resources.py configs/{dataset-you-want-to-use}
```

You can set `MODEL_TYPE` to **ONLINE_FRAUD_INSIGHTS** or **TRANSACTION_FRAUD_INSIGHTS** to run the corresponding model.
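
To make the role of the config file more concrete, the sketch below builds one argument dictionary per entry in `variable_mappings`, in the shape accepted by the `create_variable` call of the boto3 `frauddetector` client. It is an illustration only and not the code in `create_afd_resources.py`: the helper name `build_variable_requests`, the `dataSource` value, and the default values are all assumptions.

```python
import json

# Assumed placeholder defaults per AFD data type; the real script may use others.
ASSUMED_DEFAULTS = {
    "FLOAT": "0.0",
    "INTEGER": "0",
    "STRING": "<unknown>",
    "BOOLEAN": "false",
}


def build_variable_requests(config):
    """Return one kwargs dict per variable in an FDB AFD config (hypothetical helper)."""
    requests = []
    for mapping in config["variable_mappings"]:
        requests.append({
            "name": mapping["variable_name"],
            "variableType": mapping["variable_type"],
            "dataType": mapping["data_type"],
            "dataSource": "EVENT",  # assumption: variables come from the event payload
            "defaultValue": ASSUMED_DEFAULTS[mapping["data_type"]],
        })
    return requests


# Inline excerpt mirroring the structure of the files under configs/.
example_config = json.loads("""
{
    "dataset": "Credit Card Fraud Detection",
    "variable_mappings": [
        {"variable_name": "v1", "variable_type": "NUMERIC", "data_type": "FLOAT"},
        {"variable_name": "amount", "variable_type": "NUMERIC", "data_type": "FLOAT"}
    ],
    "label_mappings": {"FRAUD": ["1"], "LEGIT": ["0"]}
}
""")

for request in build_variable_requests(example_config):
    print(request["name"], request["variableType"], request["dataType"])
```

With AWS credentials configured, each dictionary could be passed on as `client.create_variable(**request)`, where `client = boto3.client("frauddetector")`.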

Running the script starts automatic model training. Wait about 1 hour for the model to train. You can check the status in the AWS Console.
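
Instead of checking the console by hand, the wait can be scripted. The following is a minimal polling sketch, not part of the provided scripts: `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning a status string, and the status names used here (`TRAINING_IN_PROGRESS`, `TRAINING_COMPLETE`, `ERROR`) are assumed to match what AFD reports for a model version.

```python
import time


def wait_for_status(get_status, done_states, failed_states=("ERROR",),
                    poll_seconds=60, timeout_seconds=2 * 60 * 60,
                    sleep=time.sleep):
    """Poll get_status() until it returns a state in done_states.

    get_status is any zero-argument callable returning a status string.
    Raises RuntimeError on a failed state and TimeoutError after the timeout.
    """
    waited = 0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"training failed with status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for a real status lookup, so the sketch runs without AWS access.
fake_statuses = iter(
    ["TRAINING_IN_PROGRESS", "TRAINING_IN_PROGRESS", "TRAINING_COMPLETE"]
)
final = wait_for_status(lambda: next(fake_statuses),
                        done_states={"TRAINING_COMPLETE"},
                        sleep=lambda seconds: None)
print(final)
```

In a real run, `get_status` could wrap the boto3 `frauddetector` client's `get_model_version` call and return the `status` field of its response.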

4. Create a detector and use it to score the test data. Update the values of `IAM_ROLE`, `BUCKET`, `TEST_PATH`, `TEST_LABELS_PATH` and `MODEL_NAME` in `score_afd_model.py`, then run the following.

```
python score_afd_model.py
```
This will print the performance metrics in the terminal and also save them to the S3 location you provide in the script.
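
The exact metrics reported by the script are not listed here. As one example of a threshold-free metric often used for fraud models, the sketch below computes the area under the ROC curve from test labels and model scores using only the standard library. The function name `roc_auc` and the sample values are illustrative assumptions, not output from `score_afd_model.py`.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    labels: sequence of 0/1 values (1 = fraud).
    scores: sequence of numbers where higher means more suspicious.
    Tied scores receive the average rank of their group.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of equal scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one fraud and one legit example")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Illustrative values only: four events with made-up labels and scores.
example_labels = [0, 0, 1, 1]
example_scores = [100, 400, 350, 800]
print(roc_auc(example_labels, example_scores))  # 0.75
```

A value of 1.0 means every fraud event is scored above every legitimate one, while 0.5 corresponds to random ordering.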

After model training is complete, the AFD console shows performance metrics like the following (trained on `ieeecis` with ONLINE_FRAUD_INSIGHTS).

![ieee ofi sample](../../../images/ieee_ofi_sample.png)


**For a deep dive into how Amazon Fraud Detector works, see the [technical guide](https://d1.awsstatic.com/fraud-detector/afd-technical-guide-detecting-new-account-fraud.pdf).**


================================================
FILE: scripts/reproducibility/afd/configs/CreditCardFraudDetection.json
================================================
{
    "dataset": "Credit Card Fraud Detection",
    "variable_mappings": [
        {
            "variable_name": "v1",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v2",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v3",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v4",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v5",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v6",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v7",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v8",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v9",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v10",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v11",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v12",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v13",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v14",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v15",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v16",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v17",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v18",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v19",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v20",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v21",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v22",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v23",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v24",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v25",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v26",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v27",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v28",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "amount",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/FakeJobPostingPrediction.json
================================================
{
    "dataset": "Fake Job Posting Prediction", 
    "variable_mappings": [
        {
            "variable_name": "title",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "location",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "department",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "salary_range",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "company_profile",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "description",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "requirements",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "benefits",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "telecommuting",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "has_company_logo",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "has_questions",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "employment_type",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "required_experience",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "required_education",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "industry",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "function",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/Fraudecommerce.json
================================================
{
    "dataset": "Fraud ecommerce",
    "variable_mappings": [
        {
            "variable_name": "purchase_value",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "source",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "browser",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "age",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "ip_address",
            "variable_type": "IP_ADDRESS",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "time_since_signup",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/IEEECISFraudDetection.json
================================================
{
    "dataset": "IEEE-CIS Fraud Detection",
    "variable_mappings": [
        {
            "variable_name": "transactionamt",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "productcd",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "card1",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "card2",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "card3",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "card5",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "card6",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "addr1",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "dist1",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "p_emaildomain",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "r_emaildomain",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "c1",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c2",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c4",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c5",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c6",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c7",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c8",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c9",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c10",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c11",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c12",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c13",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "c14",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v62",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v70",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v76",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v78",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v82",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v91",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v127",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v130",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v139",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v160",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v165",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v187",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v203",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v207",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v209",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v210",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v221",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v234",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v257",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v258",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v261",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v264",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v266",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v267",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v271",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v274",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v277",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v283",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v285",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v289",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v291",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "v294",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_01",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_02",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_05",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_06",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_09",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_13",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_17",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_19",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "id_20",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "devicetype",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "deviceinfo",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/IPBlocklist.json
================================================
{
    "dataset": "IP-BlockList",
    "variable_mappings": [
        {
            "variable_name": "ip",
            "variable_type": "IP_ADDRESS",
            "data_type": "STRING"
        },
        {
            "variable_name": "dummy_cat",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/MaliciousURL.json
================================================
{
    "dataset": "Malicious URLs Dataset",
    "variable_mappings": [
        {
            "variable_name": "url",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "dummy_cat",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "malignant"
        ],
        "LEGIT": [
            "benign"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/SimulatedCreditCardTransactionsSparkov.json
================================================
{
    "dataset": "Simulated Credit Card Transactions generated using Sparkov",
    "variable_mappings": [
        {
            "variable_name": "cc_num",
            "variable_type": "CARD_BIN",
            "data_type": "INTEGER"
        },
        {
            "variable_name": "category",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "amt",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "first",
            "variable_type": "BILLING_NAME",
            "data_type": "STRING"
        },
        {
            "variable_name": "last",
            "variable_type": "BILLING_NAME",
            "data_type": "STRING"
        },
        {
            "variable_name": "gender",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "street",
            "variable_type": "BILLING_ADDRESS_L1",
            "data_type": "STRING"
        },
        {
            "variable_name": "city",
            "variable_type": "BILLING_CITY",
            "data_type": "STRING"
        },
        {
            "variable_name": "state",
            "variable_type": "BILLING_STATE",
            "data_type": "STRING"
        },
        {
            "variable_name": "zip",
            "variable_type": "BILLING_ZIP",
            "data_type": "STRING"
        },
        {
            "variable_name": "lat",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "long",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "city_pop",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "job",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "dob",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "merch_lat",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "merch_long",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/TwitterBotAccounts.json
================================================
{
    "dataset": "Twitter Bots Accounts",
    "variable_mappings": [
        {
            "variable_name": "default_profile",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "default_profile_image",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "description",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "favourites_count",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "followers_count",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "friends_count",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "geo_enabled",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "lang",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "location",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "profile_background_image_url",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "profile_image_url",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "screen_name",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "statuses_count",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "verified",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "average_tweets_per_day",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "account_age_days",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "bot"
        ],
        "LEGIT": [
            "human"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/configs/VehicleLoanDefaultPrediction.json
================================================
{
    "dataset": "Vehicle Loan Default Prediction",
    "variable_mappings": [
        {
            "variable_name": "disbursed_amount",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "asset_cost",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "ltv",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "branch_id",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "supplier_id",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "manufacturer_id",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "current_pincode_id",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "date_of_birth",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "employment_type",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "state_id",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "employee_code_id",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "mobileno_avl_flag",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "aadhar_flag",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "pan_flag",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "voterid_flag",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "driving_flag",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "passport_flag",
            "variable_type": "CATEGORICAL",
            "data_type": "STRING"
        },
        {
            "variable_name": "perform_cns_score",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "perform_cns_score_description",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "pri_no_of_accts",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "pri_active_accts",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "pri_overdue_accts",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "pri_current_balance",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "pri_sanctioned_amount",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "pri_disbursed_amount",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_no_of_accts",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_active_accts",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_overdue_accts",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_current_balance",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_sanctioned_amount",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_disbursed_amount",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "primary_instal_amt",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "sec_instal_amt",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "new_accts_in_last_six_months",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "delinquent_accts_in_last_six_months",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "average_acct_age",
            "variable_type": "FREE_FORM_TEXT",
            "data_type": "STRING"
        },
        {
            "variable_name": "credit_history_length",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        },
        {
            "variable_name": "no_of_inquiries",
            "variable_type": "NUMERIC",
            "data_type": "FLOAT"
        }
    ],
    "label_mappings": {
        "FRAUD": [
            "1"
        ],
        "LEGIT": [
            "0"
        ]
    }
}

================================================
FILE: scripts/reproducibility/afd/create_afd_resources.py
================================================
# TO BE UPDATED BY USER
IAM_ROLE = "<IAM ROLE with access to S3 bucket containing the data and access to Amazon Fraud Detector>"
BUCKET = "<S3 bucket containing the data>"
KEY = "<Path of S3 file containing train from FDB data loader>"
MODEL_NAME = "<Model name that you want to give>"  # lower case alphanumeric only, only _ allowed as delimiter
MODEL_TYPE     = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS

import os
import time
import json
import boto3
import click
import string
import random
import logging
import pandas as pd


MODEL_DESC     = "Benchmarking model"
EVENT_DESC     = "Event for benchmarking model"
ENTITY_TYPE    = "user"  # this is provided in the dummy data. Will need to change if using different data
ENTITY_DESC    = "Entity for benchmarking model"

BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME  # Others are kept same as model name
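
# --- Illustrative sketch (not part of the original script) ---
# The batch-import step further down polls get_batch_import_jobs once a minute
# with no upper bound. A bounded variant might look like the helper below.
# `wait_for_batch_import`, `max_polls`, and the injectable `sleep` argument are
# hypothetical names introduced here; only get_batch_import_jobs and the
# 'batchImports' / 'status' response fields are taken from the script itself.
import time  # repeated so the sketch is self-contained


def wait_for_batch_import(fd_client, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll until the import job leaves an IN_PROGRESS state and return its final status."""
    for _ in range(max_polls):
        response = fd_client.get_batch_import_jobs(jobId=job_id)
        status = response['batchImports'][0]['status']
        if 'IN_PROGRESS' not in status:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"batch import {job_id} still in progress after {max_polls} polls")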

# boto3 connections
client = boto3.client('frauddetector') 
s3 = boto3.client('s3')
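
# --- Illustrative sketch (not part of the original script) ---
# Every file under configs/ has the same shape: a "dataset" name, a
# "variable_mappings" list whose entries carry variable_name / variable_type /
# data_type, and a "label_mappings" dict with FRAUD and LEGIT value lists.
# The hypothetical helper below checks that shape before any AFD resources are
# created. The name `check_config` and its list-of-problems return value are
# assumptions made for illustration only.
def check_config(config_file):
    """Return a list of human-readable problems found in a loaded config dict."""
    problems = []
    for key in ("dataset", "variable_mappings", "label_mappings"):
        if key not in config_file:
            problems.append(f"missing top-level key: {key}")
    for i, variable in enumerate(config_file.get("variable_mappings", [])):
        for field in ("variable_name", "variable_type", "data_type"):
            if field not in variable:
                problems.append(f"variable_mappings[{i}] is missing {field}")
    labels = config_file.get("label_mappings", {})
    for label in ("FRAUD", "LEGIT"):
        if not labels.get(label):
            problems.append(f"label_mappings has no values for {label}")
    return problems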

@click.command()
@click.argument("config", type=click.Path(exists=True))
def afd_train_model_demo(config):
    
    #############################################
    #####               Setup               #####
    with open(config, "r") as f:
        config_file = json.load(f)
    
    
    EVENT_VARIABLES = [variable["variable_name"] for variable in config_file["variable_mappings"]]
    EVENT_LABELS = [v for k,v in config_file["label_mappings"].items()]
    EVENT_LABELS = [item for sublist in EVENT_LABELS for item in sublist]  # flattening list of lists

    # Variable mappings of demo data in this use case.  Important to teach this to customer
    click.echo(f'{pd.DataFrame(config_file["variable_mappings"])}')
    click.echo(f'{pd.DataFrame(config_file["label_mappings"])}')

    # Build the S3 URI with "/" explicitly; os.path.join would use "\" on Windows.
    S3_DATA_PATH = f"s3://{BUCKET}/{KEY}"
       
    #############################################
    ##### Create event variables and labels #####
    
    # -- create variable  --
    for variable in config_file["variable_mappings"]:
        
        DEFAULT_VALUE = '0.0' if variable["data_type"] == "FLOAT" else '<null>'
        
        try:
            resp = client.get_variables(name = variable["variable_name"])
            click.echo("{0} exists, data type: {1}".format(variable["variable_name"], resp['variables'][0]['dataType']))
        except Exception:  # variable not found (or lookup failed): create it
            click.echo("Creating variable: {0}".format(variable["variable_name"]))
            resp = client.create_variable(
                    name         = variable["variable_name"],
                    dataType     = variable["data_type"],
                    dataSource   ='EVENT',
                    defaultValue = DEFAULT_VALUE, 
                    description  = variable["variable_name"],
                    variableType = variable["variable_type"])
    # Putting FRAUD
    for f in config_file["label_mappings"]["FRAUD"]:
        response = client.put_label(
            name = f,
            description = "FRAUD")
    # Putting LEGIT
    for f in config_file["label_mappings"]["LEGIT"]:
        response = client.put_label(
            name = f,
            description = "LEGIT")

    #############################################
    #####   Define Entity and Event Types   #####
    
    # -- create entity type --
    try:
        response = client.get_entity_types(name = ENTITY_TYPE)
        click.echo("-- entity type exists --")
        click.echo(response)
    except Exception:  # entity type not found (or lookup failed): create it
        response = client.put_entity_type(
            name        = ENTITY_TYPE,
            description = ENTITY_DESC
        )
        click.echo("-- create entity type --")
        click.echo(response)


    # -- create event type --
    try:
        response = client.get_event_types(name = EVENT_TYPE)
        click.echo("\n-- event type exists --")
        click.echo(response)
    except Exception:  # event type not found (or lookup failed): create it
        response = client.put_event_type (
            name           = EVENT_TYPE,
            eventVariables = EVENT_VARIABLES,
            labels         = EVENT_LABELS,
            entityTypes    = [ENTITY_TYPE])
        click.echo("\n-- create event type --")
        click.echo(response)

    #############################################
    #####   Batch import training file for TFI  #####
    if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS":
        try:
            response = client.create_batch_import_job(
                jobId = BATCH_PREDICTION_JOB,
                inputPath = S3_DATA_PATH,
                outputPath = "s3://" + BUCKET,
                eventTypeName = EVENT_TYPE,
                iamRoleArn = IAM_ROLE
            )   
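        except Exception as e:
            # The job may already exist from a previous run; report it and fall through to polling.
            print(f"create_batch_import_job did not start a new job: {e}")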

        # -- wait until batch import is finished --
        print("--- waiting until batch import is finished ")
        stime = time.time()
        while True:
            response = client.get_batch_import_jobs(jobId=BATCH_PREDICTION_JOB)
            if 'IN_PROGRESS' in response['batchImports'][0]['status']:
                print(f"current progress: {(time.time() - stime)/60:3.3} minutes")
                time.sleep(60)  # sleep for 1 minute 
            else:
                print("Batch import status: " + response['batchImports'][0]['status'])
                break

        etime = time.time()
        print(f"Elapsed time: {(etime - stime)/60:3.3} minutes \n")
        print(response)


    #############################################
    #####   Create and train your model     #####
    try:
        response = client.create_model(
           description   = MODEL_DESC,
           eventTypeName = EVENT_TYPE,
           modelId       = MODEL_NAME,
           modelType     = MODEL_TYPE)
        click.echo("-- initialize model --")
        click.echo(response)
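    except Exception as e:
        # The model may already exist from a previous run; report it and continue to versioning.
        click.echo(f"create_model did not create a new model: {e}")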
    
    # -- the model is initialized and is now ready to train --
    
    # -- first define training_data_schema for model to use --

    
    if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS": 
        training_data_schema = {
            'modelVariables' : EVENT_VARIABLES,
            'labelSchema'    : {
                'labelMapper' : config_file["label_mappings"],
                'unlabeledEventsTreatment': 'IGNORE'
            }
        }
        response = client.create_model_version(
            modelId             = MODEL_NAME,
            modelType           = MODEL_TYPE,
            trainingDataSource  = 'INGESTED_EVENTS',
            trainingDataSchema  = training_data_schema,
            ingestedEventsDetail={  # This needs to be changed
                'ingestedEventsTimeWindow': {
                    'startTime': '2020-12-10T00:00:00Z', # '2021-08-28T00:00:00Z',
                    'endTime': '2022-06-07T00:00:00Z'  #'2022-05-10T00:00:00Z'
                }
            }
        )
    else:
        training_data_schema = {
            'modelVariables' : EVENT_VARIABLES,
            'labelSchema'    : {
                'labelMapper' : config_file["label_mappings"]
            }
        }
        response = client.create_model_version(
            modelId             = MODEL_NAME,
            modelType           = MODEL_TYPE,
            trainingDataSource  = 'EXTERNAL_EVENTS',
            trainingDataSchema  = training_data_schema,
            externalEventsDetail = {
                'dataLocation'     : S3_DATA_PATH,
                'dataAccessRoleArn': IAM_ROLE
            }
        )
    model_version = response['modelVersionNumber']
    click.echo("-- model training --")
    click.echo(response)


if __name__=="__main__":
    afd_train_model_demo()


================================================
FILE: scripts/reproducibility/afd/score_afd_model.py
================================================
# TO BE UPDATED BY USER
IAM_ROLE = "<IAM ROLE with access to S3 bucket containing the data and access to Amazon Fraud Detector>"
BUCKET = "<S3 BUCKET>"
TEST_PATH = "<Path of S3 file containing test from FDB data loader>"
TEST_LABELS_PATH = "<Path of S3 file containing test_labels from FDB data loader>"
MODEL_NAME = "<Name of trained model to be used for scoring on the test data>"  # lower case alphanumeric only, only _ allowed as delimiter
MODEL_TYPE     = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS

import os
import ast
import time
import json
import boto3
import click
import string
import random
import logging
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc

# boto3 connections
client = boto3.client('frauddetector') 
s3 = boto3.client('s3')

BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME
model_version = '1.0'
DETECTOR_DESC = "Benchmarking detector"


def create_outcomes(outcomes):
    """ 
    Create Fraud Detector Outcomes 
    """   
    for outcome in outcomes:
        print("creating outcome variable: {0} ".format(outcome))
        response = client.put_outcome(name = outcome, description = outcome)


def create_rules(score_cuts, outcomes):
    """
    Creating rules 
    
    Arguments:
        score_cuts  - list of score cuts to create rules
        outcomes    - list of outcomes associated with the rules
    
    Returns:
        a rule list to use when creating the detector
    """
    
    if len(score_cuts)+1 != len(outcomes):
        logging.error('Your score cuts and outcomes do not match.')
    
    rule_list = []
    for i in range(len(outcomes)):
        # rule expression
        if i < (len(outcomes)-1):
            rule = "${0}_insightscore > {1}".format(MODEL_NAME,score_cuts[i])
        else:
            rule = "${0}_insightscore <= {1}".format(MODEL_NAME,score_cuts[i-1])
    
        # append to rule_list (used when creating the detector version)
        rule_id = "rules{0}_{1}".format(i, MODEL_NAME[:9])
        
        rule_list.append({
            "ruleId": rule_id, 
            "ruleVersion" : '1',
            "detectorId"  : DETECTOR_NAME
        })
        
        # create rules
        print("creating rule: {0}: IF {1} THEN {2}".format(rule_id, rule, outcomes[i]))
        try:
            response = client.create_rule(
                ruleId = rule_id,
                detectorId = DETECTOR_NAME,
                expression = rule,
                language = 'DETECTORPL',
                outcomes = [outcomes[i]]
                )
        except Exception:
            print("this rule already exists in this detector")
            
    return rule_list


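def ast_with_nan(x):
    """Parse a Python-literal string; return NaN if it cannot be parsed."""
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        return np.nan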


def afd_train_model_demo():

    # -- activate the model version --
    try:
        response = client.update_model_version_status (
            modelId            = MODEL_NAME,
            modelType          = MODEL_TYPE,
            modelVersionNumber = model_version,
            status             = 'ACTIVE'
        )
        print("-- activating model --")
        print(response)
    except Exception:
        print("First train the model")
    
    # -- wait until model is active --
    print("--- waiting until model status is active ")
    stime = time.time()
    while True:
        response = client.get_model_version(modelId=MODEL_NAME, modelType = MODEL_TYPE, modelVersionNumber = model_version)
        if response['status'] != 'ACTIVE':
            print(response['status'])
            print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes")
            time.sleep(60)  # sleep for 1 minute 
        if response['status'] == 'ACTIVE':
            print("Model status : " +  response['status'])
            break
            
    etime = time.time()
    print("Elapsed time : %s" % (etime - stime) + " seconds \n"  )
    print(response)

    # -- put detector, initializes your detector -- 
    response = client.put_detector(
        detectorId    = DETECTOR_NAME, 
        description   = DETECTOR_DESC,
        eventTypeName = EVENT_TYPE )

    # -- decide what threshold and corresponding outcome you want to add -- 
    # here, we create two simple rules by cutting the score at [750], with two outcomes ['fraud', 'approve']
    # it will create 2 rules:
    #    score > 750: fraud
    #    score <= 750: approve

    score_cuts = [750]                          # recommended to fine tune this based on your business use case
    outcomes = ['fraud', 'approve']  # recommended to define this based on your business use case

    # -- create outcomes -- 
    print(" -- create outcomes --")
    create_outcomes(outcomes)

    # -- create rules --
    print(" -- create rules --")
    rule_list = create_rules(score_cuts, outcomes)

    # -- create detector version --
    response = client.create_detector_version(
        detectorId    = DETECTOR_NAME,
        rules         = rule_list,
        modelVersions = [{"modelId": MODEL_NAME, 
                        "modelType": MODEL_TYPE,
                        "modelVersionNumber": model_version}],
        # there are 2 options for ruleExecutionMode:
        #   'ALL_MATCHED'    - return all matched rules' outcome
        #   'FIRST_MATCHED'  - return first matched rule's outcome
        ruleExecutionMode = 'FIRST_MATCHED'
    )

    print("\n -- detector created -- ")
    print(response) 

    response = client.update_detector_version_status(
        detectorId        = DETECTOR_NAME,
        detectorVersionId = '1',
        status            = 'ACTIVE'
    )
    print("\n -- detector activated -- ")
    print(response)

    # -- wait until detector is active --
    print("\n --- waiting until detector status is active ")
    stime = time.time()
    while True:
        response = client.describe_detector(
            detectorId        = DETECTOR_NAME,
        )
        if response['detectorVersionSummaries'][0]['status'] != 'ACTIVE':
            print(response['detectorVersionSummaries'][0]['status'])
            print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes")
            time.sleep(60)
        if response['detectorVersionSummaries'][0]['status'] == 'ACTIVE':
            break
    etime = time.time()
    print("Elapsed time : %s" % (etime - stime) + " seconds \n"  )
    print(response)

    # -- create detector evaluation --
    try:
        client.create_batch_prediction_job (
        jobId = BATCH_PREDICTION_JOB,
        inputPath = os.path.join('s3://', BUCKET, TEST_PATH),
        outputPath =os.path.join('s3://', BUCKET),
        eventTypeName = EVENT_TYPE,
        detectorName = DETECTOR_NAME,
        detectorVersion = '1',
        iamRoleArn = IAM_ROLE)
    except Exception as e:
        print(e)
        print("batch prediction job already exists")

    # -- wait until batch prediction job is completed --
    print("\n --- waiting until batch prediction job is completed ")
    stime = time.time()
    while True:
        response = client.get_batch_prediction_jobs(jobId=BATCH_PREDICTION_JOB)
        response = response['batchPredictions'][0]
        if response['status'] in ('COMPLETE', 'FAILED'):
            # stop polling once the job reaches a terminal status
            print("Batch prediction status : " + response['status'])
            break
        print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes")
        time.sleep(60)
    etime = time.time()
    print("Elapsed time : %s" % (etime - stime) + " seconds \n"  )
    print(response)

    # -- get batch prediction job result --
    contents = s3.list_objects_v2(Bucket=BUCKET, Prefix=os.path.join(TEST_PATH))['Contents']
    print(contents)
    S3_SCORE_PATH = sorted([c['Key'] for c in contents if c['Key'].endswith('output.csv')])[-1]
    print(S3_SCORE_PATH)

    # -- get test performance --
    # Predictions
    print(os.path.join('s3://', BUCKET, S3_SCORE_PATH))
    predictions = pd.read_csv(os.path.join('s3://', BUCKET, S3_SCORE_PATH))
    predictions = predictions.copy()[~predictions.MODEL_SCORES.isna()]

    predictions['scores'] = predictions['MODEL_SCORES'].\
    apply(lambda x: ast_with_nan(x)).\
    apply(lambda x: x.get(MODEL_NAME))

    # Labels
    labels = pd.read_csv(os.path.join('s3://', BUCKET, TEST_LABELS_PATH))
#     labels['EVENT_LABEL'] = labels['EVENT_LABEL'].map({'benign': 0, 'malignant': 1})
    predictions = predictions.merge(labels, on='EVENT_ID', how='left')
    print('Test size: ', predictions.shape)

    fpr, tpr, threshold = roc_curve(predictions['EVENT_LABEL'], predictions['scores'])
    test_auc = auc(fpr,tpr)
    print('AUC: ', test_auc)

    test_metrics = {}
    test_metrics['auc'] = test_auc
    test_metrics['fpr'] = list(fpr)
    test_metrics['tpr'] = list(tpr)
    test_metrics['threshold'] = list(threshold)

    # -- put test metrics in s3 --
    s3.put_object(
        Body=json.dumps(test_metrics), 
        Bucket=BUCKET, 
        Key='test_metrics.json') 

    print("\n -- test metrics saved -- ")

if __name__ == "__main__":
    afd_train_model_demo()

        
================================================
FILE: scripts/reproducibility/autogluon/README.md
================================================
 - benchmark_ag.py: a script for autogluon benchmarking
 - example-ag-ieeecis.ipynb: an example notebook using benchmark_ag.py 
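
As a rough guide to where `benchmark_ag.py` writes its results, the sketch below mirrors the file-naming logic in `run_ag`. `expected_outputs` is a hypothetical helper that is not part of this repository, and the base path and dataset name are placeholders. The timestamped model folder is left out because its name depends on when the run starts.

```python
def expected_outputs(base_path, dataset, presets=None,
                     hyperparameters=None, feature_metadata="infer"):
    """Return the leaderboard and metrics paths that run_ag would write.

    Hypothetical helper: it only reproduces the suffix logic of run_ag.
    """
    suffix = (f"_{presets}" if presets is not None else "") \
        + (f"_{hyperparameters}" if hyperparameters is not None else "") \
        + ("_feature_metadata" if feature_metadata != "infer" else "")
    return {
        "leaderboard": f"{base_path}/{dataset}/leaderboard{suffix}.csv",
        "metrics": f"{base_path}/{dataset}/test_metrics_ag{suffix}.joblib",
    }


# Placeholder path and dataset name, for illustration only.
paths = expected_outputs("/data/fdb", "ieeecis", presets="best_quality")
print(paths["leaderboard"])
print(paths["metrics"])
```

The metrics file is written with `joblib.dump`, so it can be read back with `joblib.load`.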

Note that autogluon is not perfectly reproducible because some underlying models are not deterministically seeded, so you might see slightly different results from those in the paper.
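
Because results can differ between runs, one option is to repeat a run a few times and report the spread rather than a single number. The following is a minimal sketch under that assumption. `summarize_runs` is a hypothetical helper, and the AUC values are placeholders, not results from the paper or from this repository.

```python
import statistics


def summarize_runs(aucs):
    """Return the mean, sample standard deviation, and range of test AUCs."""
    return {
        "mean": statistics.mean(aucs),
        # stdev needs at least two values; report 0.0 for a single run
        "std": statistics.stdev(aucs) if len(aucs) > 1 else 0.0,
        "min": min(aucs),
        "max": max(aucs),
    }


# Placeholder values standing in for AUCs collected from repeated runs.
placeholder_aucs = [0.90, 0.91, 0.89]
summary = summarize_runs(placeholder_aucs)
print(f"AUC {summary['mean']:.3f} +/- {summary['std']:.3f} "
      f"(range {summary['min']:.3f} to {summary['max']:.3f})")
```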


================================================
FILE: scripts/reproducibility/autogluon/benchmark_ag.py
================================================
import pandas as pd
import os
import gc
import joblib
import datetime

import matplotlib as mpl
from sklearn.metrics import roc_auc_score, roc_curve

mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)

import sys
sys.path.append('../')
from benchmark_utils import load_data, get_recall

from autogluon.tabular import TabularPredictor

def run_ag(dataset, base_path, time_limit=3600, presets=None, hyperparameters=None, feature_metadata='infer', verbosity=2):
    gc.collect()
    features, df_train, df_test = load_data(dataset, base_path)

    dateTimeObj = datetime.datetime.now()
    timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S")
    
    suffix = (f"_{presets}" if presets is not None else "") \
            + (f"_{hyperparameters}" if hyperparameters is not None else "") \
            + ("_feature_metadata" if feature_metadata != 'infer' else "") 
    folder = f"ag-{timestampStr}" \
            + suffix

    predictor = TabularPredictor(label='EVENT_LABEL', eval_metric='roc_auc', path=f"{base_path}/{dataset}/AutogluonModels/{folder}/", 
                                 verbosity=verbosity)
    predictor.fit(df_train[features + ['EVENT_LABEL'] ], 
                  time_limit=time_limit, presets=presets, hyperparameters=hyperparameters, feature_metadata=feature_metadata)

    leaderboard = predictor.leaderboard(df_test[features + ['EVENT_LABEL'] ])

    leaderboard_file = "leaderboard" \
                        + suffix \
                        + ".csv"
    leaderboard.to_csv(f"{base_path}/{dataset}/{leaderboard_file}", index=False)
    
    df_pred = predictor.predict_proba(df_test[features], as_multiclass=False)
    
    auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred)
    logger.info(f"auc on test data: {auc}")
    pos_label = predictor.positive_class
    fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred, 
                                     pos_label=pos_label)
    
    y_true = df_test['EVENT_LABEL']
    y_true = (y_true==pos_label)
    
    recall = get_recall(fpr, tpr, fpr_target=0.01)
    logger.info(f"tpr@1%fpr on test data: {recall}")
    
    test_metrics_ag_bq = {
        "labels": df_test['EVENT_LABEL'],
        "pred_prob": df_pred,
        "auc": auc,
        "tpr@1%fpr": recall,
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds
    }
    metrics_file = "test_metrics_ag" \
                    + suffix \
                    + ".joblib"
    joblib.dump(test_metrics_ag_bq, f"{base_path}/{dataset}/{metrics_file}")

================================================
FILE: scripts/reproducibility/autogluon/example-ag-ieeecis.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7d350d0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1d6a8c41",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container { width:90% }</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "from IPython.display import clear_output\n",
    "display(HTML(\"<style>.container { width:90% }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "611127d9",
   "metadata": {},
   "source": [
    "## Step 1: pip install required packages if not installed already"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "321cb018",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install autogluon\n",
    "import benchmark_ag\n",
    "from benchmark_ag import load_data, run_ag"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d191102",
   "metadata": {},
   "source": [
    "## Step 2: download data using fdb\n",
    "Example: https://github.com/amazon-research/fraud-dataset-benchmark/blob/main/scripts/examples/Test_FDB_Loader.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "33fd8a7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is where datasets are stored: {BASE_PATH}/{dataset}/\n",
    "BASE_PATH = \"/home/ec2-user/SageMaker/official-dataset-names\"\n",
    "dataset = \"IEEE-CIS Fraud Detection\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4bca656",
   "metadata": {},
   "source": [
    "Make sure three files are downloaded:\n",
    "1. {BASE_PATH}/{dataset}/train.csv\n",
    "2. {BASE_PATH}/{dataset}/test.csv\n",
    "3. {BASE_PATH}/{dataset}/test_labels.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b3fcdb",
   "metadata": {},
   "source": [
    "## Step 3: look at data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b07893d7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n",
      "INFO: benchmark_utils.py: (313060, 194)\n",
      "INFO: benchmark_utils.py: (27330, 71)\n",
      "INFO: benchmark_utils.py: (29527, 2)\n",
      "INFO: benchmark_utils.py: (27329, 72)\n",
      "INFO: benchmark_utils.py: 67\n",
      "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n"
     ]
    }
   ],
   "source": [
    "features, df_train, df_test = load_data(dataset, BASE_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8cad73e3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>addr2</th>\n",
       "      <th>dist1</th>\n",
       "      <th>dist2</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>d1</th>\n",
       "      <th>d2</th>\n",
       "      <th>d3</th>\n",
       "      <th>d4</th>\n",
       "      <th>d5</th>\n",
       "      <th>d10</th>\n",
       "      <th>d11</th>\n",
       "      <th>d15</th>\n",
       "      <th>m1</th>\n",
       "      <th>m2</th>\n",
       "      <th>m3</th>\n",
       "      <th>m4</th>\n",
       "      <th>m6</th>\n",
       "      <th>m7</th>\n",
       "      <th>m8</th>\n",
       "      <th>m9</th>\n",
       "      <th>v1</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v6</th>\n",
       "      <th>v8</th>\n",
       "      <th>v11</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v17</th>\n",
       "      <th>v20</th>\n",
       "      <th>v23</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v30</th>\n",
       "      <th>v36</th>\n",
       "      <th>v37</th>\n",
       "      <th>v40</th>\n",
       "      <th>v41</th>\n",
       "      <th>v44</th>\n",
       "      <th>v47</th>\n",
       "      <th>v48</th>\n",
       "      <th>v54</th>\n",
       "      <th>v56</th>\n",
       "      <th>v59</th>\n",
       "      <th>v62</th>\n",
       "      <th>v65</th>\n",
       "      <th>v67</th>\n",
       "      <th>v68</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v80</th>\n",
       "      <th>v82</th>\n",
       "      <th>v86</th>\n",
       "      <th>v88</th>\n",
       "      <th>v89</th>\n",
       "      <th>v91</th>\n",
       "      <th>v107</th>\n",
       "      <th>v108</th>\n",
       "      <th>v111</th>\n",
       "      <th>v115</th>\n",
       "      <th>v117</th>\n",
       "      <th>v120</th>\n",
       "      <th>v121</th>\n",
       "      <th>v123</th>\n",
       "      <th>v124</th>\n",
       "      <th>v127</th>\n",
       "      <th>v129</th>\n",
       "      <th>v130</th>\n",
       "      <th>v136</th>\n",
       "      <th>v138</th>\n",
       "      <th>v139</th>\n",
       "      <th>v142</th>\n",
       "      <th>v147</th>\n",
       "      <th>v156</th>\n",
       "      <th>v160</th>\n",
       "      <th>v162</th>\n",
       "      <th>v165</th>\n",
       "      <th>v166</th>\n",
       "      <th>v169</th>\n",
       "      <th>v171</th>\n",
       "      <th>v173</th>\n",
       "      <th>v175</th>\n",
       "      <th>v176</th>\n",
       "      <th>v178</th>\n",
       "      <th>v180</th>\n",
       "      <th>v182</th>\n",
       "      <th>v185</th>\n",
       "      <th>v187</th>\n",
       "      <th>v188</th>\n",
       "      <th>v198</th>\n",
       "      <th>v203</th>\n",
       "      <th>v205</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v215</th>\n",
       "      <th>v218</th>\n",
       "      <th>v220</th>\n",
       "      <th>v221</th>\n",
       "      <th>v223</th>\n",
       "      <th>v224</th>\n",
       "      <th>v226</th>\n",
       "      <th>v228</th>\n",
       "      <th>v229</th>\n",
       "      <th>v234</th>\n",
       "      <th>v235</th>\n",
       "      <th>v238</th>\n",
       "      <th>v240</th>\n",
       "      <th>v250</th>\n",
       "      <th>v252</th>\n",
       "      <th>v253</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v260</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v281</th>\n",
       "      <th>v283</th>\n",
       "      <th>v284</th>\n",
       "      <th>v285</th>\n",
       "      <th>v286</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>v296</th>\n",
       "      <th>v297</th>\n",
       "      <th>v301</th>\n",
       "      <th>v303</th>\n",
       "      <th>v305</th>\n",
       "      <th>v307</th>\n",
       "      <th>v309</th>\n",
       "      <th>v310</th>\n",
       "      <th>v314</th>\n",
       "      <th>v320</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_03</th>\n",
       "      <th>id_04</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_10</th>\n",
       "      <th>id_11</th>\n",
       "      <th>id_12</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_15</th>\n",
       "      <th>id_16</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_18</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>id_28</th>\n",
       "      <th>id_29</th>\n",
       "      <th>id_31</th>\n",
       "      <th>id_35</th>\n",
       "      <th>id_36</th>\n",
       "      <th>id_37</th>\n",
       "      <th>id_38</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>68.500</td>\n",
       "      <td>W</td>\n",
       "      <td>13926.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>150.000</td>\n",
       "      <td>142.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>315.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>19.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>14.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.000</td>\n",
       "      <td>12.000</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>M2</td>\n",
       "      <td>T</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>117.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>117.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13926.0_315.0_-13.0</td>\n",
       "      <td>2021-01-02 00:00:00</td>\n",
       "      <td>user</td>\n",
       "      <td>2987000.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>29.000</td>\n",
       "      <td>W</td>\n",
       "      <td>2755.000</td>\n",
       "      <td>404.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>325.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>M0</td>\n",
       "      <td>T</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2755.0_325.0_1.0</td>\n",
       "      <td>2021-01-02 00:00:01</td>\n",
       "      <td>user</td>\n",
       "      <td>2987001.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>59.000</td>\n",
       "      <td>W</td>\n",
       "      <td>4663.000</td>\n",
       "      <td>490.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>debit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>287.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>outlook.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.001</td>\n",
       "      <td>313.999</td>\n",
       "      <td>313.999</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>M0</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4663.0_330.0_1.0</td>\n",
       "      <td>2021-01-02 00:01:09</td>\n",
       "      <td>user</td>\n",
       "      <td>2987002.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>50.000</td>\n",
       "      <td>W</td>\n",
       "      <td>18132.000</td>\n",
       "      <td>567.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>117.000</td>\n",
       "      <td>debit</td>\n",
       "      <td>476.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>25.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>112.000</td>\n",
       "      <td>112.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>92.999</td>\n",
       "      <td>0.000</td>\n",
       "      <td>82.999</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109.999</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>M0</td>\n",
       "      <td>F</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1758.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>354.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>10.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>38.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1758.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>354.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18132.0_476.0_-111.0</td>\n",
       "      <td>2021-01-02 00:01:39</td>\n",
       "      <td>user</td>\n",
       "      <td>2987003.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>50.000</td>\n",
       "      <td>H</td>\n",
       "      <td>4497.000</td>\n",
       "      <td>514.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>420.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>169690.800</td>\n",
       "      <td>0.000</td>\n",
       "      <td>5155.000</td>\n",
       "      <td>2840.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>70787.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100.000</td>\n",
       "      <td>NotFound</td>\n",
       "      <td>NaN</td>\n",
       "      <td>New</td>\n",
       "      <td>NotFound</td>\n",
       "      <td>166.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>542.000</td>\n",
       "      <td>144.000</td>\n",
       "      <td>New</td>\n",
       "      <td>NotFound</td>\n",
       "      <td>samsung browser 6.2</td>\n",
       "      <td>T</td>\n",
       "      <td>F</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>mobile</td>\n",
       "      <td>SAMSUNG SM-G892A Build/NRD90M</td>\n",
       "      <td>4497.0_420.0_1.0</td>\n",
       "      <td>2021-01-02 00:01:46</td>\n",
       "      <td>user</td>\n",
       "      <td>2987004.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL  transactionamt productcd     card1   card2   card3   card5   card6   addr1  addr2   dist1  dist2 p_emaildomain r_emaildomain    c1    c2    c4    c5    c6    c7    c8    c9   c10  \\\n",
       "0            0          68.500         W 13926.000     NaN 150.000 142.000  credit 315.000 87.000  19.000    NaN           NaN           NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000   \n",
       "1            0          29.000         W  2755.000 404.000 150.000 102.000  credit 325.000 87.000     NaN    NaN     gmail.com           NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000   \n",
       "2            0          59.000         W  4663.000 490.000 150.000 166.000   debit 330.000 87.000 287.000    NaN   outlook.com           NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000   \n",
       "3            0          50.000         W 18132.000 567.000 150.000 117.000   debit 476.000 87.000     NaN    NaN     yahoo.com           NaN 2.000 5.000 0.000 0.000 4.000 0.000 0.000 1.000 0.000   \n",
       "4            0          50.000         H  4497.000 514.000 150.000 102.000  credit 420.000 87.000     NaN    NaN     gmail.com           NaN 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000   \n",
       "\n",
       "    c11   c12    c13   c14      d1      d2     d3     d4    d5    d10     d11     d15   m1   m2   m3   m4   m6   m7   m8   m9    v1    v3    v4    v6    v8   v11   v13   v14   v17   v20   v23   v26  \\\n",
       "0 2.000 0.000  1.000 1.000  14.000     NaN 13.000    NaN   NaN 12.000  12.000  -1.000    T    T    T   M2    T  NaN  NaN  NaN 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000   \n",
       "1 1.000 0.000  1.000 1.000   0.000     NaN    NaN -1.000   NaN -1.000     NaN  -1.000  NaN  NaN  NaN   M0    T  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000 1.000 1.000 1.000   \n",
       "2 1.000 0.000  1.000 1.000   0.000     NaN    NaN -1.001   NaN -1.001 313.999 313.999    T    T    T   M0    F    F    F    F 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000   \n",
       "3 1.000 0.000 25.000 1.000 112.000 112.000  0.000 92.999 0.000 82.999     NaN 109.999  NaN  NaN  NaN   M0    F  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000 1.000 0.000 1.000 1.000 1.000   \n",
       "4 1.000 0.000  1.000 1.000   0.000     NaN    NaN    NaN   NaN    NaN     NaN     NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "\n",
       "    v27   v30   v36   v37   v40   v41   v44   v47   v48   v54   v56   v59   v62   v65   v67   v68   v70   v76   v78   v80   v82   v86   v88   v89   v91  v107  v108  v111  v115  v117  v120  v121  \\\n",
       "0 0.000 0.000   NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "1 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "2 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "3 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "4   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "\n",
       "   v123  v124     v127  v129    v130  v136  v138  v139  v142  v147  v156       v160  v162     v165     v166  v169  v171  v173  v175  v176  v178  v180  v182  v185  v187  v188  v198  v203  v205  v207  \\\n",
       "0 1.000 1.000  117.000 0.000   0.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "1 1.000 1.000    0.000 0.000   0.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2 1.000 1.000    0.000 0.000   0.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "3 1.000 1.000 1758.000 0.000 354.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "4 1.000 1.000    0.000 0.000   0.000 0.000 0.000 0.000 0.000 0.000 0.000 169690.800 0.000 5155.000 2840.000 0.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000   \n",
       "\n",
       "   v209  v210  v215  v218  v220  v221  v223  v224  v226  v228  v229  v234  v235  v238  v240  v250  v252  v253  v257  v258  v260  v261  v264  v266  v267  v271  v274  v277  v281  v283  v284   v285  \\\n",
       "0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000  0.000   \n",
       "1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000  0.000   \n",
       "2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000  0.000   \n",
       "3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 0.000 0.000 10.000   \n",
       "4 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000  0.000   \n",
       "\n",
       "   v286  v289  v291   v294  v296  v297  v301  v303  v305     v307  v309    v310  v314  v320  id_01     id_02  id_03  id_04  id_05  id_06  id_09  id_10   id_11     id_12  id_13 id_15     id_16  \\\n",
       "0 0.000 0.000 1.000  1.000 0.000 0.000 0.000 0.000 1.000  117.000 0.000   0.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "1 0.000 0.000 1.000  0.000 0.000 0.000 0.000 0.000 1.000    0.000 0.000   0.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "2 0.000 0.000 1.000  0.000 0.000 0.000 0.000 0.000 1.000    0.000 0.000   0.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "3 0.000 0.000 1.000 38.000 0.000 0.000 0.000 0.000 1.000 1758.000 0.000 354.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "4 0.000 0.000 1.000  0.000 0.000 0.000 0.000 1.000 1.000    0.000 0.000   0.000 0.000 0.000  0.000 70787.000    NaN    NaN    NaN    NaN    NaN    NaN 100.000  NotFound    NaN   New  NotFound   \n",
       "\n",
       "    id_17  id_18   id_19   id_20 id_28     id_29                id_31 id_35 id_36 id_37 id_38 devicetype                     deviceinfo             ENTITY_ID      EVENT_TIMESTAMP ENTITY_TYPE  \\\n",
       "0     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN   13926.0_315.0_-13.0  2021-01-02 00:00:00        user   \n",
       "1     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN      2755.0_325.0_1.0  2021-01-02 00:00:01        user   \n",
       "2     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN      4663.0_330.0_1.0  2021-01-02 00:01:09        user   \n",
       "3     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN  18132.0_476.0_-111.0  2021-01-02 00:01:39        user   \n",
       "4 166.000    NaN 542.000 144.000   New  NotFound  samsung browser 6.2     T     F     T     T     mobile  SAMSUNG SM-G892A Build/NRD90M      4497.0_420.0_1.0  2021-01-02 00:01:46        user   \n",
       "\n",
       "     EVENT_ID       LABEL_TIMESTAMP  \n",
       "0 2987000.000  2022-01-01T20:30:04Z  \n",
       "1 2987001.000  2022-01-01T20:30:04Z  \n",
       "2 2987002.000  2022-01-01T20:30:04Z  \n",
       "3 2987003.000  2022-01-01T20:30:04Z  \n",
       "4 2987004.000  2022-01-01T20:30:04Z  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "6400b0c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>dist1</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>v62</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v82</th>\n",
       "      <th>v91</th>\n",
       "      <th>v127</th>\n",
       "      <th>v130</th>\n",
       "      <th>v139</th>\n",
       "      <th>v160</th>\n",
       "      <th>v165</th>\n",
       "      <th>v187</th>\n",
       "      <th>v203</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v221</th>\n",
       "      <th>v234</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v283</th>\n",
       "      <th>v285</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109411.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66104.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103183.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>926.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>1411.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:11:15</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548013.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109536.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66229.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103308.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>927.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>693.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:11:29</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548014.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109661.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66354.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103433.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>928.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>1116.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:11:45</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548015.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109786.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66479.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103558.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>929.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>1589.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:12:00</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548016.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>31.950</td>\n",
       "      <td>W</td>\n",
       "      <td>9500.000</td>\n",
       "      <td>321.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>226.000</td>\n",
       "      <td>debit</td>\n",
       "      <td>204.000</td>\n",
       "      <td>74.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>27.950</td>\n",
       "      <td>27.950</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-06-21 23:12:11</td>\n",
       "      <td>9500.0_204.0_150.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548017.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   transactionamt productcd     card1   card2   card3   card5   card6   addr1  dist1 p_emaildomain r_emaildomain    c1    c2    c4    c5    c6    c7    c8    c9   c10   c11   c12    c13   c14   v62  \\\n",
       "0         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "1         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "2         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "3         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "4          31.950         W  9500.000 321.000 150.000 226.000   debit 204.000 74.000           NaN           NaN 3.000 3.000 0.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000  6.000 3.000 1.000   \n",
       "\n",
       "    v70   v76   v78   v82   v91       v127     v130  v139     v160      v165  v187       v203    v207     v209    v210  v221   v234  v257  v258  v261  v264  v266  v267  v271  v274  v277  v283  \\\n",
       "0 0.000   NaN   NaN   NaN   NaN 109411.000 2301.000 0.000 2401.000 66104.000 1.000 103183.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "1 0.000   NaN   NaN   NaN   NaN 109536.000 2301.000 0.000 2401.000 66229.000 1.000 103308.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "2 0.000   NaN   NaN   NaN   NaN 109661.000 2301.000 0.000 2401.000 66354.000 1.000 103433.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "3 0.000   NaN   NaN   NaN   NaN 109786.000 2301.000 0.000 2401.000 66479.000 1.000 103558.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "4 1.000 1.000 2.000 1.000 1.000     27.950   27.950   NaN      NaN       NaN   NaN        NaN     NaN      NaN     NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000   \n",
       "\n",
       "    v285  v289  v291    v294   id_01    id_02  id_05  id_06  id_09  id_13   id_17   id_19   id_20 devicetype deviceinfo      EVENT_TIMESTAMP            ENTITY_ID ENTITY_TYPE    EVENT_ID  EVENT_LABEL  \n",
       "0 26.000 1.000 2.000 926.000 -10.000 1411.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:11:15  15775.0_330.0_129.0        user 3548013.000            0  \n",
       "1 26.000 1.000 2.000 927.000 -10.000  693.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:11:29  15775.0_330.0_129.0        user 3548014.000            0  \n",
       "2 26.000 1.000 2.000 928.000 -10.000 1116.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:11:45  15775.0_330.0_129.0        user 3548015.000            0  \n",
       "3 26.000 1.000 2.000 929.000 -10.000 1589.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:12:00  15775.0_330.0_129.0        user 3548016.000            0  \n",
       "4  1.000 1.000 1.000   0.000     NaN      NaN    NaN    NaN    NaN    NaN     NaN     NaN     NaN        NaN        NaN  2021-06-21 23:12:11   9500.0_204.0_150.0        user 3548017.000            0  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd88bef4",
   "metadata": {},
   "source": [
    "## Step 3: run Autogluon"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c153785",
   "metadata": {},
   "source": [
    "1. The function run_ag below also saves a leaderboard file (leaderboard_xxx.csv) and a test metrics file (test_metrics_xxx.joblib) into {BASE_PATH}/{dataset}/, respectively\n",
    "2. AutoGluon models are saved at {BASE_PATH}/{dataset}/AutogluonModels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "54b6b65a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n",
      "INFO: benchmark_utils.py: (313060, 194)\n",
      "INFO: benchmark_utils.py: (27330, 71)\n",
      "INFO: benchmark_utils.py: (29527, 2)\n",
      "INFO: benchmark_utils.py: (27329, 72)\n",
      "INFO: benchmark_utils.py: 67\n",
      "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n",
      "INFO: autogluon.tabular.predictor.predictor: Presets specified: ['best_quality']\n",
      "INFO: autogluon.tabular.learner.default_learner: Beginning AutoGluon training ... Time limit = 3600s\n",
      "INFO: autogluon.tabular.learner.default_learner: AutoGluon will save models to \"/home/ec2-user/SageMaker/official-dataset-names/IEEE-CIS Fraud Detection/AutogluonModels/ag-20220615_135015_best_quality/\"\n",
      "INFO: autogluon.tabular.learner.default_learner: AutoGluon Version:  0.4.2\n",
      "INFO: autogluon.tabular.learner.default_learner: Python Version:     3.7.10\n",
      "INFO: autogluon.tabular.learner.default_learner: Operating System:   Linux\n",
      "INFO: autogluon.tabular.learner.default_learner: Train Data Rows:    313060\n",
      "INFO: autogluon.tabular.learner.default_learner: Train Data Columns: 67\n",
      "INFO: autogluon.tabular.learner.default_learner: Label Column: EVENT_LABEL\n",
      "INFO: autogluon.tabular.learner.default_learner: Preprocessing data ...\n",
      "Level 25: autogluon.core.utils.utils: AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n",
      "INFO: autogluon.core.utils.utils: \t2 unique label values:  [0, 1]\n",
      "Level 25: autogluon.core.utils.utils: \tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n",
      "INFO: autogluon.core.data.label_cleaner: Selected class <--> label mapping:  class 1 = 1, class 0 = 0\n",
      "INFO: autogluon.tabular.learner.default_learner: Using Feature Generators to preprocess the data ...\n",
      "INFO: autogluon.features.generators.abstract: Fitting AutoMLPipelineFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \tAvailable Memory:                    502950.08 MB\n",
      "INFO: autogluon.features.generators.abstract: \tTrain Data (Original)  Memory Usage: 248.99 MB (0.0% of available memory)\n",
      "INFO: autogluon.features.generators.abstract: \tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "INFO: autogluon.features.generators.abstract: \tStage 1 Generators:\n",
      "INFO: autogluon.features.generators.abstract: \t\tFitting AsTypeFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \tStage 2 Generators:\n",
      "INFO: autogluon.features.generators.abstract: \t\tFitting FillNaFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \tStage 3 Generators:\n",
      "INFO: autogluon.features.generators.abstract: \t\tFitting IdentityFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \t\tFitting CategoryFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \tStage 4 Generators:\n",
      "INFO: autogluon.features.generators.abstract: \t\tFitting DropUniqueFeatureGenerator...\n",
      "INFO: autogluon.features.generators.abstract: \tTypes of features in original data (raw dtype, special dtypes):\n",
      "INFO: autogluon.common.features.feature_metadata: \t\t('float', [])  : 61 | ['transactionamt', 'card1', 'card2', 'card3', 'card5', ...]\n",
      "INFO: autogluon.common.features.feature_metadata: \t\t('object', []) :  6 | ['productcd', 'card6', 'p_emaildomain', 'r_emaildomain', 'devicetype', ...]\n",
      "INFO: autogluon.features.generators.abstract: \tTypes of features in processed data (raw dtype, special dtypes):\n",
      "INFO: autogluon.common.features.feature_metadata: \t\t('category', []) :  6 | ['productcd', 'card6', 'p_emaildomain', 'r_emaildomain', 'devicetype', ...]\n",
      "INFO: autogluon.common.features.feature_metadata: \t\t('float', [])    : 61 | ['transactionamt', 'card1', 'card2', 'card3', 'card5', ...]\n",
      "INFO: autogluon.features.generators.abstract: \t2.6s = Fit runtime\n",
      "INFO: autogluon.features.generators.abstract: \t67 features in original data used to generate 67 features in processed data.\n",
      "INFO: autogluon.features.generators.abstract: \tTrain Data (Processed) Memory Usage: 154.97 MB (0.0% of available memory)\n",
      "INFO: autogluon.tabular.learner.default_learner: Data preprocessing and feature engineering runtime = 2.94s ...\n",
      "Level 25: autogluon.core.trainer.abstract_trainer: AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'\n",
      "Level 25: autogluon.core.trainer.abstract_trainer: \tThis metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \tTo change this, specify the eval_metric parameter of Predictor()\n",
      "INFO: autogluon.core.trainer.abstract_trainer: AutoGluon will fit 2 stack levels (L1 to L2) ...\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting 13 L1 models ...\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 2397.44s of the 3597.05s of remaining time.\n",
      "WARNING: autogluon.core.models.ensemble.bagged_ensemble_model: \tNot enough time to generate out-of-fold predictions for model. Estimated time required was 3589.28s compared to 3115.73s of available time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping KNeighborsUnif_BAG_L1.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 2390.48s of the 3590.09s of remaining time.\n",
      "WARNING: autogluon.core.models.ensemble.bagged_ensemble_model: \tNot enough time to generate out-of-fold predictions for model. Estimated time required was 3399.45s compared to 3106.63s of available time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping KNeighborsDist_BAG_L1.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 2383.77s of the 3583.38s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9629\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t177.58s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t36.28s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBM_BAG_L1 ... Training model for up to 2196.14s of the 3395.75s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.969\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t97.61s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t17.24s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 2095.39s of the 3295.0s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9456\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t15.42s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t24.02s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 2054.81s of the 3254.42s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9474\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t13.73s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t26.66s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: CatBoost_BAG_L1 ... Training model for up to 2012.05s of the 3211.67s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9563\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t1615.45s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t3.48s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesGini_BAG_L1 ... Training model for up to 394.91s of the 1594.53s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9468\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t13.38s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t35.27s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesEntr_BAG_L1 ... Training model for up to 344.0s of the 1543.61s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9505\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t15.11s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t41.22s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 285.31s of the 1484.93s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9086\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t166.51s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t6.71s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: XGBoost_BAG_L1 ... Training model for up to 116.32s of the 1315.93s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9652\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t94.35s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t8.73s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetTorch_BAG_L1 ... Training model for up to 19.33s of the 1218.94s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping NeuralNetTorch_BAG_L1.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMLarge_BAG_L1 ... Training model for up to 3.46s of the 1203.08s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping LightGBMLarge_BAG_L1.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Completed 1/20 k-fold bagging repeats ...\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 1197.96s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9719\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t92.94s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.09s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting 11 L2 models ...\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 1104.9s of the 1104.8s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9747\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t21.16s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t2.79s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBM_BAG_L2 ... Training model for up to 1080.72s of the 1080.63s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9752\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t11.64s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t1.43s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestGini_BAG_L2 ... Training model for up to 1067.58s of the 1067.49s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9632\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t16.3s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t21.33s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestEntr_BAG_L2 ... Training model for up to 1029.08s of the 1028.98s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9649\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t14.04s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t20.67s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: CatBoost_BAG_L2 ... Training model for up to 993.29s of the 993.19s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9762\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t393.78s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t1.04s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesGini_BAG_L2 ... Training model for up to 598.03s of the 597.92s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9641\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t14.12s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t37.16s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesEntr_BAG_L2 ... Training model for up to 545.02s of the 544.92s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9641\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t14.5s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t37.31s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetFastAI_BAG_L2 ... Training model for up to 491.61s of the 491.51s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9736\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t289.96s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t6.78s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: XGBoost_BAG_L2 ... Training model for up to 198.91s of the 198.81s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9752\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t25.26s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t3.99s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetTorch_BAG_L2 ... Training model for up to 171.76s of the 171.67s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping NeuralNetTorch_BAG_L2.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMLarge_BAG_L2 ... Training model for up to 148.85s of the 148.75s of remaining time.\n",
      "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9753\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t44.1s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t2.06s\t = Validation runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Completed 1/20 k-fold bagging repeats ...\n",
      "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the 95.08s of remaining time.\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.9767\t = Validation score   (roc_auc)\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t110.29s\t = Training   runtime\n",
      "INFO: autogluon.core.trainer.abstract_trainer: \t0.13s\t = Validation runtime\n",
      "INFO: autogluon.tabular.learner.default_learner: AutoGluon training complete, total runtime = 3615.62s ... Best model: \"WeightedEnsemble_L3\"\n",
      "INFO: autogluon.tabular.predictor.predictor: TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/home/ec2-user/SageMaker/official-dataset-names/IEEE-CIS Fraud Detection/AutogluonModels/ag-20220615_135015_best_quality/\")\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                      model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order\n",
      "0       WeightedEnsemble_L2       0.889      0.972          17.168        130.219   491.324                    0.008                   0.094             92.936            2       True         10\n",
      "1     ExtraTreesGini_BAG_L1       0.879      0.947           1.356         35.271    13.385                    1.356                  35.271             13.385            1       True          6\n",
      "2   RandomForestEntr_BAG_L1       0.878      0.947           0.948         26.657    13.727                    0.948                  26.657             13.727            1       True          4\n",
      "3     ExtraTreesEntr_BAG_L1       0.878      0.950           1.526         41.223    15.112                    1.526                  41.223             15.112            1       True          7\n",
      "4       WeightedEnsemble_L3       0.876      0.977          36.877        330.085  3131.503                    0.023                   0.126            110.288            3       True         21\n",
      "5   RandomForestGini_BAG_L2       0.873      0.963          26.468        220.941  2225.452                    0.544                  21.332             16.300            2       True         13\n",
      "6   RandomForestGini_BAG_L1       0.872      0.946           0.740         24.016    15.415                    0.740                  24.016             15.415            1       True          3\n",
      "7   RandomForestEntr_BAG_L2       0.871      0.965          26.435        220.275  2223.195                    0.511                  20.666             14.044            2       True         14\n",
      "8     ExtraTreesGini_BAG_L2       0.871      0.964          26.592        236.773  2223.269                    0.669                  37.164             14.118            2       True         16\n",
      "9           CatBoost_BAG_L2       0.868      0.976          26.427        200.651  2602.927                    0.504                   1.042            393.775            2       True         15\n",
      "10    ExtraTreesEntr_BAG_L2       0.868      0.964          26.599        236.924  2223.651                    0.675                  37.315             14.499            2       True         17\n",
      "11          CatBoost_BAG_L1       0.865      0.956           1.264          3.485  1615.448                    1.264                   3.485           1615.448            1       True          5\n",
      "12          LightGBM_BAG_L2       0.864      0.975          26.580        201.041  2220.792                    0.656                   1.432             11.641            2       True         12\n",
      "13           XGBoost_BAG_L2       0.864      0.975          28.676        203.598  2234.414                    2.753                   3.989             25.263            2       True         19\n",
      "14        LightGBMXT_BAG_L2       0.862      0.975          26.927        202.401  2230.314                    1.004                   2.792             21.163            2       True         11\n",
      "15     LightGBMLarge_BAG_L2       0.860      0.975          26.780        201.669  2253.252                    0.857                   2.060             44.101            2       True         20\n",
      "16   NeuralNetFastAI_BAG_L2       0.859      0.974          30.341        206.390  2499.116                    4.417                   6.781            289.964            2       True         18\n",
      "17           XGBoost_BAG_L1       0.857      0.965           3.367          8.731    94.352                    3.367                   8.731             94.352            1       True          9\n",
      "18        LightGBMXT_BAG_L1       0.853      0.963           7.588         36.276   177.582                    7.588                  36.276            177.582            1       True          1\n",
      "19          LightGBM_BAG_L1       0.851      0.969           3.731         17.236    97.615                    3.731                  17.236             97.615            1       True          2\n",
      "20   NeuralNetFastAI_BAG_L1       0.837      0.909           5.404          6.713   166.515                    5.404                   6.713            166.515            1       True          8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO: benchmark_ag.py: auc on test data: 0.8761825926835967\n",
      "INFO: benchmark_ag.py: tpr@1%fpr on test data: 0.4408502772643253\n"
     ]
    }
   ],
   "source": [
    "run_ag(dataset, BASE_PATH, time_limit=3600, presets='best_quality')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f720c57",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_mxnet_latest_p37",
   "language": "python",
   "name": "conda_mxnet_latest_p37"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: scripts/reproducibility/autosklearn/README.md
================================================
## Steps to reproduce Auto-sklearn models


1. Load and save the datasets locally using the [FDB Loader](../../examples/Test_FDB_Loader.ipynb). Note the `{DATASET_PATH}`: the local directory that contains the `train.csv`, `test.csv` and `test_labels.csv` files produced by the FDB loader.

2. Run `benchmark_autosklearn.py` as follows:
```
python3 benchmark_autosklearn.py {DATASET_PATH}
```

3. After a successful run, the script saves its results in `{DATASET_PATH}`. The evaluation metrics on `test.csv` are saved in `test_metrics_autosklearn.joblib`.
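
Among the saved metrics, `tpr@1%fpr` is the true positive rate at a 1% false positive rate, which `benchmark_autosklearn.py` reads off the ROC curve with `np.interp`. The sketch below illustrates that interpolation without any dependencies; it is not the script's code. `tpr_at_fpr` is a hypothetical helper, the ROC points are made up, and `fpr` is assumed to be sorted in increasing order.

```python
from bisect import bisect_right


def tpr_at_fpr(fpr, tpr, fpr_target=0.01):
    """Linearly interpolate the TPR at fpr_target (hypothetical helper).

    Assumes fpr is sorted in increasing order and has the same length as tpr.
    """
    # Outside the curve's range, clamp to the first or last point.
    if fpr_target <= fpr[0]:
        return tpr[0]
    if fpr_target >= fpr[-1]:
        return tpr[-1]
    # Find the two ROC points that bracket the target FPR.
    hi = bisect_right(fpr, fpr_target)
    lo = hi - 1
    weight = (fpr_target - fpr[lo]) / (fpr[hi] - fpr[lo])
    return tpr[lo] + weight * (tpr[hi] - tpr[lo])


# Made-up ROC points, for illustration only.
example_fpr = [0.0, 0.005, 0.02, 1.0]
example_tpr = [0.0, 0.30, 0.60, 1.0]
print(tpr_at_fpr(example_fpr, example_tpr))  # roughly 0.4
```

A target of 0.01 lies one third of the way between the ROC points at FPR 0.005 and 0.02, so the interpolated TPR lies one third of the way between 0.30 and 0.60.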

*Note: Python 3.7+ is required to run the version of auto-sklearn used here and to reproduce the results. Like other AutoML frameworks, auto-sklearn is not perfectly reproducible because some underlying models are not deterministically seeded. However, the variation in results is within acceptable error.*
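
Before running step 2, it can help to confirm that `{DATASET_PATH}` contains the three files the script reads. The following is a minimal sketch of such a check, not part of the repository: `missing_dataset_files` and `REQUIRED_FILES` are hypothetical names, and the path in the usage line is a placeholder.

```python
from pathlib import Path

# The three files that benchmark_autosklearn.py reads from the dataset directory.
REQUIRED_FILES = ("train.csv", "test.csv", "test_labels.csv")


def missing_dataset_files(dataset_path):
    """Return the names of required files absent from dataset_path (hypothetical helper)."""
    root = Path(dataset_path)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Placeholder path; replace with your own {DATASET_PATH}.
    missing = missing_dataset_files("path/to/dataset")
    print("missing files:", missing)
```

An empty list means all three files are present.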


================================================
FILE: scripts/reproducibility/autosklearn/benchmark_autosklearn.py
================================================

import json
import joblib
import datetime
import numpy as np
import pandas as pd
import os, sys, shutil

from autosklearn.metrics import roc_auc, log_loss
from autosklearn.classification import AutoSklearnClassifier

from sklearn.metrics import roc_auc_score, roc_curve
from pandas.api.types import is_numeric_dtype, is_string_dtype

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)

logging_config = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {
            'format': '%(levelname)-8s %(name)-15s %(message)s'
        }
    },
    'handlers':{
        'console_handler': {
            'class': 'logging.StreamHandler',
            'formatter': 'simple'
        },
        'file_handler': {
            'class':'logging.FileHandler',
            'mode': 'a',
            'encoding': 'utf-8',
            'filename':'main.log',
            'formatter': 'simple'
        },
        'spec_handler':{
            'class':'logging.FileHandler',
            'filename':'dummy_autosklearn.log',
            'formatter': 'simple'
        },
        'distributed_logfile':{
            'filename':'distributed.log',
            'class': 'logging.FileHandler',
            'formatter': 'simple',
            'level': 'DEBUG'
        }
    },
    'loggers': {
        '': {
            'level': 'INFO',
            'handlers':['file_handler', 'console_handler']
        },
        'autosklearn': {
            'level': 'INFO',
            'propagate': False,
            'handlers': ['spec_handler']
        },
        'smac': {
            'level': 'INFO',
            'propagate': False,
            'handlers': ['spec_handler']
        },
        'EnsembleBuilder': {
            'level': 'INFO',
            'propagate': False,
            'handlers': ['spec_handler']
        },
    },
}

def load_data(dataset_path):
    logger.info(dataset_path)
    
    df_train = pd.read_csv(f"{dataset_path}/train.csv", lineterminator='\n')
    logger.info(df_train.shape)
    
    df_test = pd.read_csv(f"{dataset_path}/test.csv")
    logger.info(df_test.shape)
    
    df_test_labels = pd.read_csv(f"{dataset_path}/test_labels.csv")
    logger.info(df_test_labels.shape)
    
    df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID")
    logger.info(df_test.shape)
    
    
    features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID")
    features = [x for x in df_test.columns if x not in features_to_exclude ]
    logger.info(len(features))
    logger.info(features)
    
    return features, df_train, df_test


def get_recall(fpr, tpr, fpr_target=0.01):
    # TPR (recall) at fpr_target, linearly interpolated along the ROC curve.
    return np.interp(fpr_target, fpr, tpr)


def run_autosklearn(dataset_path):
    
    features, df_train, df_test = load_data(dataset_path)

    dateTimeObj = datetime.datetime.now()
    timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S")
    
    numeric_features = [f for f in features if is_numeric_dtype(df_train[f])]
    categorical_features = [f for f in features if f not in numeric_features]
    logger.info(f'categorical: {categorical_features}')
    logger.info(f'numeric: {numeric_features}')
    
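    labels = sorted(df_train['EVENT_LABEL'].unique())
    # Map the two label values to 0/1 and assign the result back; inplace=True on a
    # column selection may not update the DataFrame under pandas copy-on-write.
    label_map = {labels[0]: 0, labels[1]: 1}
    df_train['EVENT_LABEL'] = df_train['EVENT_LABEL'].replace(label_map)
    df_test['EVENT_LABEL'] = df_test['EVENT_LABEL'].replace(label_map)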
    
    for df in [df_train, df_test]:
        df[categorical_features] = df[categorical_features].fillna('<nan>')
        df[categorical_features] = df[categorical_features].astype('category')
    
    out_dir = f"{dataset_path}/AutoSklearnModels/"
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    
    automl = AutoSklearnClassifier(
        metric=roc_auc,
        scoring_functions=[roc_auc, log_loss],
        tmp_folder=out_dir, # for debugging
        delete_tmp_folder_after_terminate=False,
        logging_config=logging_config,
        n_jobs=-1,
        memory_limit=None
    )
    
    assert len(categorical_features) + len(numeric_features) == len(features)
    
    logger.info('Fitting')
    automl.fit(df_train[features], df_train['EVENT_LABEL'])
    joblib.dump(automl, f"{dataset_path}/automl.joblib")
    
    cv = pd.DataFrame(automl.cv_results_)
    cv.to_csv(f"{dataset_path}/cv_results_autosklearn.csv", index=False)
    
    df_pred = automl.predict_proba(df_test[features])[:,1]
    
    auc_score = roc_auc_score(df_test['EVENT_LABEL'], df_pred)
    logger.info(f"auc on test data: {auc_score}")
    
    fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred)
        
    recall = get_recall(fpr, tpr, fpr_target=0.01)
    logger.info(f"tpr@1%fpr on test data: {recall}")
    
    test_metrics = {
        "labels": df_test['EVENT_LABEL'],
        "pred_prob": df_pred,
        "auc": auc_score,
        "tpr@1%fpr": recall,
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds
    }
    joblib.dump(test_metrics, f"{dataset_path}/test_metrics_autosklearn.joblib")
    
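if __name__ == "__main__":
    args = sys.argv
    logger.info(args)
    if len(args) != 2:
        # Exactly one argument is expected: the dataset directory.
        sys.exit(f"Usage: python3 {os.path.basename(__file__)} {{DATASET_PATH}}")
    run_autosklearn(args[1])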
    

================================================
FILE: scripts/reproducibility/benchmark_utils.py
================================================
import numpy as np
import pandas as pd
import os

import matplotlib as mpl

mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)


def load_data(dataset, base_path):
    logger.info(dataset)
    
    df_train = pd.read_csv(f"{base_path}/{dataset}/train.csv", lineterminator='\n')
    logger.info(df_train.shape)
    
    df_test = pd.read_csv(f"{base_path}/{dataset}/test.csv")
    logger.info(df_test.shape)
    
    df_test_labels = pd.read_csv(f"{base_path}/{dataset}/test_labels.csv")
    logger.info(df_test_labels.shape)
    
    df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID")
    logger.info(df_test.shape)
    
    
    features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID")
    features = [x for x in df_test.columns if x not in features_to_exclude ]
    logger.info(len(features))
    logger.info(features)
    
    return features, df_train, df_test

def get_recall(fpr, tpr, fpr_target=0.01):
    # TPR (recall) at fpr_target, linearly interpolated along the ROC curve.
    return np.interp(fpr_target, fpr, tpr)

================================================
FILE: scripts/reproducibility/h2o/README.md
================================================
- benchmark_h2o.py: a script for h2o benchmarking
- example-h2o-ieeecis.ipynb: an example notebook using benchmark_h2o.py

Note that h2o is not perfectly reproducible because some underlying models are not deterministically seeded, so you might see slightly different results from those in the paper.


================================================
FILE: scripts/reproducibility/h2o/benchmark_h2o.py
================================================
import pandas as pd
import os
import gc
import joblib

import matplotlib as mpl
from sklearn.metrics import roc_auc_score, roc_curve

mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)

import sys
sys.path.append('../')
from benchmark_utils import load_data, get_recall

import h2o
from h2o.automl import H2OAutoML
    
def run_h2o(dataset, base_path, connect_url=None, time_limit=None, include_algos=None, exclude_algos=None, verbosity="info", seed=10):
    if connect_url is not None:
        _ = h2o.connect(url=connect_url, https=True, verbose=True)
        h2o.cluster().show_status(True)
    else:
        h2o.init()
    
    gc.collect()
    features, df_train, df_test = load_data(dataset, base_path)
    
    df_train_h2o = h2o.H2OFrame(df_train)
    feature_types_h2o = {k:df_train_h2o.types[k] for k in df_train_h2o.types if k in features}
    # force test schema the same as train schema, otherwise predict will throw errors
    df_test_h2o = h2o.H2OFrame(df_test, column_types=feature_types_h2o)
    
    df_train_h2o['EVENT_LABEL'] = df_train_h2o['EVENT_LABEL'].asfactor()
    df_test_h2o['EVENT_LABEL'] = df_test_h2o['EVENT_LABEL'].asfactor()
        
    aml = H2OAutoML(max_runtime_secs=time_limit,
                    seed=seed,
                    include_algos=include_algos,
                    exclude_algos=exclude_algos,
                    export_checkpoints_dir=f"{base_path}/{dataset}/H2OModels/",
                    verbosity=verbosity)

    # use validation error in the leaderboard to avoid leakage when calling aml.predict
    aml.train(x=features,
              y='EVENT_LABEL',
              training_frame=df_train_h2o)
    
    lb = aml.leaderboard
    # lb.head(rows=lb.nrows)
    
    h2o.h2o.download_csv(lb, f"{base_path}/{dataset}/leaderboard_h2o.csv")
    
    lb_2 = h2o.automl.get_leaderboard(aml, extra_columns = "ALL")
    h2o.h2o.download_csv(lb_2, f"{base_path}/{dataset}/leaderboard_h2o_full.csv")
    # Get training timing info
    info = aml.training_info
    joblib.dump(info, f"{base_path}/{dataset}/training_info.joblib")
    
    df_pred_h2o = aml.predict(df_test_h2o[features])
    pos_label = df_test_h2o['EVENT_LABEL'].levels()[0][-1] # levels are ordered alphabetically

    pos_label2 = 'p'+pos_label if pos_label=='1' else pos_label
    df_pred_h2o = (h2o.as_list(df_pred_h2o[pos_label2]))[pos_label2]

    auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred_h2o)
    logger.info(f"auc on test data: {auc}")
    
    fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'].astype(str), df_pred_h2o, 
                                     pos_label=pos_label)
    
    y_true = df_test['EVENT_LABEL']
    y_true = (y_true.astype(str)==pos_label)
    
    recall = get_recall(fpr, tpr, fpr_target=0.01)
    logger.info(f"tpr@1%fpr on test data: {recall}")

    test_metrics_h2o = {
        "pos_label": pos_label,
        "labels": df_test['EVENT_LABEL'],
        "pred_prob": df_pred_h2o,
        "auc": auc,
        "tpr@1%fpr": recall,
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds
    }
    joblib.dump(test_metrics_h2o, f"{base_path}/{dataset}/test_metrics_h2o.joblib")
    
    h2o.cluster().shutdown(prompt=False)

================================================
FILE: scripts/reproducibility/h2o/example-h2o-ieeecis.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "afc2eecf",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f00a81aa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container { width:90% }</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "from IPython.display import clear_output\n",
    "display(HTML(\"<style>.container { width:90% }</style>\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "11759d10",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2baa2261",
   "metadata": {},
   "source": [
    "## Step 1: pip install required packages if not installed already"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1efdc80c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install h2o\n",
    "import benchmark_h2o\n",
    "from benchmark_h2o import load_data, run_h2o"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b1e0a24",
   "metadata": {},
   "source": [
    "## Step 2: download data using fdb\n",
    "Example: https://github.com/amazon-research/fraud-dataset-benchmark/blob/main/scripts/examples/Test_FDB_Loader.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "0a34e883",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is where datasets are stored: {BASE_PATH}/{dataset}/\n",
    "BASE_PATH = \"/home/ec2-user/SageMaker/official-dataset-names\"\n",
    "dataset = \"IEEE-CIS Fraud Detection\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8aed893e",
   "metadata": {},
   "source": [
    "Make sure three files are downloaded:\n",
    "1. {BASE_PATH}/{dataset}/train.csv\n",
    "2. {BASE_PATH}/{dataset}/test.csv\n",
    "3. {BASE_PATH}/{dataset}/test_labels.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b02d77e0",
   "metadata": {},
   "source": [
    "## Step 3: look at data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9dfd0df9",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n",
      "INFO: benchmark_utils.py: (313060, 194)\n",
      "INFO: benchmark_utils.py: (27330, 71)\n",
      "INFO: benchmark_utils.py: (29527, 2)\n",
      "INFO: benchmark_utils.py: (27329, 72)\n",
      "INFO: benchmark_utils.py: 67\n",
      "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n"
     ]
    }
   ],
   "source": [
    "features, df_train, df_test = load_data(dataset, BASE_PATH)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "eebaa1d5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>addr2</th>\n",
       "      <th>dist1</th>\n",
       "      <th>dist2</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>d1</th>\n",
       "      <th>d2</th>\n",
       "      <th>d3</th>\n",
       "      <th>d4</th>\n",
       "      <th>d5</th>\n",
       "      <th>d10</th>\n",
       "      <th>d11</th>\n",
       "      <th>d15</th>\n",
       "      <th>m1</th>\n",
       "      <th>m2</th>\n",
       "      <th>m3</th>\n",
       "      <th>m4</th>\n",
       "      <th>m6</th>\n",
       "      <th>m7</th>\n",
       "      <th>m8</th>\n",
       "      <th>m9</th>\n",
       "      <th>v1</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v6</th>\n",
       "      <th>v8</th>\n",
       "      <th>v11</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v17</th>\n",
       "      <th>v20</th>\n",
       "      <th>v23</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v30</th>\n",
       "      <th>v36</th>\n",
       "      <th>v37</th>\n",
       "      <th>v40</th>\n",
       "      <th>v41</th>\n",
       "      <th>v44</th>\n",
       "      <th>v47</th>\n",
       "      <th>v48</th>\n",
       "      <th>v54</th>\n",
       "      <th>v56</th>\n",
       "      <th>v59</th>\n",
       "      <th>v62</th>\n",
       "      <th>v65</th>\n",
       "      <th>v67</th>\n",
       "      <th>v68</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v80</th>\n",
       "      <th>v82</th>\n",
       "      <th>v86</th>\n",
       "      <th>v88</th>\n",
       "      <th>v89</th>\n",
       "      <th>v91</th>\n",
       "      <th>v107</th>\n",
       "      <th>v108</th>\n",
       "      <th>v111</th>\n",
       "      <th>v115</th>\n",
       "      <th>v117</th>\n",
       "      <th>v120</th>\n",
       "      <th>v121</th>\n",
       "      <th>v123</th>\n",
       "      <th>v124</th>\n",
       "      <th>v127</th>\n",
       "      <th>v129</th>\n",
       "      <th>v130</th>\n",
       "      <th>v136</th>\n",
       "      <th>v138</th>\n",
       "      <th>v139</th>\n",
       "      <th>v142</th>\n",
       "      <th>v147</th>\n",
       "      <th>v156</th>\n",
       "      <th>v160</th>\n",
       "      <th>v162</th>\n",
       "      <th>v165</th>\n",
       "      <th>v166</th>\n",
       "      <th>v169</th>\n",
       "      <th>v171</th>\n",
       "      <th>v173</th>\n",
       "      <th>v175</th>\n",
       "      <th>v176</th>\n",
       "      <th>v178</th>\n",
       "      <th>v180</th>\n",
       "      <th>v182</th>\n",
       "      <th>v185</th>\n",
       "      <th>v187</th>\n",
       "      <th>v188</th>\n",
       "      <th>v198</th>\n",
       "      <th>v203</th>\n",
       "      <th>v205</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v215</th>\n",
       "      <th>v218</th>\n",
       "      <th>v220</th>\n",
       "      <th>v221</th>\n",
       "      <th>v223</th>\n",
       "      <th>v224</th>\n",
       "      <th>v226</th>\n",
       "      <th>v228</th>\n",
       "      <th>v229</th>\n",
       "      <th>v234</th>\n",
       "      <th>v235</th>\n",
       "      <th>v238</th>\n",
       "      <th>v240</th>\n",
       "      <th>v250</th>\n",
       "      <th>v252</th>\n",
       "      <th>v253</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v260</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v281</th>\n",
       "      <th>v283</th>\n",
       "      <th>v284</th>\n",
       "      <th>v285</th>\n",
       "      <th>v286</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>v296</th>\n",
       "      <th>v297</th>\n",
       "      <th>v301</th>\n",
       "      <th>v303</th>\n",
       "      <th>v305</th>\n",
       "      <th>v307</th>\n",
       "      <th>v309</th>\n",
       "      <th>v310</th>\n",
       "      <th>v314</th>\n",
       "      <th>v320</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_03</th>\n",
       "      <th>id_04</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_10</th>\n",
       "      <th>id_11</th>\n",
       "      <th>id_12</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_15</th>\n",
       "      <th>id_16</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_18</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>id_28</th>\n",
       "      <th>id_29</th>\n",
       "      <th>id_31</th>\n",
       "      <th>id_35</th>\n",
       "      <th>id_36</th>\n",
       "      <th>id_37</th>\n",
       "      <th>id_38</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>68.500</td>\n",
       "      <td>W</td>\n",
       "      <td>13926.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>150.000</td>\n",
       "      <td>142.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>315.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>19.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>14.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.000</td>\n",
       "      <td>12.000</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>M2</td>\n",
       "      <td>T</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>117.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>117.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13926.0_315.0_-13.0</td>\n",
       "      <td>2021-01-02 00:00:00</td>\n",
       "      <td>user</td>\n",
       "      <td>2987000.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>29.000</td>\n",
       "      <td>W</td>\n",
       "      <td>2755.000</td>\n",
       "      <td>404.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>325.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>M0</td>\n",
       "      <td>T</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2755.0_325.0_1.0</td>\n",
       "      <td>2021-01-02 00:00:01</td>\n",
       "      <td>user</td>\n",
       "      <td>2987001.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>59.000</td>\n",
       "      <td>W</td>\n",
       "      <td>4663.000</td>\n",
       "      <td>490.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>debit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>287.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>outlook.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-1.001</td>\n",
       "      <td>313.999</td>\n",
       "      <td>313.999</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>M0</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>F</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4663.0_330.0_1.0</td>\n",
       "      <td>2021-01-02 00:01:09</td>\n",
       "      <td>user</td>\n",
       "      <td>2987002.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>50.000</td>\n",
       "      <td>W</td>\n",
       "      <td>18132.000</td>\n",
       "      <td>567.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>117.000</td>\n",
       "      <td>debit</td>\n",
       "      <td>476.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>4.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>25.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>112.000</td>\n",
       "      <td>112.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>92.999</td>\n",
       "      <td>0.000</td>\n",
       "      <td>82.999</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109.999</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>M0</td>\n",
       "      <td>F</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1758.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>354.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>10.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>38.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1758.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>354.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18132.0_476.0_-111.0</td>\n",
       "      <td>2021-01-02 00:01:39</td>\n",
       "      <td>user</td>\n",
       "      <td>2987003.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>50.000</td>\n",
       "      <td>H</td>\n",
       "      <td>4497.000</td>\n",
       "      <td>514.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>420.000</td>\n",
       "      <td>87.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>169690.800</td>\n",
       "      <td>0.000</td>\n",
       "      <td>5155.000</td>\n",
       "      <td>2840.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>70787.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100.000</td>\n",
       "      <td>NotFound</td>\n",
       "      <td>NaN</td>\n",
       "      <td>New</td>\n",
       "      <td>NotFound</td>\n",
       "      <td>166.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>542.000</td>\n",
       "      <td>144.000</td>\n",
       "      <td>New</td>\n",
       "      <td>NotFound</td>\n",
       "      <td>samsung browser 6.2</td>\n",
       "      <td>T</td>\n",
       "      <td>F</td>\n",
       "      <td>T</td>\n",
       "      <td>T</td>\n",
       "      <td>mobile</td>\n",
       "      <td>SAMSUNG SM-G892A Build/NRD90M</td>\n",
       "      <td>4497.0_420.0_1.0</td>\n",
       "      <td>2021-01-02 00:01:46</td>\n",
       "      <td>user</td>\n",
       "      <td>2987004.000</td>\n",
       "      <td>2022-01-01T20:30:04Z</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL  transactionamt productcd     card1   card2   card3   card5   card6   addr1  addr2   dist1  dist2 p_emaildomain r_emaildomain    c1    c2    c4    c5    c6    c7    c8    c9   c10  \\\n",
       "0            0          68.500         W 13926.000     NaN 150.000 142.000  credit 315.000 87.000  19.000    NaN           NaN           NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000   \n",
       "1            0          29.000         W  2755.000 404.000 150.000 102.000  credit 325.000 87.000     NaN    NaN     gmail.com           NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000   \n",
       "2            0          59.000         W  4663.000 490.000 150.000 166.000   debit 330.000 87.000 287.000    NaN   outlook.com           NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000   \n",
       "3            0          50.000         W 18132.000 567.000 150.000 117.000   debit 476.000 87.000     NaN    NaN     yahoo.com           NaN 2.000 5.000 0.000 0.000 4.000 0.000 0.000 1.000 0.000   \n",
       "4            0          50.000         H  4497.000 514.000 150.000 102.000  credit 420.000 87.000     NaN    NaN     gmail.com           NaN 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000   \n",
       "\n",
       "    c11   c12    c13   c14      d1      d2     d3     d4    d5    d10     d11     d15   m1   m2   m3   m4   m6   m7   m8   m9    v1    v3    v4    v6    v8   v11   v13   v14   v17   v20   v23   v26  \\\n",
       "0 2.000 0.000  1.000 1.000  14.000     NaN 13.000    NaN   NaN 12.000  12.000  -1.000    T    T    T   M2    T  NaN  NaN  NaN 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000   \n",
       "1 1.000 0.000  1.000 1.000   0.000     NaN    NaN -1.000   NaN -1.000     NaN  -1.000  NaN  NaN  NaN   M0    T  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000 1.000 1.000 1.000   \n",
       "2 1.000 0.000  1.000 1.000   0.000     NaN    NaN -1.001   NaN -1.001 313.999 313.999    T    T    T   M0    F    F    F    F 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000   \n",
       "3 1.000 0.000 25.000 1.000 112.000 112.000  0.000 92.999 0.000 82.999     NaN 109.999  NaN  NaN  NaN   M0    F  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000 1.000 0.000 1.000 1.000 1.000   \n",
       "4 1.000 0.000  1.000 1.000   0.000     NaN    NaN    NaN   NaN    NaN     NaN     NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "\n",
       "    v27   v30   v36   v37   v40   v41   v44   v47   v48   v54   v56   v59   v62   v65   v67   v68   v70   v76   v78   v80   v82   v86   v88   v89   v91  v107  v108  v111  v115  v117  v120  v121  \\\n",
       "0 0.000 0.000   NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "1 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "2 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "3 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "4   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000 1.000 1.000 1.000 1.000 1.000 1.000   \n",
       "\n",
       "   v123  v124     v127  v129    v130  v136  v138  v139  v142  v147  v156       v160  v162     v165     v166  v169  v171  v173  v175  v176  v178  v180  v182  v185  v187  v188  v198  v203  v205  v207  \\\n",
       "0 1.000 1.000  117.000 0.000   0.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "1 1.000 1.000    0.000 0.000   0.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2 1.000 1.000    0.000 0.000   0.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "3 1.000 1.000 1758.000 0.000 354.000 0.000   NaN   NaN   NaN   NaN   NaN        NaN   NaN      NaN      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "4 1.000 1.000    0.000 0.000   0.000 0.000 0.000 0.000 0.000 0.000 0.000 169690.800 0.000 5155.000 2840.000 0.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000   \n",
       "\n",
       "   v209  v210  v215  v218  v220  v221  v223  v224  v226  v228  v229  v234  v235  v238  v240  v250  v252  v253  v257  v258  v260  v261  v264  v266  v267  v271  v274  v277  v281  v283  v284   v285  \\\n",
       "0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000  0.000   \n",
       "1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000  0.000   \n",
       "2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 1.000 0.000  0.000   \n",
       "3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 0.000 0.000 0.000 10.000   \n",
       "4 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000  0.000   \n",
       "\n",
       "   v286  v289  v291   v294  v296  v297  v301  v303  v305     v307  v309    v310  v314  v320  id_01     id_02  id_03  id_04  id_05  id_06  id_09  id_10   id_11     id_12  id_13 id_15     id_16  \\\n",
       "0 0.000 0.000 1.000  1.000 0.000 0.000 0.000 0.000 1.000  117.000 0.000   0.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "1 0.000 0.000 1.000  0.000 0.000 0.000 0.000 0.000 1.000    0.000 0.000   0.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "2 0.000 0.000 1.000  0.000 0.000 0.000 0.000 0.000 1.000    0.000 0.000   0.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "3 0.000 0.000 1.000 38.000 0.000 0.000 0.000 0.000 1.000 1758.000 0.000 354.000 0.000 0.000    NaN       NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN       NaN    NaN   NaN       NaN   \n",
       "4 0.000 0.000 1.000  0.000 0.000 0.000 0.000 1.000 1.000    0.000 0.000   0.000 0.000 0.000  0.000 70787.000    NaN    NaN    NaN    NaN    NaN    NaN 100.000  NotFound    NaN   New  NotFound   \n",
       "\n",
       "    id_17  id_18   id_19   id_20 id_28     id_29                id_31 id_35 id_36 id_37 id_38 devicetype                     deviceinfo             ENTITY_ID      EVENT_TIMESTAMP ENTITY_TYPE  \\\n",
       "0     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN   13926.0_315.0_-13.0  2021-01-02 00:00:00        user   \n",
       "1     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN      2755.0_325.0_1.0  2021-01-02 00:00:01        user   \n",
       "2     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN      4663.0_330.0_1.0  2021-01-02 00:01:09        user   \n",
       "3     NaN    NaN     NaN     NaN   NaN       NaN                  NaN   NaN   NaN   NaN   NaN        NaN                            NaN  18132.0_476.0_-111.0  2021-01-02 00:01:39        user   \n",
       "4 166.000    NaN 542.000 144.000   New  NotFound  samsung browser 6.2     T     F     T     T     mobile  SAMSUNG SM-G892A Build/NRD90M      4497.0_420.0_1.0  2021-01-02 00:01:46        user   \n",
       "\n",
       "     EVENT_ID       LABEL_TIMESTAMP  \n",
       "0 2987000.000  2022-01-01T20:30:04Z  \n",
       "1 2987001.000  2022-01-01T20:30:04Z  \n",
       "2 2987002.000  2022-01-01T20:30:04Z  \n",
       "3 2987003.000  2022-01-01T20:30:04Z  \n",
       "4 2987004.000  2022-01-01T20:30:04Z  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c89c46e9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>dist1</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>v62</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v82</th>\n",
       "      <th>v91</th>\n",
       "      <th>v127</th>\n",
       "      <th>v130</th>\n",
       "      <th>v139</th>\n",
       "      <th>v160</th>\n",
       "      <th>v165</th>\n",
       "      <th>v187</th>\n",
       "      <th>v203</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v221</th>\n",
       "      <th>v234</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v283</th>\n",
       "      <th>v285</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109411.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66104.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103183.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>926.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>1411.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:11:15</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548013.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109536.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66229.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103308.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>927.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>693.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:11:29</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548014.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109661.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66354.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103433.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>928.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>1116.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:11:45</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548015.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>125.000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.000</td>\n",
       "      <td>481.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>102.000</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>8.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>61.000</td>\n",
       "      <td>5.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109786.000</td>\n",
       "      <td>2301.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>2401.000</td>\n",
       "      <td>66479.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>103558.000</td>\n",
       "      <td>877.000</td>\n",
       "      <td>1961.000</td>\n",
       "      <td>465.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>73.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>26.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>929.000</td>\n",
       "      <td>-10.000</td>\n",
       "      <td>1589.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>52.000</td>\n",
       "      <td>166.000</td>\n",
       "      <td>633.000</td>\n",
       "      <td>533.000</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>2021-06-21 23:12:00</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548016.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>31.950</td>\n",
       "      <td>W</td>\n",
       "      <td>9500.000</td>\n",
       "      <td>321.000</td>\n",
       "      <td>150.000</td>\n",
       "      <td>226.000</td>\n",
       "      <td>debit</td>\n",
       "      <td>204.000</td>\n",
       "      <td>74.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>6.000</td>\n",
       "      <td>3.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>2.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>27.950</td>\n",
       "      <td>27.950</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>1.000</td>\n",
       "      <td>0.000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2021-06-21 23:12:11</td>\n",
       "      <td>9500.0_204.0_150.0</td>\n",
       "      <td>user</td>\n",
       "      <td>3548017.000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   transactionamt productcd     card1   card2   card3   card5   card6   addr1  dist1 p_emaildomain r_emaildomain    c1    c2    c4    c5    c6    c7    c8    c9   c10   c11   c12    c13   c14   v62  \\\n",
       "0         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "1         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "2         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "3         125.000         S 15775.000 481.000 150.000 102.000  credit 330.000    NaN           NaN     yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000   \n",
       "4          31.950         W  9500.000 321.000 150.000 226.000   debit 204.000 74.000           NaN           NaN 3.000 3.000 0.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000  6.000 3.000 1.000   \n",
       "\n",
       "    v70   v76   v78   v82   v91       v127     v130  v139     v160      v165  v187       v203    v207     v209    v210  v221   v234  v257  v258  v261  v264  v266  v267  v271  v274  v277  v283  \\\n",
       "0 0.000   NaN   NaN   NaN   NaN 109411.000 2301.000 0.000 2401.000 66104.000 1.000 103183.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "1 0.000   NaN   NaN   NaN   NaN 109536.000 2301.000 0.000 2401.000 66229.000 1.000 103308.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "2 0.000   NaN   NaN   NaN   NaN 109661.000 2301.000 0.000 2401.000 66354.000 1.000 103433.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "3 0.000   NaN   NaN   NaN   NaN 109786.000 2301.000 0.000 2401.000 66479.000 1.000 103558.000 877.000 1961.000 465.000 0.000 73.000   NaN   NaN   NaN   NaN   NaN   NaN 0.000   NaN   NaN 1.000   \n",
       "4 1.000 1.000 2.000 1.000 1.000     27.950   27.950   NaN      NaN       NaN   NaN        NaN     NaN      NaN     NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 1.000   \n",
       "\n",
       "    v285  v289  v291    v294   id_01    id_02  id_05  id_06  id_09  id_13   id_17   id_19   id_20 devicetype deviceinfo      EVENT_TIMESTAMP            ENTITY_ID ENTITY_TYPE    EVENT_ID  EVENT_LABEL  \n",
       "0 26.000 1.000 2.000 926.000 -10.000 1411.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:11:15  15775.0_330.0_129.0        user 3548013.000            0  \n",
       "1 26.000 1.000 2.000 927.000 -10.000  693.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:11:29  15775.0_330.0_129.0        user 3548014.000            0  \n",
       "2 26.000 1.000 2.000 928.000 -10.000 1116.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:11:45  15775.0_330.0_129.0        user 3548015.000            0  \n",
       "3 26.000 1.000 2.000 929.000 -10.000 1589.000  6.000  0.000  0.000 52.000 166.000 633.000 533.000    desktop    Windows  2021-06-21 23:12:00  15775.0_330.0_129.0        user 3548016.000            0  \n",
       "4  1.000 1.000 1.000   0.000     NaN      NaN    NaN    NaN    NaN    NaN     NaN     NaN     NaN        NaN        NaN  2021-06-21 23:12:11   9500.0_204.0_150.0        user 3548017.000            0  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71fd76ac",
   "metadata": {},
   "source": [
    "## Step 3: run H2O"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7393d373",
   "metadata": {},
   "source": [
    "1. The function run_h2o below also saves a leaderboard file (leaderboard_xxx.csv) and a test metrics file (test_metrics_xxx.joblib) into {BASE_PATH}/{dataset}/, respectively\n",
    "2. H2O models are saved at {BASE_PATH}/{dataset}/H2OModels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f851663a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# H2OStartupError: Your java is not supported: java version \"1.7.0_261\"; OpenJDK Runtime Environment (amzn-2.6.22.1.84.amzn1-x86_64 u261-b02); OpenJDK 64-Bit Server VM (build 24.261-b02, mixed mode)\n",
    "# If you see this error above, you may need to run the following instructions:\n",
    "\n",
    "# !sudo yum install -y java-1.8.0-openjdk.x86_64\n",
    "# !sudo /usr/sbin/alternatives --set java /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java\n",
    "# !sudo /usr/sbin/alternatives --set javac /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/javac"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d259b767",
   "metadata": {},
   "source": [
    "https://swiftotter.com/technical/amazon-aws-jenkins-2-60-1-java-8-update#/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "63ec8755",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "openjdk version \"1.8.0_312\"\n",
      "OpenJDK Runtime Environment (build 1.8.0_312-b07)\n",
      "OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)\n"
     ]
    }
   ],
   "source": [
    "!java -version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "9dc0646d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'3.36.1.2'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import h2o\n",
    "h2o.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "7c060159",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.\n",
      "Attempting to start a local H2O server...\n",
      "  Java Version: openjdk version \"1.8.0_312\"; OpenJDK Runtime Environment (build 1.8.0_312-b07); OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)\n",
      "  Starting server from /home/ec2-user/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar\n",
      "  Ice root: /tmp/tmpag6zcv5a\n",
      "  JVM stdout: /tmp/tmpag6zcv5a/h2o_ec2_user_started_from_python.out\n",
      "  JVM stderr: /tmp/tmpag6zcv5a/h2o_ec2_user_started_from_python.err\n",
      "  Server is running at http://127.0.0.1:54321\n",
      "Connecting to H2O server at http://127.0.0.1:54321 ... successful.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"overflow:auto\"><table style=\"width:50%\"><tr><td>H2O_cluster_uptime:</td>\n",
       "<td>02 secs</td></tr>\n",
       "<tr><td>H2O_cluster_timezone:</td>\n",
       "<td>UTC</td></tr>\n",
       "<tr><td>H2O_data_parsing_timezone:</td>\n",
       "<td>UTC</td></tr>\n",
       "<tr><td>H2O_cluster_version:</td>\n",
       "<td>3.36.1.2</td></tr>\n",
       "<tr><td>H2O_cluster_version_age:</td>\n",
       "<td>20 days </td></tr>\n",
       "<tr><td>H2O_cluster_name:</td>\n",
       "<td>H2O_from_python_ec2_user_t9z3ig</td></tr>\n",
       "<tr><td>H2O_cluster_total_nodes:</td>\n",
       "<td>1</td></tr>\n",
       "<tr><td>H2O_cluster_free_memory:</td>\n",
       "<td>26.64 Gb</td></tr>\n",
       "<tr><td>H2O_cluster_total_cores:</td>\n",
       "<td>64</td></tr>\n",
       "<tr><td>H2O_cluster_allowed_cores:</td>\n",
       "<td>64</td></tr>\n",
       "<tr><td>H2O_cluster_status:</td>\n",
       "<td>locked, healthy</td></tr>\n",
       "<tr><td>H2O_connection_url:</td>\n",
       "<td>http://127.0.0.1:54321</td></tr>\n",
       "<tr><td>H2O_connection_proxy:</td>\n",
       "<td>{\"http\": null, \"https\": null}</td></tr>\n",
       "<tr><td>H2O_internal_security:</td>\n",
       "<td>False</td></tr>\n",
       "<tr><td>Python_version:</td>\n",
       "<td>3.7.10 final</td></tr></table></div>"
      ],
      "text/plain": [
       "--------------------------  -------------------------------\n",
       "H2O_cluster_uptime:         02 secs\n",
       "H2O_cluster_timezone:       UTC\n",
       "H2O_data_parsing_timezone:  UTC\n",
       "H2O_cluster_version:        3.36.1.2\n",
       "H2O_cluster_version_age:    20 days\n",
       "H2O_cluster_name:           H2O_from_python_ec2_user_t9z3ig\n",
       "H2O_cluster_total_nodes:    1\n",
       "H2O_cluster_free_memory:    26.64 Gb\n",
       "H2O_cluster_total_cores:    64\n",
       "H2O_cluster_allowed_cores:  64\n",
       "H2O_cluster_status:         locked, healthy\n",
       "H2O_connection_url:         http://127.0.0.1:54321\n",
       "H2O_connection_proxy:       {\"http\": null, \"https\": null}\n",
       "H2O_internal_security:      False\n",
       "Python_version:             3.7.10 final\n",
       "--------------------------  -------------------------------"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n",
      "INFO: benchmark_utils.py: (313060, 194)\n",
      "INFO: benchmark_utils.py: (27330, 71)\n",
      "INFO: benchmark_utils.py: (29527, 2)\n",
      "INFO: benchmark_utils.py: (27329, 72)\n",
      "INFO: benchmark_utils.py: 67\n",
      "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n",
      "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n",
      "AutoML progress: |\n",
      "17:56:29.266: Project: AutoML_1_20220615_175629\n",
      "17:56:29.267: 5-fold cross-validation will be used.\n",
      "17:56:29.270: Setting stopping tolerance adaptively based on the training frame: 0.0017872537194430643\n",
      "17:56:29.271: Build control seed: 10\n",
      "17:56:29.274: training frame: Frame key: AutoML_1_20220615_175629_training_py_11_sid_9519    cols: 194    rows: 313060  chunks: 68    size: 137888486  checksum: -8673498857111412012\n",
      "17:56:29.274: validation frame: NULL\n",
      "17:56:29.274: leaderboard frame: NULL\n",
      "17:56:29.275: blending frame: NULL\n",
      "17:56:29.275: response column: EVENT_LABEL\n",
      "17:56:29.275: fold column: null\n",
      "17:56:29.275: weights column: null\n",
      "17:56:29.289: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (6g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_annealing (6g, 10w)]}, {DeepLearning : [def_1 (3g, 10w), grid_1 (4g, 30w), grid_2 (5g, 30w), grid_3 (5g, 30w)]}, {completion : [resume_best_grids (10g, 60w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w), best_of_family_2 (2g, 5w), best_of_family_3 (3g, 5w), best_of_family_4 (4g, 5w), best_of_family_5 (5g, 5w), all_2 (2g, 10w), all_3 (3g, 10w), all_4 (4g, 10w), all_5 (5g, 10w), monotonic (6g, 10w), best_of_family_gbm (6g, 10w), all_gbm (7g, 10w), best_of_family_xglm (8g, 10w), all_xglm (8g, 10w), best_of_family (10g, 10w), best_N (10g, 10w)]}]\n",
      "17:56:29.317: AutoML job created: 2022.06.15 17:56:29.241\n",
      "17:56:29.318: AutoML build started: 2022.06.15 17:56:29.317\n",
      "17:56:29.402: AutoML: starting XGBoost_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "████████████████\n",
      "18:10:12.338: New leader: XGBoost_1_AutoML_1_20220615_175629, auc: 0.9547716353973954\n",
      "18:10:12.344: AutoML: starting GLM_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "██\n",
      "18:12:16.9: AutoML: starting GBM_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "███\n",
      "18:15:33.456: New leader: GBM_1_AutoML_1_20220615_175629, auc: 0.9552574031901246\n",
      "18:15:33.476: AutoML: starting StackedEnsemble_BestOfFamily_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "\n",
      "18:15:37.718: New leader: StackedEnsemble_BestOfFamily_1_AutoML_1_20220615_175629, auc: 0.9629154248601777\n",
      "18:15:37.722: AutoML: starting XGBoost_2_AutoML_1_20220615_175629 model training\n",
      "\n",
      "███████\n",
      "18:22:03.32: AutoML: starting DRF_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "█\n",
      "18:23:30.486: AutoML: starting GBM_2_AutoML_1_20220615_175629 model training\n",
      "\n",
      "██\n",
      "18:25:25.304: AutoML: starting GBM_3_AutoML_1_20220615_175629 model training\n",
      "\n",
      "██\n",
      "18:27:17.200: AutoML: starting GBM_4_AutoML_1_20220615_175629 model training\n",
      "\n",
      "██\n",
      "18:29:30.748: AutoML: starting StackedEnsemble_BestOfFamily_2_AutoML_1_20220615_175629 model training\n",
      "\n",
      "\n",
      "18:29:34.124: New leader: StackedEnsemble_BestOfFamily_2_AutoML_1_20220615_175629, auc: 0.9629166327028895\n",
      "18:29:34.130: AutoML: starting StackedEnsemble_AllModels_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "\n",
      "18:29:37.696: New leader: StackedEnsemble_AllModels_1_AutoML_1_20220615_175629, auc: 0.9641251732621091\n",
      "18:29:37.699: AutoML: starting XGBoost_3_AutoML_1_20220615_175629 model training\n",
      "\n",
      "████\n",
      "18:33:45.401: AutoML: starting XRT_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "██\n",
      "18:35:01.402: AutoML: starting GBM_5_AutoML_1_20220615_175629 model training\n",
      "\n",
      "██\n",
      "18:37:04.114: AutoML: starting DeepLearning_1_AutoML_1_20220615_175629 model training\n",
      "\n",
      "████████\n",
      "18:45:23.506: AutoML: starting StackedEnsemble_BestOfFamily_3_AutoML_1_20220615_175629 model training\n",
      "\n",
      "\n",
      "18:45:26.972: AutoML: starting StackedEnsemble_AllModels_2_AutoML_1_20220615_175629 model training\n",
      "\n",
      "\n",
      "18:45:30.973: AutoML: starting XGBoost_grid_1_AutoML_1_20220615_175629 hyperparameter search\n",
      "\n",
      "██████\n",
      "18:50:41.172: AutoML: starting GBM_grid_1_AutoML_1_20220615_175629 hyperparameter search\n",
      "\n",
      "███\n",
      "18:54:01.180: AutoML: starting DeepLearning_grid_1_AutoML_1_20220615_175629 hyperparameter search\n",
      "\n",
      "███| (done) 100%\n",
      "\n",
      "18:56:30.999: Actual modeling steps: [{XGBoost : [def_2 (1g, 10w)]}, {GLM : [def_1 (1g, 10w)]}, {GBM : [def_5 (1g, 10w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w)]}, {XGBoost : [def_1 (2g, 10w)]}, {DRF : [def_1 (2g, 10w)]}, {GBM : [def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w)]}, {StackedEnsemble : [best_of_family_2 (2g, 5w), all_2 (2g, 10w)]}, {XGBoost : [def_3 (3g, 10w)]}, {DRF : [XRT (3g, 10w)]}, {GBM : [def_1 (3g, 10w)]}, {DeepLearning : [def_1 (3g, 10w)]}, {StackedEnsemble : [best_of_family_3 (3g, 5w), all_3 (3g, 10w)]}, {XGBoost : [grid_1 (4g, 90w)]}, {GBM : [grid_1 (4g, 60w)]}, {DeepLearning : [grid_1 (4g, 30w)]}]\n",
      "18:56:30.999: AutoML build stopped: 2022.06.15 18:56:30.999\n",
      "18:56:30.999: AutoML build done: built 16 models\n",
      "18:56:30.999: AutoML duration:  1:00:01.682\n",
      "\n",
      "stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO: benchmark_h2o.py: auc on test data: 0.8816474193300993\n",
      "INFO: benchmark_h2o.py: tpr@1%fpr on test data: 0.43807763401109057\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "H2O session _sid_9519 closed.\n"
     ]
    }
   ],
   "source": [
    "run_h2o(dataset, BASE_PATH, time_limit=3600)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0a81d69",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70efd958",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_mxnet_latest_p37",
   "language": "python",
   "name": "conda_mxnet_latest_p37"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: scripts/reproducibility/label-noise/benchmark_experiments.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c77e5eb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "#! pip install humanize\n",
    "#! pip install catboost"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8bd366d",
   "metadata": {},
   "source": [
    "# Label noise\n",
    "\n",
    "\n",
    "## Problem statement \n",
    "Have some binary classification task, traditionally assume data of the form X,y\n",
    "\n",
    "In reality, some of the labels may be incorrect, distinguish\n",
    "```\n",
    "y - true label\n",
    "y* - observed, possibly incorrect label\n",
    "```\n",
    "\n",
    "This can obviously effect model training, validation. Would also effect benchmarking process (comparing performance on noisy data doesn't tell you about performance on actual data).\n",
    "\n",
    "## Types of noise\n",
    "\n",
    "Can be completely independent:\n",
    "`p(y* != y | x, y) = p(y* != y)`\n",
    "\n",
    "class-dependent, depends on y:\n",
    "`p(y* != y | x, y) = p(y* != y | y)`\n",
    "\n",
    "feature-dependent, depends on x:\n",
    "`p(y* != y | x, y) = p(y* != y | x, y)`\n",
    "\n",
    "In fraud modeling, higher likelihood of `(y*, y) = (0, 1)` than reverse.\n",
    "(missed fraud, label maturity, intentional data poisoning, etc.)\n",
    "\n",
    "\"feature-dependent\" is probably most realistic in fraud but fewer removal techniques and also harder to synthetically generate. We will work with \"boundary conditional\" noise, probability of being mislabeled is weighted by distance from some decision boundary (score from model trained on clean data), implemented in scikit-clean.\n",
    "\n",
    "## Literature/packages\n",
    "\n",
    "Many methods in the literature to address this; can build loss functions that are robust to noise, can try to identify and filter (remove) or clean (flip label) examples identified as noisy.\n",
    "\n",
    "Some packages including CleanLab and scikit-clean. Can also hand-code an ensemble method. Most of these are model-agnostic."
   ]
  },
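  {
   "cell_type": "markdown",
   "id": "noise-sketch-01",
   "metadata": {},
   "source": [
    "The class-dependent case above can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions: the function name `flip_labels_class_dependent` and the parameters `p_miss` and `p_false_alarm` are hypothetical and are not part of this repository or of scikit-clean, and the experiments below use the noise generator from scikit-clean rather than this scheme.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def flip_labels_class_dependent(y, p_miss, p_false_alarm, seed=0):\n",
    "    '''Return a noisy copy of the binary labels y (hypothetical helper).\n",
    "\n",
    "    p_miss: probability that a true 1 is observed as 0.\n",
    "    p_false_alarm: probability that a true 0 is observed as 1.\n",
    "    '''\n",
    "    rng = np.random.default_rng(seed)\n",
    "    y = np.asarray(y)\n",
    "    # the flip probability depends only on the true class\n",
    "    flip_prob = np.where(y == 1, p_miss, p_false_alarm)\n",
    "    flips = rng.random(y.shape[0]) < flip_prob\n",
    "    return np.where(flips, 1 - y, y)\n",
    "\n",
    "# toy labels: 10% fraud\n",
    "y_true = np.array([0] * 900 + [1] * 100)\n",
    "y_obs = flip_labels_class_dependent(y_true, p_miss=0.3, p_false_alarm=0.01)\n",
    "print('true fraud rate:    ', y_true.mean())\n",
    "print('observed fraud rate:', y_obs.mean())\n",
    "```\n",
    "\n",
    "Setting `p_miss` much higher than `p_false_alarm` mimics the fraud setting, where missed fraud is the more common labeling error."
   ]
  },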
  {
   "cell_type": "markdown",
   "id": "9b172deb",
   "metadata": {},
   "source": [
    "## CleanLab\n",
    "\n",
    "well-established, state of the art, open source package with some theoretical guarantees\n",
    "\n",
    "score all examples with y* = 1, determine average score t_1\n",
    "now score all examples with y* = 0. Any that score above t_1 are marked as noise\n",
    "\n",
    "can wrap any (sklearn-compatible) model with this process. \n",
    "\n",
    "## scikit-clean \n",
    "\n",
    "library of several different approaches including filtering as well as noise generation. Is similarly designed to be model-agnostic but doesn't always do a great job (doesn't handle unencoded categorical features well). Some of its methods can also be *very* slow relative to others\n",
    "\n",
    "## micro-models\n",
    "\n",
    "slice up training data, train a model on each slice, let models vote on whether to remove data. Can use majority (more than half of models \"misclassify\" example), consensus (all models misclassify) or any other threshold.\n",
    "\n",
    "## experiment design\n",
    "\n",
    "take 7 of the datasets - [‘ieeecis’, ‘ccfraud’, ‘fraudecom’, ‘sparknov’, ‘fakejob’, ‘vehicleloan’,‘twitterbot’]\n",
    "* drop IP and malurl dataset as they are difficult to work with \"out of the box\"\n",
    "* use numerical and categorical features, target-encode categorical features (drop text and enrichable features)\n",
    "\n",
    "add boundary-conditional noise `n` to training data (flipping both classes).\n",
    "\n",
    "values: `n in [0, 0.1, 0.2, 0.3, 0.4, 0.5]`\n",
    "    \n",
    "target encoding is done after noise is added\n",
    "    \n",
    "Catboost used as base classifier in all cases (with default settings)\n",
    "\n",
    "compare following methods for cleaning training data\n",
    "* baseline (no cleaning done)\n",
    "* CleanLab\n",
    "* scikit-clean MCS \n",
    "* micro-model majority voting (hand-built)\n",
    "* micro-model consensus voting (hand-built)\n",
    "\n",
    "measure AUC on (clean) test data\n",
    "\n",
    "repeat process 5 times for each experiment (start with clean data, add random noise, filter noise back out, train classifier, etc.), compute mean and std. dev of AUC for each\n",
    "\n",
    "CleanLab usually winds up being the best, but not uniformly. Baseline is sometimes the best for zero noise (as expected), and sometimes MCS or micro-model majority will come out ahead"
   ]
  },
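  {
   "cell_type": "markdown",
   "id": "micromodel-sketch-01",
   "metadata": {},
   "source": [
    "The micro-model voting idea described above can be sketched in a few lines. This is a minimal illustration, not the `MicroModelCleaner` implementation in `micro_models.py`: the function `micro_model_keep_mask`, its `min_votes` parameter, the use of logistic regression as the micro-model, and the synthetic data are all assumptions made for the example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "def micro_model_keep_mask(X, y, num_clfs=8, min_votes=5, seed=0):\n",
    "    '''Return a boolean mask of the examples to keep (hypothetical helper).\n",
    "\n",
    "    An example is removed when at least min_votes of the num_clfs\n",
    "    micro-models predict a label different from its observed label.\n",
    "    '''\n",
    "    X, y = np.asarray(X), np.asarray(y)\n",
    "    rng = np.random.default_rng(seed)\n",
    "    # shuffle the row indices and split them into num_clfs disjoint slices\n",
    "    slices = np.array_split(rng.permutation(len(y)), num_clfs)\n",
    "    votes = np.zeros(len(y), dtype=int)\n",
    "    for idx in slices:\n",
    "        # assumption: every slice contains both classes, otherwise fit() fails\n",
    "        clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])\n",
    "        votes += (clf.predict(X) != y).astype(int)\n",
    "    return votes < min_votes\n",
    "\n",
    "# synthetic, well-separated data with 10% of the labels flipped\n",
    "rng = np.random.default_rng(1)\n",
    "X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])\n",
    "y_true = np.array([0] * 200 + [1] * 200)\n",
    "y_obs = y_true.copy()\n",
    "flipped = rng.choice(400, size=40, replace=False)\n",
    "y_obs[flipped] = 1 - y_obs[flipped]\n",
    "\n",
    "keep_majority = micro_model_keep_mask(X, y_obs, min_votes=5)   # more than half of 8\n",
    "keep_consensus = micro_model_keep_mask(X, y_obs, min_votes=8)  # all 8 models\n",
    "print('kept (majority): ', keep_majority.sum())\n",
    "print('kept (consensus):', keep_consensus.sum())\n",
    "```\n",
    "\n",
    "Consensus voting is the more conservative rule: it never removes an example that majority voting would keep, so it discards less clean data but also leaves more noise behind."
   ]
  },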
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "846f161f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# basic imports\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import humanize\n",
    "import pickle\n",
    "\n",
    "# basics from sklearn\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from category_encoders.target_encoder import TargetEncoder\n",
    "\n",
    "# noise generation\n",
    "from skclean.simulate_noise import flip_labels_cc, BCNoise\n",
    "\n",
    "# base classifiers\n",
    "from catboost import CatBoostClassifier\n",
    "\n",
    "# cleaning methods/helpers\n",
    "from cleanlab.classification import CleanLearning\n",
    "from micro_models import MicroModelCleaner\n",
    "from skclean.pipeline import Pipeline\n",
    "from skclean.handlers import Filter\n",
    "from skclean.detectors import MCS\n",
    "\n",
    "# dataset loader\n",
    "from load_fdb_datasets import prepare_noisy_dataset, dataset_stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85117ba5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# wrapper definitions for the various types of cleaning methods we will use. \n",
    "# Each one wraps a model_class (in our case catboost, but could use xgboost, etc.)\n",
    "# resulting model_class can then take noisy data in its .fit() method and clean before training\n",
    "\n",
    "def baseline_model(model_class, params):\n",
    "    return model_class(**params)\n",
    "\n",
    "def cleanlab_model(model_class, params, pulearning=False):\n",
    "    if pulearning:\n",
    "        return CleanLearning(model_class(**params), pulearning=pulearning)\n",
    "    else:\n",
    "        return CleanLearning(model_class(**params))\n",
    "    \n",
    "def micromodels(model_class, pulearning, num_clfs, threshold, params):\n",
    "    return MicroModelCleaner(model_class, pulearning=pulearning, num_clfs=num_clfs, threshold=threshold, **params)\n",
    "\n",
    "def skclean_MCS(model_class, params):\n",
    "    skclean_pipeline = Pipeline([\n",
    "        ('detector',MCS(classifier=model_class(**params))),\n",
    "        ('handler',Filter(model_class(**params)))\n",
    "    ])\n",
    "    return skclean_pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd6bcd08",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# some high-level parameters, \n",
    "# the number of runs for each experiment (determine mean/std. dev)\n",
    "num_samples = 5 \n",
    "# whether to use target encoding on categorical features\n",
    "target_encoding = True\n",
    "# whether to save intermediate results to disk (in case of failure etc.)\n",
    "save_results = True\n",
    "\n",
    "# we will be creating a lot of classifiers, let's use the same parameters for each\n",
    "model_config_dict = {\n",
    "    'catboost': {\n",
    "        'model_class': CatBoostClassifier,\n",
    "        'default_params': {\n",
    "            'verbose': False,\n",
    "            'iterations': 100\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "# all of our experiments will use catboost and boundary-consistent noise\n",
    "base_model_type = 'catboost'\n",
    "noise_type = 'boundary-consistent'\n",
    "model_class = model_config_dict[base_model_type]['model_class']\n",
    "\n",
    "# the set of experimental parameters, we will iterate over all these datasets\n",
    "keys = ['ieeecis', 'sparknov', 'ccfraud', 'fraudecom', 'fakejob', 'vehicleloan', 'twitterbot']\n",
    "# all these cleaning methods\n",
    "clf_types = ['baseline', 'skclean_MCS', 'cleanlab', 'micromodels_majority', 'micromodels_consensus']\n",
    "# all these noise levels\n",
    "noise_amounts = [0, 0.1, 0.2, 0.3, 0.4, 0.5]\n",
    "# and we will let cleaning methods know that noise can happen for either class\n",
    "pulearning = None\n",
    "\n",
    "# a little bit of setup for saving intermediate results to disk\n",
    "if save_results:\n",
    "    results_file_path = './results'\n",
    "    results_file_name = '{}_noise_benchmark_results.pkl'\n",
    "    try:\n",
    "        os.mkdir(results_file_path)\n",
    "    except OSError as error:\n",
    "        print(error) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef2e3bd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize results dict, we will index results by dataset/noise_amount/cleaning_method\n",
    "results = {}\n",
    "\n",
    "# main experimental loop   \n",
    "for key in keys:\n",
    "    # check to see if we have already run this experiment and saved to disk\n",
    "    full_result_path = os.path.join(results_file_path,results_file_name.format(key))\n",
    "    if os.path.exists(full_result_path) and save_results:\n",
    "        with open(full_result_path, 'rb') as results_file:\n",
    "            results[key] = pickle.load(results_file)\n",
    "    # otherwise start from scratch\n",
    "    else:\n",
    "        # initialize sub-results\n",
    "        results[key] = {}\n",
    "        model_params = model_config_dict[base_model_type]['default_params']\n",
    "        \n",
    "        for noise_amount in noise_amounts:\n",
    "            print(f\"\\n =={key}_{noise_amount}== \\n\")\n",
    "            \n",
    "            # initialize sub-sub-results\n",
    "            results[key][noise_amount] = {}\n",
    "\n",
    "            # these are the cleaning classifiers we will use\n",
    "            clfs = {\n",
    "                'baseline': baseline_model(model_class, model_params),\n",
    "                'skclean_MCS': skclean_MCS(model_class, model_params),\n",
    "                'cleanlab': cleanlab_model(model_class, model_params, pulearning),\n",
    "                'micromodels_majority': micromodels(model_class, pulearning=pulearning,\n",
    "                                                    num_clfs=8, threshold=0.5, params=model_params),\n",
    "                'micromodels_consensus': micromodels(model_class, pulearning=pulearning,\n",
    "                                                     num_clfs=8, threshold=1, params=model_params),\n",
    "\n",
    "            }\n",
    "            print('generating datasets')\n",
    "            # preparing a dataset has some overhead, we want to do this five times for each dataset/noise level\n",
    "            # we will save a little bit of time by doing this in advance and using same set of five\n",
    "            # for each cleaning method\n",
    "            datasets = [prepare_noisy_dataset(key, noise_type, noise_amount, split=1, target_encoding=target_encoding) \n",
    "                        for i in range(num_samples)]\n",
    "            \n",
    "            # now for each cleaning method, train a \"clean\" model on noisy training data, then determine\n",
    "            # auc on clean test data and record the results. Do this five times for each cleaning method\n",
    "            # to determine mean/std. dev\n",
    "            for clf_type in clfs:\n",
    "                print(f\"testing {clf_type}\")\n",
    "                auc = []\n",
    "                try:\n",
    "                    for i in range(num_samples):\n",
    "                        # grab the dataset we need for this run and extract metadata and subsets\n",
    "                        dataset = datasets[i]\n",
    "                        features, cat_features, label = dataset['features'], dataset['cat_features'], dataset['label']\n",
    "                        train, test = dataset['train'], dataset['test']\n",
    "                        X_tr, y_tr = train[features], train[label].values.reshape(-1)\n",
    "                        X_ts, y_ts = test[features], test[label].values.reshape(-1)\n",
    "                        clf = clfs[clf_type]\n",
    "                        # fit the \"clean\" classifier on noisy training data\n",
    "                        clf.fit(X_tr, y_tr)\n",
    "                        # make predictions on clean test data and calculate AUC\n",
    "                        y_pred = clf.predict_proba(X_ts)[:, 1]\n",
    "                        auc.append(roc_auc_score(y_ts, y_pred))\n",
    "                        print(f\"{clf_type} auc: {auc}\", end=\"\\r\", flush=True)\n",
    "                    # store mean/std. dev for this run in the results dict\n",
    "                    results[key][noise_amount][clf_type] = (np.mean(auc), np.std(auc), auc)\n",
    "                    print('\\n{} auc: {:.2f} ± {:.4f}\\n'.format(clf_type,\n",
    "                                                               *results[key][noise_amount][clf_type][:2]))\n",
    "                # if this run failed for some reason, handle it gracefully\n",
    "                except Exception as e:\n",
    "                    results[key][noise_amount][clf_type] = (0, 0, [0] * num_samples)\n",
    "                    print(e)\n",
    "    \n",
    "    # if we are saving intermediate results to disk, do so now\n",
    "    if save_results:\n",
    "        with open(full_result_path, 'wb') as results_file:\n",
    "            pickle.dump(results[key], results_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a8a4509",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# a couple of helper functions to analyze/summarize results\n",
    "\n",
    "def highlight_max(s, props=''):\n",
    "    return np.where(s == np.nanmax(s.values), props, '')\n",
    "\n",
    "def record_places(places, scores):\n",
    "    scores = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}\n",
    "    last_score, last_stddev, last_placement = (2, 0, 1)\n",
    "    for i, clf in enumerate(scores.keys()): \n",
    "        if scores[clf][0] + scores[clf][1] >= last_score:\n",
    "            placement = last_placement                          \n",
    "        else:\n",
    "            placement = i+1\n",
    "            last_score, last_stddev = scores[clf]            \n",
    "            last_placement = i+1\n",
    "        places[clf][placement] += 1 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fa49c8e",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# create dataframe of results for each experiment, also process results into dict for keeping track of \n",
    "# 1st/2nd/etc. place, as well as a dict for plotting later\n",
    "\n",
    "places = {clf:{p:0 for p in range(1,len(clf_types)+1)} for clf in clf_types}\n",
    "plots = {key:{clf:[[],[]] for clf in clf_types} for key in keys}\n",
    "        \n",
    "for key in results.keys():\n",
    "    print(f\"\\n =={key}==\\n\")\n",
    "    rows = pd.Index([clf_type for clf_type in clf_types])\n",
    "    columns = pd.MultiIndex.from_product([noise_amounts, ['mean','std_dev']], names=['type 2 noise', 'auc'])\n",
    "    df = pd.DataFrame(index=rows, columns=columns)\n",
    "    \n",
    "    for noise_amount in noise_amounts:\n",
    "        scores = {}\n",
    "        for clf_type in clf_types:\n",
    "            auc = results[key][noise_amount][clf_type]  \n",
    "            df.loc[clf_type, (noise_amount, 'mean')] = auc[0] \n",
    "            df.loc[clf_type, (noise_amount, 'std_dev')] = auc[1]\n",
    "            scores[clf_type] = (auc[0], auc[1])\n",
    "\n",
    "            plots[key][clf_type][0].append(noise_amount)\n",
    "            plots[key][clf_type][1].append(auc[0])\n",
    "        record_places(places, scores)\n",
    "    display(df.style.set_caption(f\"{key}\")\n",
    "            .format({(n,'mean'): \"{:.2f}\" for n in noise_amounts})\n",
    "            .format({(n,'std_dev'): \"{:.4f}\" for n in noise_amounts})\n",
    "            .apply(highlight_max, props='font-weight:bold;background-color:lightblue', axis=0,\n",
    "                  subset=[[n,'mean'] for n in noise_amounts]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cb8dbd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# produce \"race results\" (i.e. how many first place, second place, etc. finishes)\n",
    "\n",
    "race_results = pd.DataFrame.from_dict(places).rename(index=lambda x : humanize.ordinal(x))\n",
    "race_results['totals'] = race_results.sum(axis=1)\n",
    "display(race_results)\n",
    "print(race_results.to_latex())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "602877ee",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# finally, we can plot the results of individual experiments\n",
    "\n",
    "colors = ['black','purple','green','red','orange']\n",
    "linestyles = ['-','--',':']\n",
    "ylims = {\n",
    "    'boundary-consistent': {\n",
    "        'ieeecis':[0.5,0.9],\n",
    "        'sparknov':[0.5,1],\n",
    "        'ccfraud':[0.25,1],\n",
    "        'fraudecom':[0.48,0.52],\n",
    "        'fakejob':[0.5,1],\n",
    "        'vehicleloan':[0.57,0.66],\n",
    "        'twitterbot':[0.7,0.95]\n",
    "    },\n",
    "    'class-conditional': {\n",
    "        'ieeecis':[0.7,0.9],\n",
    "        'sparknov':[0.7,1],\n",
    "        'ccfraud':[0.8,1],\n",
    "        'fraudecom':[0.48,0.52],\n",
    "        'fakejob':[0.7,1],\n",
    "        'vehicleloan':[0.5,0.7],\n",
    "        'twitterbot':[0.8,0.95]\n",
    "    }\n",
    "}\n",
    "\n",
    "x_labels = {\n",
    "    'boundary-consistent':'Boundary-Consistent Noise Level',\n",
    "    'class-conditional':'Class-Conditional Type 2 Noise Level'\n",
    "}\n",
    "\n",
    "legends = {\n",
    "    'boundary-consistent':'Cleaning Method',\n",
    "    'class-conditional':'Type 1 Noise, Cleaning Method'\n",
    "}\n",
    "def fix_failures(x):\n",
    "    if x == 0:\n",
    "        return None\n",
    "    else:\n",
    "        return x\n",
    "\n",
    "def labels(noise_type, noise_amount, clf_type):\n",
    "    if noise_type == 'boundary-consistent':\n",
    "        return '{}'.format(clf_type)\n",
    "    elif noise_type == 'class-conditional':\n",
    "        return '{}, {}'.format(noise_amount, clf_type)\n",
    "\n",
    "for key in results.keys():\n",
    "    plt.figure(figsize=(10,10))\n",
    "    \n",
    "    for c, clf_type in enumerate(clf_types):\n",
    "        a = plots[key][clf_type]\n",
    "        plt.plot(a[0],[fix_failures(c) for c in a[1]],\n",
    "                 label=labels(noise_type, noise_amount, clf_type),\n",
    "                 color=colors[c],\n",
    "                 linestyle=linestyles[0])\n",
    "    plt.title(key)\n",
    "    plt.xlabel(x_labels[noise_type])\n",
    "    plt.ylabel('Test AUC')\n",
    "    plt.ylim(ylims[noise_type][key])\n",
    "    plt.legend(title=legends[noise_type])\n",
    "    plt.savefig(f\"./figures/label_noise_{key}.png\")\n",
    "    plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: scripts/reproducibility/label-noise/feature_dict.py
================================================
feature_dict = {
  'ieeecis': {
    'transactionamt': 'numeric',
    'productcd': 'categorical',
    'card1': 'numeric',
    'card2': 'numeric',
    'card3': 'numeric',
    'card5': 'numeric',
    'card6': 'categorical',
    'addr1': 'numeric',
    'dist1': 'numeric',
    'p_emaildomain': 'categorical',
    'r_emaildomain': 'categorical',
    'c1': 'numeric',
    'c2': 'numeric',
    'c4': 'numeric',
    'c5': 'numeric',
    'c6': 'numeric',
    'c7': 'numeric',
    'c8': 'numeric',
    'c9': 'numeric',
    'c10': 'numeric',
    'c11': 'numeric',
    'c12': 'numeric',
    'c13': 'numeric',
    'c14': 'numeric',
    'v62': 'numeric',
    'v70': 'numeric',
    'v76': 'numeric',
    'v78': 'numeric',
    'v82': 'numeric',
    'v91': 'numeric',
    'v127': 'numeric',
    'v130': 'numeric',
    'v139': 'numeric',
    'v160': 'numeric',
    'v165': 'numeric',
    'v187': 'numeric',
    'v203': 'numeric',
    'v207': 'numeric',
    'v209': 'numeric',
    'v210': 'numeric',
    'v221': 'numeric',
    'v234': 'numeric',
    'v257': 'numeric',
    'v258': 'numeric',
    'v261': 'numeric',
    'v264': 'numeric',
    'v266': 'numeric',
    'v267': 'numeric',
    'v271': 'numeric',
    'v274': 'numeric',
    'v277': 'numeric',
    'v283': 'numeric',
    'v285': 'numeric',
    'v289': 'numeric',
    'v291': 'numeric',
    'v294': 'numeric',
    'id_01': 'numeric',
    'id_02': 'numeric',
    'id_05': 'numeric',
    'id_06': 'numeric',
    'id_09': 'numeric',
    'id_13': 'numeric',
    'id_17': 'numeric',
    'id_19': 'numeric',
    'id_20': 'numeric',
    'devicetype': 'categorical',
    'deviceinfo': 'categorical'
  },
  'ccfraud': {
    'v1': 'numeric',
    'v2': 'numeric',
    'v3': 'numeric',
    'v4': 'numeric',
    'v5': 'numeric',
    'v6': 'numeric',
    'v7': 'numeric',
    'v8': 'numeric',
    'v9': 'numeric',
    'v10': 'numeric',
    'v11': 'numeric',
    'v12': 'numeric',
    'v13': 'numeric',
    'v14': 'numeric',
    'v15': 'numeric',
    'v16': 'numeric',
    'v17': 'numeric',
    'v18': 'numeric',
    'v19': 'numeric',
    'v20': 'numeric',
    'v21': 'numeric',
    'v22': 'numeric',
    'v23': 'numeric',
    'v24': 'numeric',
    'v25': 'numeric',
    'v26': 'numeric',
    'v27': 'numeric',
    'v28': 'numeric',
    'amount': 'numeric'
  },
  'fraudecom': {
    'purchase_value': 'numeric',
    'source': 'categorical',
    'browser': 'categorical',
    'age': 'numeric',
    'ip_address': 'enrichable',
    'time_since_signup': 'numeric'
  },
  'sparknov': {
    'cc_num': 'categorical',
    'category': 'categorical',
    'amt': 'numeric',
    'first': 'categorical',
    'last': 'categorical',
    'gender': 'categorical',
    'street': 'categorical',
    'city': 'categorical',
    'state': 'categorical',
    'zip': 'categorical',
    'lat': 'numeric',
    'long': 'numeric',
    'city_pop': 'numeric',
    'job': 'categorical',
    'dob': 'text',
    'merch_lat': 'numeric',
    'merch_long': 'numeric'
  },
  'twitterbot': {
    'created_at' : 'text',
    'default_profile': 'categorical',
    'default_profile_image': 'categorical',
    'description': 'text',
    'favourites_count': 'numeric',
    'followers_count': 'numeric',
    'friends_count': 'numeric',
    'geo_enabled': 'categorical',
    'lang': 'categorical',
    'location': 'categorical',
    'profile_background_image_url': 'text',
    'profile_image_url': 'text',
    'screen_name': 'text',
    'statuses_count': 'numeric',
    'verified': 'categorical',
    'average_tweets_per_day': 'numeric',
    'account_age_days': 'numeric'
  },
  'fakejob': {
    'title': 'categorical',
    'location': 'categorical',
    'department': 'categorical',
    'salary_range': 'text',
    'company_profile': 'text',
    'description': 'text',
    'requirements': 'text',
    'benefits': 'text',
    'telecommuting': 'categorical',
    'has_company_logo': 'categorical',
    'has_questions': 'categorical',
    'employment_type': 'categorical',
    'required_experience': 'categorical',
    'required_education': 'categorical',
    'industry': 'categorical',
    'function': 'categorical'
  },
  'vehicleloan': {
    'disbursed_amount': 'numeric',
    'asset_cost': 'numeric',
    'ltv': 'numeric',
    'branch_id': 'categorical',
    'supplier_id': 'categorical',
    'manufacturer_id': 'categorical',
    'current_pincode_id': 'categorical',
    'date_of_birth': 'text',
    'employment_type': 'categorical',
    'state_id': 'categorical',
    'employee_code_id': 'categorical',
    'mobileno_avl_flag': 'categorical',
    'aadhar_flag': 'categorical',
    'pan_flag': 'categorical',
    'voterid_flag': 'categorical',
    'driving_flag': 'categorical',
    'passport_flag': 'categorical',
    'perform_cns_score': 'numeric',
    'perform_cns_score_description': 'categorical',
    'pri_no_of_accts': 'numeric',
    'pri_active_accts': 'numeric',
    'pri_overdue_accts': 'numeric',
    'pri_current_balance': 'numeric',
    'pri_sanctioned_amount': 'numeric',
    'pri_disbursed_amount': 'numeric',
    'sec_no_of_accts': 'numeric',
    'sec_active_accts': 'numeric',
    'sec_overdue_accts': 'numeric',
    'sec_current_balance': 'numeric',
    'sec_sanctioned_amount': 'numeric',
    'sec_disbursed_amount': 'numeric',
    'primary_instal_amt': 'numeric',
    'sec_instal_amt': 'numeric',
    'new_accts_in_last_six_months': 'numeric',
    'delinquent_accts_in_last_six_months': 'numeric',
    'average_acct_age': 'text',
    'credit_history_length': 'text',
    'no_of_inquiries': 'numeric'
  }
}
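
# Illustrative sketch, not part of the original module: load_fdb_datasets.py
# splits each dataset's features by their declared type ('numeric',
# 'categorical', 'text', 'enrichable'). The helper name below is hypothetical
# and exists only to show how the mapping above can be consumed.
def _group_features_by_type(dataset_features):
    """Return {type_name: [feature, ...]} for one dataset's feature mapping."""
    grouped = {}
    for name, kind in dataset_features.items():
        # dicts preserve insertion order, so features keep their listed order
        grouped.setdefault(kind, []).append(name)
    return grouped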

================================================
FILE: scripts/reproducibility/label-noise/load_fdb_datasets.py
================================================
import os
import re
import json
import pandas as pd
import numpy as np
import warnings
from datetime import datetime

from category_encoders.target_encoder import TargetEncoder
from skclean.simulate_noise import flip_labels_cc, BCNoise

from fdb.datasets import FraudDatasetBenchmark

import feature_dict

DATASET_PATH = './data/dataset.csv'
METADATA_PATH = './data/feature_metadata.json'
FD = feature_dict.feature_dict

def noise_amount(df):
    return df[df.noise == 1].shape[0]

def noise_rate(df):
    if df.shape[0] > 0:
        return noise_amount(df)/df.shape[0]
    else:
        return None

def type_1_noise_amount(df):
    # examples with true label 0, mislabeled as 1
    # here 'df.label' is the observed label, not the true one
    return df[(df.label==1) & (df.noise == 1)].shape[0]

def type_2_noise_amount(df):
    # examples with true label 1, mislabeled as 0
    # here 'df.label' is the observed label, not the true one
    return df[(df.label==0) & (df.noise == 1)].shape[0]

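def actual_legit_amount(df):
    # truly legit rows: observed 0 and untouched, or observed 1 because a true 0 was flipped
    return df[((df.label == 0) & (df.noise == 0)) | ((df.label == 1) & (df.noise == 1))].shape[0]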

def observed_legit_amount(df):
    return df[df.label == 0].shape[0]

def actual_fraud_amount(df):
    return df[((df.label == 1) & (df.noise == 0)) | ((df.label == 0) & (df.noise == 1))].shape[0]

def observed_fraud_amount(df):
    return df[df.label == 1].shape[0]

def actual_fraud_rate(df):
    if df.shape[0] > 0:
        return actual_fraud_amount(df)/df.shape[0]
    else:
        return None

def observed_fraud_rate(df):
    if df.shape[0] > 0:
        return observed_fraud_amount(df)/df.shape[0]
    else:
        return None

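def type_1_noise_rate(df):
    # fraction of truly legit rows that were mislabeled as fraud
    legit = actual_legit_amount(df)
    if legit > 0:
        return type_1_noise_amount(df)/legit
    else:
        return None

def type_2_noise_rate(df):
    # fraction of truly fraudulent rows that were mislabeled as legit
    fraud = actual_fraud_amount(df)
    if fraud > 0:
        return type_2_noise_amount(df)/fraud
    else:
        return None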
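
# Illustrative sketch, not part of the original module. The helpers above read
# 'label' as the observed label and 'noise' as a 0/1 flag marking flipped rows,
# so the pre-noise label can be recovered as label XOR noise. The function name
# is hypothetical; it only restates the type 1 / type 2 definitions in one place.
def _noise_summary_sketch(df):
    observed = df['label'].astype(int)
    flipped = df['noise'].astype(int)
    true_label = observed ^ flipped
    true_legit = int((true_label == 0).sum())
    true_fraud = int((true_label == 1).sum())
    type_1 = int(((observed == 1) & (flipped == 1)).sum())  # true 0 observed as 1
    type_2 = int(((observed == 0) & (flipped == 1)).sum())  # true 1 observed as 0
    return {
        'type_1_rate': type_1 / true_legit if true_legit else None,
        'type_2_rate': type_2 / true_fraud if true_fraud else None,
    }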

def prepare_data_fdb(key, drop_text_enr_features=True):    
    """
    main function: loads a dataset from FDB and preprocesses/cleans it so it is suitable
    for modeling, then returns the data and metadata
    
    inputs: 
        key - the FDB dataset to load
        drop_text_enr_features - whether to drop text/enrichable features
    returns:
        df - full pandas dataframe containing features, labels and metadata.
            This includes both training and test data, with a 'dataset' column to indicate which.
            All of these datasets have a timestamp column (even if it is "fake"); the training
            rows are sorted by this column, and all test > train w.r.t. this timestamp
            
        features - list of feature names
        cat_features - list of categorical feature names (subset of features)
        label - name of label column (as a one-element list)
        record_id - name of unique id column
        
        if drop_text_enr_features is False, txt_features and enr_features are also returned,
        placed between cat_features and label
    """
    
    obj = FraudDatasetBenchmark(key=key)
    
    print(obj.key)
    
    # extract training and testing data (and test labels) from the return object
    # sort training data by event timestamp
    train_df = obj.train.sort_values(by='EVENT_TIMESTAMP',ignore_index=True)
    test_df = obj.test.reset_index(drop=True)
    test_labels = obj.test_labels.reset_index(drop=True)

    # define metadata and label column names
    metadata = ['EVENT_LABEL', 'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID',
                'label', 'LABEL_TIMESTAMP', 'noise', 'dataset']
    label = ['label']
    
    # we maintain a feature dictionary in another file, this helps us determine which are categorical, numerical, etc.
    feature_dict = FD[key]
    raw_features = feature_dict.keys()
    num_features = [f for f in raw_features if feature_dict[f] == 'numeric']
    cat_features = [f for f in raw_features if feature_dict[f] == 'categorical']
    txt_features = [f for f in raw_features if feature_dict[f] == 'text']
    enr_features = [f for f in raw_features if feature_dict[f] == 'enrichable']
    
    # add / rename labels
    train_df.rename({'EVENT_LABEL':'label'}, axis=1, inplace=True)
    test_df['label'] = test_labels['EVENT_LABEL']
    if key == 'twitterbot':
        train_df.loc[train_df.label == 'bot', 'label'] = 1
        test_df.loc[test_df.label == 'bot', 'label'] = 1
        train_df.loc[train_df.label == 'human', 'label'] = 0
        test_df.loc[test_df.label == 'human', 'label'] = 0

    # put train / test into single dataframe, create a 'dataset' column to keep track
    train_df['dataset'] = 'train'    
    test_df['dataset'] = 'test'

    # create noise column - we won't generate any noise now but it may be useful to have (can also be ignored)
    train_df['noise'] = 0
    test_df['noise'] = 0

    # concatenate train/test into single dataframe 
    # (remember we have 'dataset' column to separate them again if needed)
    df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
    
    # there are a few date columns that are timestamps, we convert those to epoch
    # the new values are put into new columns, those column names are added to the numerical features
    if key == 'twitterbot':
        df['eng_created_at'] = df['created_at'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp())
        num_features.append('eng_created_at')
    if key == 'sparknov':
        df['eng_dob'] = df['dob'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d').timestamp())
        num_features.append('eng_dob')
    
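    # fakejob has a salary range column, e.g. "10000-20000", that can be converted into two numerical columns
    if key == 'fakejob':
        def convert(x):
            # returns (min, max); falls back to (0, 0) when the value is missing or malformed
            r = re.search(r"([0-9]*)-([0-9]*)",str(x))
            try:
                m, M = r.group(1), r.group(2)
                if m == '' or M == '':
                    m, M = 0,0
            except AttributeError:
                # re.search returned None, i.e. the value contains no "-"
                m, M = 0,0
            return m,M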

        df['salary_min'], df['salary_max'] = zip(*df['salary_range'].map(convert))
        num_features = num_features + ['salary_min','salary_max']
    
    # vehicleloan has a timestamp column that we convert to epoch
    # it also has "account age" and "credit history" length cols 
    # in form "Xyrs Ymon" that can be converted to numeric
    if key == 'vehicleloan':
        df['eng_dob'] = df['date_of_birth'].apply(lambda x : datetime.strptime(x, '%d-%m-%Y').timestamp())
        
        def convert(x):
            r = re.search(r"([0-9]*)yrs ([0-9]*)mon", x)
            try:
                age = 12*float(r.group(1)) + float(r.group(2))
            except:
                age = 0
            return age
    
        df['eng_average_acct_age'] = df['average_acct_age'].apply(convert)
        df['eng_credit_history_length'] = df['credit_history_length'].apply(convert)
        num_features = num_features + ['eng_dob','eng_average_acct_age','eng_credit_history_length']
    
    # by default we will drop any remaining text or enrichable (IP address) features as we won't use them
    # but you can pass in False for this if they are of interest
    # 'features' is defined in both cases so the return statement below never hits an undefined name
    features = num_features + cat_features
    if drop_text_enr_features:
        df.drop(txt_features + enr_features, axis=1, inplace=True)
    
    # cast all numeric features to float just in case they aren't, filling missing values with 0
    # (assigning the result back avoids relying on inplace fillna through a column selection)
    for feature in num_features:
        df[feature] = df[feature].astype('float64').fillna(0)
    
    # cast all categorical features to str in case they aren't
    for feature in cat_features:
        df[feature] = df[feature].astype(str).fillna('')
    
    # rename the timestamp column
    df.rename({'EVENT_TIMESTAMP':'creation_date'}, axis=1, inplace=True)    
    
    # cast the label to int just to be sure
    df['label'] = df['label'].astype('int')
    
    # name of unique id column will always be EVENT_ID
    record_id = 'EVENT_ID'
    
    if drop_text_enr_features:
        return df, features, cat_features, label, record_id
    else:
        return df, features, cat_features, txt_features, enr_features, label, record_id


def add_noise(df, noise_type, noise_amount, *, time_index=None, features=None, cat_features=None, label=None):

    if noise_type not in ['random', 'time-dependent', 'boundary-consistent']:
        raise ValueError(f'Invalid Noise Type: {noise_type}')
    
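    # if we want time-dependent noise it will be useful to convert timestamps into epoch
    # note: despite the name, datetime.timestamp() returns seconds, not milliseconds;
    # the values are only used as relative sampling weights, so the unit does not matter
    def convert_to_millis(x):
        try:
            m = datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').timestamp()
        except ValueError:
            m = datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp()
        return m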

    # random noise can be class-conditional in both directions (other types of noise cannot)
    # if noise_amount is passed in as [r,s] we can flip labels in both directions: 
    #    r is percent of 0s flipped to 1s
    #    s is percent of 1s flipped to 0s
    # for random noise, if noise_amount is a single number, assume it is s, and that r=0 
    #   (i.e. class-conditional noise where only 1s get flipped to 0s)
    if isinstance(noise_amount, (tuple, list)): 
        if noise_type != 'random':
            raise(Exception('For time-dependent and boundary-consistent noise, '
                            'only a single value is allowed for noise_amount'))
        r = noise_amount[0]
        s = noise_amount[1]
    else:
        r = 0
        s = noise_amount
    
    # we will add noise to a *copy* of the dataframe
    df_copy = df.copy()
    
    if noise_type == 'time-dependent':
        df_copy['event_millis'] = df_copy[time_index].apply(convert_to_millis)
        df_copy['event_millis'] = df_copy['event_millis'] - df_copy['event_millis'].min()    
        mislabel = df_copy[(df_copy.noise == 0) 
                           & (df_copy.label == 1)].sample(frac = s, 
                                                                 weights=df_copy['event_millis']).index
        df_copy.loc[mislabel,'noise'] = 1
        df_copy.loc[mislabel,'label'] = 0
    else:
        if noise_type == 'boundary-consistent':
            from catboost import CatBoostClassifier
            warnings.filterwarnings("ignore", category=FutureWarning)
            target_encoder = TargetEncoder(cols=cat_features)
            reshaped_y = df_copy[label].values.reshape(df_copy[label].shape[0],)
            X = target_encoder.fit_transform(df_copy[features], reshaped_y)
            clf = CatBoostClassifier(verbose=False)
            clf.fit(X, reshaped_y)
            _, noisy_labels = BCNoise(clf, noise_level=s).simulate_noise(X, reshaped_y)
        else:        
            lcm = np.array([[1-r,r],[s,1-s]])
            noisy_labels = flip_labels_cc(df_copy.label,lcm)

        idx = (df_copy.label != noisy_labels)
        df_copy.loc[idx,'noise'] = 1
        df_copy['label'] = noisy_labels
    
    return df_copy
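

# Illustrative sketch, not part of the original module and not skclean's
# implementation: class-conditional label flipping under the same [r, s]
# convention used by add_noise (r = probability that a 0 becomes a 1,
# s = probability that a 1 becomes a 0), written with plain NumPy.
# The function name and the seed argument are hypothetical.
def _flip_labels_sketch(y, r, s, seed=0):
    import numpy as np
    rng = np.random.default_rng(seed)
    y = np.asarray(y).astype(int)
    flip_prob = np.where(y == 0, r, s)           # per-row flip probability
    flips = rng.random(y.shape[0]) < flip_prob   # independent draw for each row
    return np.where(flips, 1 - y, y)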


def train_valid_split(df, split=0.7, shuffle=True, sort_key='creation_date'):
    if shuffle:
        df = df.sample(frac=1).reset_index(drop=True)
    else:
        df = df.sort_values(by=sort_key, ignore_index=True)
    train_idx = int(round(split*df.shape[0]))
    train = df[:train_idx].reset_index(drop=True)
    valid = df[train_idx:].reset_index(drop=True)
    
    return train, valid
    

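def prepare_noisy_dataset(key, noise_type, noise_amount, split=0.7, shuffle=True, 
                          sort_key='creation_date', target_encoding=False):
    """
    this function can be used to fetch datasets from FDB, 
    starts by calling prepare_data_fdb and then adds noise to the training portion only
    (the test set is left clean)
    
    input: 
        key - name of FDB dataset
        noise_type - what type of noise to add ('random', 'time-dependent' or 'boundary-consistent')
        noise_amount - how much noise to add (a single rate, or [r, s] for 'random' noise)
        split - training/validation split
        shuffle - whether to shuffle (True) or sort (False) before doing train/valid split
        sort_key - key to use to sort for train/valid split as well as weight for time-dependent noise
        target_encoding - whether to target-encode categorical features (fit on the training split only)
    returns:
        dataset - dict with train/valid/test dataframes, feature and label names, and noise statistics
    """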
    
    # start by getting clean dataset
    
    df, features, cat_features, label, record_id = prepare_data_fdb(key)

    if noise_type == 'boundary-consistent':
        train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, 
                                    time_index=sort_key, features=features, cat_features=cat_features, label=label)
    else:
        train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, time_index=sort_key)
        
    train, valid = train_valid_split(train_and_valid, split, shuffle=shuffle, sort_key=sort_key)
    test = df[df.dataset == 'test'].reset_index(drop=True)
    
    train = train[features + ['noise'] + label]
    valid = valid[features + ['noise'] + label]
    test = test[features + ['noise'] + label]
    
    if target_encoding:
        warnings.filterwarnings("ignore", category=FutureWarning)
        target_encoder = TargetEncoder(cols=cat_features)
        reshaped_y = train[label].values.reshape(train[label].shape[0],)
        train.loc[:, features] = target_encoder.fit_transform(train[features], reshaped_y)
        valid.loc[:, features] = target_encoder.transform(valid[features])
        test.loc[:, features] = target_encoder.transform(test[features])
        cat_features = None
    
    dataset = {
        'description': f"{key} dataset with noise type: {noise_type}, noise amount: {noise_amount} ",
        'features':features,
        'cat_features':cat_features,
        'label':label,
        'record_id':record_id,
        'train':train,
        'valid':valid,
        'test':test, 
        'noise':(noise_rate(train), noise_rate(valid), noise_rate(test)),
        'fraud_level':(actual_fraud_rate(train), actual_fraud_rate(valid), actual_fraud_rate(test)),
        'observed_fraud_level':(observed_fraud_rate(train),observed_fraud_rate(valid),observed_fraud_rate(test)),
        'type_1_noise_rate':(type_1_noise_rate(train),type_1_noise_rate(valid),type_1_noise_rate(test)),
        'type_2_noise_rate':(type_2_noise_rate(train),type_2_noise_rate(valid),type_2_noise_rate(test))
    }
        
    return dataset


def dataset_stats(dataset):
    noise = dataset['noise']
    fraud_level = dataset['fraud_level']
    observed_fraud_level = dataset['observed_fraud_level']
    type_1_noise_rate = dataset['type_1_noise_rate']
    type_2_noise_rate = dataset['type_2_noise_rate']
    stats = list(zip(['train','valid','test'],noise,type_1_noise_rate,type_2_noise_rate,fraud_level,observed_fraud_level))
    print(dataset['description'])
    for stat in stats:
        print('{} - total noise rate: {:.3f}, type 1 noise rate: {:.3f}, type 2 noise rate: {:.3f},\n'
                '(actual) fraud rate: {:.3f}, observed fraud rate: {:.3f}'.format(*stat))
        

================================================
FILE: scripts/reproducibility/label-noise/micro_models.py
================================================
import logging
import pandas as pd
import numpy as np


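class MicroModelError(Exception):
    """
    basic exception type for micro-model specific errors
    """
    def __init__(self, error_message):
        # pass the message to Exception so it also appears in tracebacks and str(e)
        super().__init__(error_message)
        logging.error(error_message)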


class MicroModel:
    """
    Basic wrapper for the model to be used in ensemble noise removal, ModelClass can be anything that implements
    fit and predict_proba. Mainly used by MicroModelEnsemble, user is probably not calling this directly
    """

    def __init__(self, ModelClass, *args, **kwargs):
        """
        initialization of the class, ModelClass should be a *class* not an object
        e.g. CatBoostClassifier, not CatBoostClassifier()
        """
        self.clf = ModelClass(*args, **kwargs)
        self.thresh = None

    def set_thresh(self, thresh):
        # can set a threshold to be used in model predictions
        self.thresh = thresh

    def fit(self, x, y, *args, **kwargs):
        # pass-through method to call model.fit()
        self.clf.fit(x, y.values.ravel(), *args, **kwargs)

    def predict_proba(self, x, *args, **kwargs):
        # pass-through method to call model.predict_proba()
        if 'predict_proba' in dir(self.clf):
            return self.clf.predict_proba(x, *args, **kwargs)
        else:
            raise (MicroModelError('ModelClass must implement predict_proba'))

    def predict(self, x):
        # make predictions, using either defined threshold (if set) or default value of 0.5
        if self.thresh is not None:
            t = self.thresh
        else:
            t = 0.5
        scores = self.predict_proba(x)[:, 1]
        preds = [int(s > t) for s in scores]
        return scores, preds


class MicroModelEnsemble:
    """
    Ensemble of micro-models used to remove noise
    """

    def __init__(self, ModelClass, num_clfs=16, score_type='preds_avg', *args, **kwargs):
        """
        initialization of the class, ModelClass should be a *class* not an object
        e.g. CatBoostClassifier, not CatBoostClassifier()
        params:
        ModelClass - base class to use, needs to implement fit and predict_proba
        num_clfs - number of classifiers to use in cleaning ensemble
        score_type - means of computing anomaly score from micro-model scores
        args/kwargs - any other parameters to pass to model constructor, e.g. cat_features or iterations for CatBoost
        """
        self.score_type = score_type
        
        if type(num_clfs) is not int or num_clfs <= 0:
            raise (MicroModelError('num_clfs must be a positive integer'))
        self.ModelClass = ModelClass

        # one classifier that will be trained over entire dataset
        # (ModelClass is passed positionally so that extra *args do not collide with it)
        self.big_clf = MicroModel(ModelClass, *args, **kwargs)

        # micro-models to later be trained over slices
        self.num_clfs = num_clfs
        self.clfs = []
        for i in range(num_clfs):
            self.clfs.append(MicroModel(ModelClass, *args, **kwargs))
        self.thresholds = {}

    def fit(self, x, y, *args, **kwargs):
        # assumption that data is already shuffled or sorted (by date or other appropriate key)
        # according to the usecase

        if not isinstance(y, pd.DataFrame):
            y = pd.DataFrame(y)

        # fit one classifier on all the data
        self.big_clf.fit(x, y, *args, **kwargs)

        # now fit individual models on slices of data
        stride = round(x.shape[0] / self.num_clfs)
        for i, clf in enumerate(self.clfs):
            idx = slice(i * stride, min((i + 1) * stride, x.shape[0]))
            x_i = x.iloc[idx, :]
            y_i = y.iloc[idx, :]
            clf.fit(x_i, y_i, *args, **kwargs)

    def predict_proba(self, x, *args, **kwargs):
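        # 'preds_avg': output is the mean of the (binary) predictions of all models in the ensemble,
        #   i.e. the fraction of models that voted 1 for the example
        # 'score_avg': output is the mean of the models' predicted probabilities for class 1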
        results = pd.DataFrame(index=np.arange(x.shape[0]))
        if self.score_type == 'preds_avg':
            for i, clf in enumerate(self.clfs):
                _, results[i] = clf.predict(x, *args, **kwargs)
        elif self.score_type == 'score_avg':
            for i, clf in enumerate(self.clfs):
                results[i] = clf.predict_proba(x, *args, **kwargs)[:, 1]

        scores = results.mean(axis=1, numeric_only=True)
        return scores

    def predict(self, x, threshold=0.5, *args, **kwargs):
        # compare output of predict_proba to a threshold in order to make a binary prediction, default is 0.5
        scores = self.predict_proba(x)
        preds = np.array([int(s >= threshold) for s in scores])
        return scores, preds

    def filter_noise(self, x, y, pulearning=True, threshold=0.5):
        # compare ensemble predictions to observed labels and return the examples that are NOT considered noise
        # i.e. this is noise REMOVAL
        # pulearning=True means a class-conditional assumption is being made:
        # there are no examples of true 0s mislabeled as 1s
        scores, susp = self.predict(x, threshold)
        if pulearning:
            conf = ((y == 1) | ((y == 0) & (susp == 0)))
        else:
            conf = (((y == 1) & (scores > 1 - threshold)) | ((y == 0) & (scores < threshold)))

        return x[conf].reset_index(drop=True), y[conf]

    def clean_noise(self, x, y, pulearning=True, threshold=0.5):
        # compare ensemble predictions to observed labels and return all examples with corrected labels
        # i.e. this is noise CLEANING
        # pulearning=True means a class-conditional assumption is being made:
        # there are no examples of true 0s mislabeled as 1s
        x = x.copy()
        y = y.copy()
        _, susp = self.predict(x, threshold)
        # flip all the probable 1s to actual 1s
        probable_1 = (y == 0) & (susp == 1)
        y[probable_1] = 1
        if not pulearning:
            # if there are both types of noise, flip probable 0s to actual 0s
            probable_0 = (y == 1) & (susp == 0)
            y[probable_0] = 0

        return x, y


class MicroModelCleaner:
    """
    This class performs the entire model training process end to end: given a dataset, it first trains an ensemble,
    then removes (or relabels) noisy examples, then trains a final model on the clean data.
    """

    def __init__(self, ModelClass, strategy='filter', pulearning=True, num_clfs=16, threshold=0.5, *args, **kwargs):
        """
        initialization of the class, ModelClass should be a *class* not an object
        e.g. CatBoostClassifier, not CatBoostClassifier()
        params:
        ModelClass - base class to use, needs to implement fit and predict_proba
        strategy - whether to remove noise ('filter') or flip labels ('clean')
        pulearning - class-conditional assumption, if True assume there are no true 0s mislabeled as 1s
        num_clfs - number of classifiers to use in cleaning ensemble
        threshold - fraction of classifiers that have to vote to remove noise (0.5 is majority voting)
        args/kwargs - any other parameters to pass to model constructor, e.g. cat_features or iterations for CatBoost
        """
        self.detector = MicroModelEnsemble(ModelClass, num_clfs, *args, **kwargs)
        self.clf = ModelClass(*args, **kwargs)
        if strategy.lower() not in ['filter', 'clean']:
            raise MicroModelError('strategy must be filter or clean')
        self.strategy = strategy.lower()
        self.pulearning = pulearning
        self.threshold = threshold

    def fit(self, x, y, *args, **kwargs):
        # first train the Ensemble to deal with the noise
        self.detector.fit(x, y, *args, **kwargs)
        if self.strategy == 'filter':
            x_clean, y_clean = self.detector.filter_noise(x, y, self.pulearning, self.threshold)
        else:
            x_clean, y_clean = self.detector.clean_noise(x, y, self.pulearning, self.threshold)

        # then train final model on clean data
        self.clf.fit(x_clean, y_clean, *args, **kwargs)

    def predict(self, x, *args, **kwargs):
        return self.clf.predict(x, *args, **kwargs)

    def predict_proba(self, x, *args, **kwargs):
        return self.clf.predict_proba(x, *args, **kwargs)
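
# Illustrative sketch (not part of the repository): the micro-model idea above, reduced to
# plain Python. One small model is fit per contiguous slice of the data, the ensemble votes
# on every example, and under the PU assumption only observed 0s that the ensemble flags as
# positive are dropped. `CentroidModel` is a hypothetical toy 1-D classifier standing in for
# ModelClass, and all helper names are assumptions made for this example only.

```python
from statistics import mean


class CentroidModel:
    """Hypothetical toy 1-D classifier: predicts the class whose mean is closer."""

    def fit(self, xs, ys):
        self.mu0 = mean(x for x, y in zip(xs, ys) if y == 0)
        self.mu1 = mean(x for x, y in zip(xs, ys) if y == 1)
        return self

    def predict(self, xs):
        return [int(abs(x - self.mu1) < abs(x - self.mu0)) for x in xs]


def fit_micro_models(xs, ys, num_clfs):
    """Fit one model per contiguous slice of the (already ordered) data."""
    stride = round(len(xs) / num_clfs)
    models = []
    for i in range(num_clfs):
        lo, hi = i * stride, min((i + 1) * stride, len(xs))
        models.append(CentroidModel().fit(xs[lo:hi], ys[lo:hi]))
    return models


def vote_fraction(models, xs):
    """Fraction of models predicting the positive class for each example."""
    votes = [m.predict(xs) for m in models]
    return [mean(per_example) for per_example in zip(*votes)]


def filter_noise_pu(xs, ys, scores, threshold=0.5):
    """Keep every observed 1, and every observed 0 the ensemble does not flag."""
    kept = [(x, y) for x, y, s in zip(xs, ys, scores) if y == 1 or s < threshold]
    return [x for x, _ in kept], [y for _, y in kept]


xs = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.95, 0.05]
ys = [0, 1, 0, 1, 0, 1, 0, 0]  # x = 0.95 looks like a positive that was labeled 0
models = fit_micro_models(xs, ys, num_clfs=2)
scores = vote_fraction(models, xs)
clean_xs, clean_ys = filter_noise_pu(xs, ys, scores)
```

# In this toy data each slice contains both classes; a slice with a single class would
# break the toy model, and may similarly be a concern for real base models.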


================================================
FILE: setup.py
================================================
import os
from glob import glob

from setuptools import find_packages, setup


setup(
    name='fraud_dataset_benchmark',
    version='1.0',
    
    # declare your packages
    packages=find_packages(where='src', exclude=('test',)),
    package_dir={'': 'src'},
    include_package_data=True,
    data_files=[('.',[
        'src/fdb/versioned_datasets/ipblock/20220607.zip',
    ])],

    # Enable build-time format checking
    check_format=False,

    # Enable type checking
    test_mypy=False,

    # Enable linting at build time
    test_flake8=False,

    # exclude_package_data={
    #     '': glob('fdb/*/__pycache__', recursive=True),
    # }
)


================================================
FILE: src/__init__.py
================================================


================================================
FILE: src/fdb/__init__.py
================================================


================================================
FILE: src/fdb/datasets.py
================================================
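from abc import abstractmethod, ABC
import numpy as np  # used by eval(); imported explicitly rather than relying on the star import below
from fdb.preprocessing import *
from fdb.preprocessing_objects import load_data
from sklearn.metrics import roc_auc_score, roc_curve, auc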

class FraudDatasetBenchmark(ABC):
    def __init__(
        self, 
        key, 
        load_pre_downloaded=False,
        delete_downloaded=True,
        add_random_values_if_real_na = {
            "EVENT_TIMESTAMP": True,
            "LABEL_TIMESTAMP": True,
            "ENTITY_ID": True,
            "ENTITY_TYPE": True,
            "EVENT_ID": True
            }):
        self.key = key
        self.obj = load_data(self.key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na)
    
    @property
    def train(self):
        return self.obj.train

    @property
    def test(self):
        return self.obj.test

    @property
    def test_labels(self):
        return self.obj.test_labels

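    def eval(self, y_pred):
        """
        Evaluate predicted scores against the test-set labels.

        Returns a dict with the ROC AUC ('roc_score') and the true positive rate
        at a 1% false positive rate ('tpr_1fpr').
        """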
        roc_score = roc_auc_score(self.test_labels['EVENT_LABEL'], y_pred)
        fpr, tpr, thres = roc_curve(self.test_labels['EVENT_LABEL'], y_pred)
        tpr_1fpr = np.interp(0.01, fpr, tpr)
        metrics = {'roc_score': roc_score, 'tpr_1fpr': tpr_1fpr}
        return metrics
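
# Illustrative sketch (not part of the repository): how a "TPR at 1% FPR" figure such as
# 'tpr_1fpr' can be read off a ROC curve by linear interpolation. This is a plain-Python
# approximation of the roc_curve + np.interp combination used in eval(), written under the
# assumption that both classes are present; the function names are hypothetical and the
# result is not guaranteed to match scikit-learn exactly in every edge case.

```python
from bisect import bisect_right


def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos  # assumes both classes are present
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # tied scores share one threshold, so emit a point only when the score changes
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def tpr_at_fpr(y_true, scores, target_fpr=0.01):
    """Linearly interpolate the ROC curve at the requested false positive rate."""
    fpr, tpr = roc_points(y_true, scores)
    j = bisect_right(fpr, target_fpr) - 1  # last point with fpr <= target_fpr
    if j >= len(fpr) - 1:
        return tpr[-1]
    slope = (tpr[j + 1] - tpr[j]) / (fpr[j + 1] - fpr[j])
    return tpr[j] + (target_fpr - fpr[j]) * slope
```

# Usage: tpr_at_fpr(labels, scores) takes 0/1 labels and real-valued scores of equal length.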


================================================
FILE: src/fdb/kaggle_configs.py
================================================
KAGGLE_CONFIGS = {

    "fakejob":
    {
        "owner": "shivamb",
        "dataset": "real-or-fake-fake-jobposting-prediction",
        "filename": 'fake_job_postings.csv',
        "name": "Real / Fake Job Posting Prediction",
        "type": "datasets",
        "version": 1
    },    

    "vehicleloan":
    {
        "owner": "avikpaul4u",
        "dataset": "vehicle-loan-default-prediction",
        "filename": 'train.csv',
        "name": "Vehicle Loan Default Prediction",
        "type": "datasets",
        "version": 4
    },

    "malurl":
    {
        "owner": "sid321axn",
        "dataset": "malicious-urls-dataset",
        "filename": 'malicious_phish.csv',
        "name": "Malicious URLs Dataset",
        "type": "datasets",
        "version": 1
    },

    "ieeecis": 
    {
        "owner": "ieee-fraud-detection",
        "name": "IEEE-CIS Fraud Detection",
        "type": "competitions",
    },

    "ccfraud": 
    {
        "owner": "mlg-ulb",
        "dataset": "creditcardfraud",
        "filename": 'creditcard.csv',
        "name": "Credit Card Fraud Detection",
        "type": "datasets",
        "version": 3
    },

    "fraudecom": 
    {
        "owner": "vbinh002",
        "dataset": "fraud-ecommerce",
        "filename": 'Fraud_Data.csv',
        "name": "Fraud ecommerce",
        "type": "datasets",
        "version": 1
    },

    "sparknov":
    {
        "owner": "kartik2112",
        "dataset": "fraud-detection",
        "name": "Simulated Credit Card Transactions generated using Sparkov",
        "type": "datasets",
        "version": 1
    },

    "twitterbot":
    {
        "owner": "davidmartngutirrez",
        "dataset": "twitter-bots-accounts",
        "filename": "twitter_human_bots_dataset.csv",
        "name": "Twitter Bots Accounts",
        "type": "datasets",
        "version": 2
    }
}

================================================
FILE: src/fdb/preprocessing.py
================================================


import os
import re
import shutil
import kaggle
import pkgutil
import requests
import zipfile
import numpy as np
from abc import ABC
import pandas as pd
import socket, struct
from faker import Faker
from zipfile import ZipFile
from datetime import datetime
from datetime import timedelta
from io import StringIO, BytesIO
from dateutil.relativedelta import relativedelta

from fdb.kaggle_configs import KAGGLE_CONFIGS

fake = Faker(['en_US'])


# Naming convention for the meta data columns in standardized datasets
_EVENT_TIMESTAMP = 'EVENT_TIMESTAMP'  # timestamp column
_ENTITY_TYPE = 'ENTITY_TYPE'  # afd specific requirement
_EVENT_LABEL = 'EVENT_LABEL'  # label column
_EVENT_ID = 'EVENT_ID'  # transaction/event id 
_ENTITY_ID = 'ENTITY_ID'  # represents user/account id
_LABEL_TIMESTAMP = 'LABEL_TIMESTAMP'  # added in cases where the entity id is meaningful

# Kaggle config related strings
_OWNER = 'owner'
_COMPETITIONS = 'competitions'
_TYPE = 'type'
_FILENAME = 'filename'
_DATASETS = 'datasets'
_DATASET = 'dataset'
_VERSION = 'version'

# Some fixed parameters
_RANDOM_STATE = 1
_CWD = os.getcwd()
_DOWNLOAD_LOCATION = os.path.join(_CWD, 'tmp')
_TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%SZ'
_DEFAULT_LABEL_TIMESTAMP = datetime.now().strftime(_TIMESTAMP_FORMAT)


class BasePreProcessor(ABC):
    def __init__(
        self, 
        key = None, 
        train_percentage = 0.8,
        timestamp_col = None, 
        label_col = None, 
        label_timestamp_col = None,
        event_id_col = None,
        entity_id_col = None,
        features_to_drop = [],
        load_pre_downloaded = False,
        delete_downloaded = True,
        add_random_values_if_real_na = {
            "EVENT_TIMESTAMP": True,
            "LABEL_TIMESTAMP": True,
            "ENTITY_ID": True,
            "ENTITY_TYPE": True,
            "EVENT_ID": True
            }
        ):
        
        self.key = key 
        self.train_percentage = train_percentage
        self.features_to_drop = features_to_drop
        self.delete_downloaded = delete_downloaded
        
        self._timestamp_col = timestamp_col
        self._label_col = label_col
        self._label_timestamp_col = label_timestamp_col
        self._event_id_col = event_id_col
        self._entity_id_col = entity_id_col
        self._add_random_values_if_real_na = add_random_values_if_real_na
        
        # Simply get all required objects at the time of object creation
        if KAGGLE_CONFIGS.get(self.key) and not load_pre_downloaded:
            self.download_kaggle_data()  # download the data when an object is created
        self.load_data()
        self.preprocess()
        self.train_test_split()


    def _download_kaggle_data_from_competetions(self):
        file_name = KAGGLE_CONFIGS[self.key][_OWNER]
        kaggle.api.competition_download_files(
            competition = KAGGLE_CONFIGS[self.key][_OWNER],
            path = _DOWNLOAD_LOCATION
        )
        return file_name

    def _download_kaggle_data_from_datasets_with_given_filename(self):
        file_name = KAGGLE_CONFIGS[self.key][_FILENAME]
        response = kaggle.api.datasets_download_file(
            owner_slug = KAGGLE_CONFIGS[self.key][_OWNER],
            dataset_slug = KAGGLE_CONFIGS[self.key][_DATASET],
            file_name = file_name,
            dataset_version_number=KAGGLE_CONFIGS[self.key][_VERSION],
            _preload_content = False,
        )
        with open(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'wb') as f:
            f.write(response.data)
        return file_name

    def _download_kaggle_data_from_datasets_containing_single_file(self):
        file_name = KAGGLE_CONFIGS[self.key][_DATASET]
        kaggle.api.dataset_download_files(
            # Kaggle dataset slugs are always 'owner/dataset'; os.path.join would use '\\' on Windows
            dataset = '/'.join([KAGGLE_CONFIGS[self.key][_OWNER], KAGGLE_CONFIGS[self.key][_DATASET]]),
            path = _DOWNLOAD_LOCATION
        )
        return file_name

    def download_kaggle_data(self):
        """
        Download and extract the data from Kaggle. Puts the data in tmp directory within current directory.
        """

        if not os.path.exists(_DOWNLOAD_LOCATION):
            os.mkdir(_DOWNLOAD_LOCATION)

        print('Data download location', _DOWNLOAD_LOCATION)
            
        
        if KAGGLE_CONFIGS[self.key][_TYPE] == _COMPETITIONS:
            file_name = self._download_kaggle_data_from_competetions()
                 
        elif KAGGLE_CONFIGS[self.key][_TYPE] == _DATASETS:
            # If filename is given, download single file,
            # Else download all files.
            if KAGGLE_CONFIGS[self.key].get(_FILENAME):
                file_name = self._download_kaggle_data_from_datasets_with_given_filename()
            else:
                file_name = self._download_kaggle_data_from_datasets_containing_single_file()
                
        else:
            raise ValueError('Type should be one of competitions or datasets in config')
        
        with zipfile.ZipFile(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'r') as zip_ref:
            zip_ref.extractall(_DOWNLOAD_LOCATION)

    def load_data(self):
        self.df = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION, KAGGLE_CONFIGS[self.key]['filename']), dtype='object')
        # delete downloaded data after loading in memory
        if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION)

    @property
    def timestamp_col(self):
        return self._timestamp_col  # If timestamp not available, will create fake timestamps

    @property
    def label_col(self):
        if self._label_col is None:
            raise ValueError('Label column not specified')
        else:
            return self._label_col

    @property
    def event_id_col(self):
        return self._event_id_col  # If event id not available, will create fake event ids
    
    @property
    def entity_id_col(self):
        return self._entity_id_col

    def standardize_timestamp_col(self):
        if self.timestamp_col is not None:
            self.df[_EVENT_TIMESTAMP] = pd.to_datetime(self.df[self.timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT))
            self.df.drop(self.timestamp_col, axis=1, inplace=True)
        elif self.timestamp_col is None and self._add_random_values_if_real_na[_EVENT_TIMESTAMP]:
            self.df[_EVENT_TIMESTAMP] = self.df[_EVENT_LABEL].apply(
                lambda x: fake.date_time_between(
                    start_date='-1y',   # relative to now; consider using a fixed start date instead
                    end_date='now',
                    tzinfo=None).strftime(_TIMESTAMP_FORMAT))
        
        if self._label_timestamp_col is None and self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 
        elif self._label_timestamp_col is not None:
            self.df[_LABEL_TIMESTAMP] = pd.to_datetime(self.df[self._label_timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT))
            self.df.drop(self._label_timestamp_col, axis=1, inplace=True)

    def standardize_label_col(self):
        self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
        self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].astype(int)

    def standardize_event_id_col(self):
        if self.event_id_col is not None:
            self.df.rename({self.event_id_col: _EVENT_ID}, axis=1, inplace=True)
            self.df[_EVENT_ID] = self.df[_EVENT_ID].astype(str)
        elif self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]: # add fake one if not exist
            self.df[_EVENT_ID] = self.df[_EVENT_LABEL].apply(
                lambda x: fake.uuid4())

            
    def standardize_entity_id_col(self):
        if self.entity_id_col is not None:
            self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True)
        elif self.entity_id_col is None and self._add_random_values_if_real_na[_ENTITY_ID]: # add fake one if not exist
            self.df[_ENTITY_ID] = self.df[_EVENT_LABEL].apply(
                lambda x: fake.uuid4())

    def rename_features(self):
        rename_map = {} # default is empty map that won't rename any columns
        self.df.rename(rename_map, axis=1, inplace=True)

    def subset_features(self):
        features_to_select = self.df.columns.tolist()
        self.df = self.df[features_to_select]  # all by default
    
    def drop_features(self):
        self.df.drop(self.features_to_drop, axis=1, inplace=True)

    def add_meta_data(self):
        if self._add_random_values_if_real_na[_ENTITY_TYPE]: 
            self.df[_ENTITY_TYPE] = 'user'

    def sort_by_timestamp(self):
        self.df.sort_values(by=_EVENT_TIMESTAMP, ascending=True, inplace=True)

    def lower_case_col_names(self):
        self.df.columns = [s.lower() for s in self.df.columns]
        
    def preprocess(self):
        self.lower_case_col_names()
        self.standardize_label_col()
        self.standardize_event_id_col()
        self.standardize_entity_id_col()
        self.standardize_timestamp_col()
        self.add_meta_data()
        self.rename_features()
        self.subset_features()
        self.drop_features()
        if self.timestamp_col:
            self.sort_by_timestamp()

    def train_test_split(self):
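        """
        Default is an out-of-time split: the first 80% of rows (sorted by timestamp) go to training
        and the remaining 20% to testing. If no timestamp column is available, the split is random.
        """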
        if self.timestamp_col: 
            split_pt = int(self.df.shape[0]*self.train_percentage)
            self.train = self.df.copy().iloc[:split_pt, :]
            self.test = self.df.copy().iloc[split_pt:, :]
        else:  # random if no timestamp col available
            self.train = self.df.sample(frac=self.train_percentage, random_state=_RANDOM_STATE)
            self.test = self.df.copy()[~self.df.index.isin(self.train.index)]
            self.test.reset_index(drop=True, inplace=True)
        
        self.test_labels = self.test[[_EVENT_LABEL]].copy()  # copy so adding a column below does not write to a slice
        if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:
            self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]
        self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore")
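
# Illustrative sketch (not part of the repository): the out-of-time split described above,
# shown on plain dictionaries instead of a DataFrame. Rows are ordered by timestamp and the
# earliest `train_percentage` share becomes the training set. The function name and the
# sample rows are assumptions made for this example only.

```python
def out_of_time_split(rows, train_percentage=0.8, key="EVENT_TIMESTAMP"):
    """Split rows so that every training row precedes every test row in time."""
    # ISO-8601 strings in a fixed format sort chronologically as plain strings
    ordered = sorted(rows, key=lambda r: r[key])
    split_pt = int(len(ordered) * train_percentage)
    return ordered[:split_pt], ordered[split_pt:]


rows = [
    {"EVENT_TIMESTAMP": "2021-03-01T00:00:00Z", "EVENT_LABEL": 0},
    {"EVENT_TIMESTAMP": "2021-01-01T00:00:00Z", "EVENT_LABEL": 1},
    {"EVENT_TIMESTAMP": "2021-05-01T00:00:00Z", "EVENT_LABEL": 0},
    {"EVENT_TIMESTAMP": "2021-02-01T00:00:00Z", "EVENT_LABEL": 0},
    {"EVENT_TIMESTAMP": "2021-04-01T00:00:00Z", "EVENT_LABEL": 1},
]
train_rows, test_rows = out_of_time_split(rows)
```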


class FakejobPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(FakejobPreProcessor, self).__init__(**kw)


class VehicleloanPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(VehicleloanPreProcessor, self).__init__(**kw)


class MalurlPreProcessor(BasePreProcessor):
    """
    This dataset originally has multiple malicious classes.
    We combine all of them into one class to keep the benchmark binary for now.
    """
    def __init__(self, **kw):
        super(MalurlPreProcessor, self).__init__(**kw)

    def standardize_label_col(self):
        self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
        binary_mapper = {
            'defacement': 1,
            'phishing': 1,
            'malware': 1,
            'benign': 0
        }
        
        self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper)

    def add_dummy_col(self):
        self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4())

    def preprocess(self):
        super(MalurlPreProcessor, self).preprocess()
        self.add_dummy_col()

class IEEEPreProcessor(BasePreProcessor):
    """
    Some pre-processing was done using kaggle kernels below.  

    References:
        Data Source: https://www.kaggle.com/c/ieee-fraud-detection/data

        Some processing from: https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600
        Feature selection to reduce to 100: https://www.kaggle.com/code/pavelvpster/ieee-fraud-feature-selection-rfecv/notebook

    """
    def __init__(self, **kw):
        super(IEEEPreProcessor, self).__init__(**kw)

    @staticmethod
    def _dtypes_cols():

        # BASE COLUMNS (54 ENTRIES, NOT COUNTING isFraud)
        cols = ['TransactionID', 'TransactionDT', 'TransactionAmt',
            'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
            'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain',
            'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
            'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8',
            'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4',
            'M5', 'M6', 'M7', 'M8', 'M9']

        # V COLUMNS TO LOAD DECIDED BY CORRELATION EDA
        # https://www.kaggle.com/cdeotte/eda-for-columns-v-and-id
        v =  [1, 3, 4, 6, 8, 11]
        v += [13, 14, 17, 20, 23, 26, 27, 30]
        v += [36, 37, 40, 41, 44, 47, 48]
        v += [54, 56, 59, 62, 65, 67, 68, 70]
        v += [76, 78, 80, 82, 86, 88, 89, 91]

        #v += [96, 98, 99, 104] #relates to groups, no NAN 
        v += [107, 108, 111, 115, 117, 120, 121, 123] # maybe group, no NAN
        v += [124, 127, 129, 130, 136] # relates to groups, no NAN

        # LOTS OF NAN BELOW
        v += [138, 139, 142, 147, 156, 162] #b1
        v += [165, 160, 166] #b1
        v += [178, 176, 173, 182] #b2
        v += [187, 203, 205, 207, 215] #b2
        v += [169, 171, 175, 180, 185, 188, 198, 210, 209] #b2
        v += [218, 223, 224, 226, 228, 229, 235] #b3
        v += [240, 258, 257, 253, 252, 260, 261] #b3
        v += [264, 266, 267, 274, 277] #b3
        v += [220, 221, 234, 238, 250, 271] #b3

        v += [294, 284, 285, 286, 291, 297] # relates to groups, no NAN
        v += [303, 305, 307, 309, 310, 320] # relates to groups, no NAN
        v += [281, 283, 289, 296, 301, 314] # relates to groups, no NAN

        # COLUMNS WITH STRINGS
        str_type = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain','M1', 'M2', 'M3', 'M4','M5',
                    'M6', 'M7', 'M8', 'M9', 'id_12', 'id_15', 'id_16', 'id_23', 'id_27', 'id_28', 'id_29', 'id_30', 
                    'id_31', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo']
        str_type += ['id-12', 'id-15', 'id-16', 'id-23', 'id-27', 'id-28', 'id-29', 'id-30', 
            'id-31', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38']


        cols += ['V'+str(x) for x in v]
        dtypes = {}
        for c in cols+['id_0'+str(x) for x in range(1,10)]+['id_'+str(x) for x in range(10,34)]+\
            ['id-0'+str(x) for x in range(1,10)]+['id-'+str(x) for x in range(10,34)]:
                dtypes[c] = 'float32'
        for c in str_type: dtypes[c] = 'category'

        return dtypes, cols


    def load_data(self):
        """
        Hard coded file names for this dataset as it contains multiple files to be combined
        """

        dtypes, cols = IEEEPreProcessor._dtypes_cols()

        self.df = pd.read_csv(
            os.path.join(_DOWNLOAD_LOCATION,
             'train_transaction.csv'), 
             index_col='TransactionID',
             dtype=dtypes, 
             usecols=cols+['isFraud'])

        self.df_id = pd.read_csv(
            os.path.join(_DOWNLOAD_LOCATION, 
            'train_identity.csv'),
            index_col='TransactionID', 
            dtype=dtypes)
        self.df = self.df.merge(self.df_id, how='left', left_index=True, right_index=True)

        # delete downloaded data after loading in memory
        if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION)

    def normalization(self):
        # NORMALIZE D COLUMNS
        for i in range(1,16):
            if i in [1,2,3,5,9]: continue
            self.df['d'+str(i)] =  self.df['d'+str(i)] - self.df[self.timestamp_col]/np.float32(24*60*60)

    def standardize_entity_id_col(self):
        def _encode_CB(col1, col2, df):
            nm = col1+'_'+col2
            df[nm] = df[col1].astype(str)+'_'+df[col2].astype(str)
        
        _encode_CB('card1', 'addr1', self.df)
        self.df['day'] = self.df[self.timestamp_col] / (24*60*60)
        self.df[_ENTITY_ID] = self.df['card1_addr1'].astype(str) + '_' + np.floor(self.df['day'] - self.df['d1']).astype(str)

    @staticmethod
    def _add_seconds(x):
        init_time = '2021-01-01T00:00:00Z'
        dt_format = _TIMESTAMP_FORMAT
        init_time = datetime.strptime(init_time, dt_format) # start date from last 18 months
        final_time = init_time + timedelta(seconds=x)
        return final_time.strftime(_TIMESTAMP_FORMAT)   

    def standardize_timestamp_col(self):
        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: IEEEPreProcessor._add_seconds(x))        
        self.df.drop(self.timestamp_col, axis=1, inplace=True)
        if self._add_random_values_if_real_na["LABEL_TIMESTAMP"]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 

    def subset_features(self):
        features_to_select = \
         ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1',
         'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11',
         'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160',
         'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264',
         'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02',
         'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo',
         'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID', 'EVENT_LABEL', 'LABEL_TIMESTAMP']
        self.df = self.df.loc[:, self.df.columns.isin(features_to_select)]

    def preprocess(self):
        self.lower_case_col_names()
        self.normalization()  # normalize D columns
        self.standardize_label_col()
        self.standardize_event_id_col()
        self.standardize_entity_id_col()
        self.standardize_timestamp_col()
        self.add_meta_data()
        self.rename_features()
        self.subset_features()  
        if self.timestamp_col:
            self.sort_by_timestamp()


class CCFraudPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(CCFraudPreProcessor, self).__init__(**kw)

    @staticmethod
    def _add_minutes(x):
        dt_format = _TIMESTAMP_FORMAT
        init_time = datetime.strptime('2021-09-01T00:00:00Z', dt_format)  # chose randomly but in last 18 months
        final_time = init_time + timedelta(minutes=x)
        return final_time.strftime(_TIMESTAMP_FORMAT)   

    def standardize_timestamp_col(self):
        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].astype(float).apply(lambda x: CCFraudPreProcessor._add_minutes(x))        
        self.df.drop(self.timestamp_col, axis=1, inplace=True)
        if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 
        
class FraudecomPreProcessor(BasePreProcessor):
    def __init__(self, ip_address_col, signup_time_col, **kw):
        self.ip_address_col = ip_address_col
        self.signup_time_col = signup_time_col
        super(FraudecomPreProcessor, self).__init__(**kw)

    @staticmethod
    def _add_years(init_time):
        dt_format = '%Y-%m-%d %H:%M:%S'
        init_time = datetime.strptime(init_time, dt_format)
        final_time = init_time + relativedelta(years=6)  # move to more recent time range
        return final_time.strftime(_TIMESTAMP_FORMAT) 


    def standardize_timestamp_col(self):

        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: FraudecomPreProcessor._add_years(x))        
        self.df.drop(self.timestamp_col, axis=1, inplace=True)

        # Also add _LABEL_TIMESTAMP to allow training of this dataset with TFI
        if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 

    def process_ip(self):
        """
        This dataset stores the IP address as a number; convert it to a standard dotted-quad IPv4 string.
        """
        self.df[self.ip_address_col] = self.df[self.ip_address_col].astype(float).astype(int).\
                                        apply(lambda x: socket.inet_ntoa(struct.pack('!L', x)))

    def create_time_since_signup(self):
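        # NOTE: `.dt.seconds` is only the seconds component of the timedelta (0-86399, whole days excluded);
        # `.dt.total_seconds()` would give the full elapsed time instead.
        self.df['time_since_signup'] = (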
            pd.to_datetime(self.df[self.timestamp_col]) -\
            pd.to_datetime(self.df[self.signup_time_col])).dt.seconds

    def preprocess(self):
        self.lower_case_col_names()
        self.standardize_label_col()
        self.standardize_event_id_col()
        self.standardize_entity_id_col()
        self.create_time_since_signup()  # One manually engineered feature
        self.standardize_timestamp_col()
        self.add_meta_data()
        self.process_ip()  # This extra step added
        self.rename_features()
        self.drop_features()  # Replace select with drop
        if self.timestamp_col:
            self.sort_by_timestamp()
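
# Illustrative sketch (not part of the repository): the integer-to-IPv4 conversion that
# process_ip applies, isolated as a standalone function. The value is packed as an unsigned
# 32-bit big-endian integer and formatted as a dotted quad. The function name is hypothetical;
# the input is assumed to be a number (or numeric string) in the range 0 to 2**32 - 1.

```python
import socket
import struct


def int_to_ipv4(value):
    """Convert a numeric IP (possibly stored as a float-like string) to dotted-quad form."""
    as_int = int(float(value))  # e.g. "3232235777.0" -> 3232235777
    return socket.inet_ntoa(struct.pack('!L', as_int))
```

# The standard library's ipaddress.IPv4Address(as_int) is another way to do the same conversion.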


class SparknovPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(SparknovPreProcessor, self).__init__(**kw)
        
    def load_data(self):
        """
        Hard coded file names for this dataset as it contains multiple files to be combined
        """

        df_train = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTrain.csv'))
        df_train['seg'] = 'train'

        df_test = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTest.csv'))
        df_test['seg'] = 'test'

        self.df = pd.concat([df_train, df_test], ignore_index=True)

        # delete downloaded data after loading in memory
        if self.delete_downloaded: shutil.rmtree(_DOWNLOAD_LOCATION)

    @staticmethod
    def _add_months(x):
        _TIMESTAMP_FORMAT_SPARKNOV = '%Y-%m-%d %H:%M:%S'

        x = datetime.strptime(x, _TIMESTAMP_FORMAT_SPARKNOV)  
        final_time = x + relativedelta(months=20) # chosen to move dates close to now()
        return final_time.strftime(_TIMESTAMP_FORMAT)    

    def standardize_timestamp_col(self):

        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: SparknovPreProcessor._add_months(x))        
        self.df.drop(self.timestamp_col, axis=1, inplace=True)
        self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP # most recent date 

    def standardize_entity_id_col(self):

        self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True)
        self.df[_ENTITY_ID] = self.df[_ENTITY_ID].\
                                str.lower().\
                                apply(lambda x: re.sub(r'[^A-Za-z0-9]+', '_', x))
        
    def train_test_split(self):
        self.train = self.df.copy()[self.df['seg'] == 'train']
        self.train.reset_index(drop=True, inplace=True)
        self.train.drop(['seg'], axis=1, inplace=True)
        
        self.test = self.df.copy()[self.df['seg'] == 'test']
        self.test.reset_index(drop=True, inplace=True)
        self.test.drop(['seg'], axis=1, inplace=True)
        self.test = self.test.sample(n=20000, random_state=1)
        
        self.test_labels = self.test[[_EVENT_LABEL]].copy()  # copy so adding a column below does not write to a slice
        if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:
            self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]
        self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore")


class TwitterbotPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(TwitterbotPreProcessor, self).__init__(**kw)

    def standardize_label_col(self):
        self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
        binary_mapper = {
            'bot': 1,
            'human': 0
        }
        
        self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper)


class IPBlocklistPreProcessor(BasePreProcessor):
    """
    The dataset source is http://cinsscore.com/list/ci-badguys.txt.
    No sign-in or sign-up is required to download the latest version of this dataset.

    Since this dataset is not version controlled at the source, we include the version used for the experiments
    discussed in the paper. The versioned dataset is as of 2022-06-07.
    The code is set to pick the fixed version. To use the latest version instead,
    turn off the 'version' argument (i.e. set it to None).
    """
    def __init__(self, version, **kw):
        self.version = version  # string or None. If string, picks one from versioned_datasets, else creates one from source  
        super(IPBlocklistPreProcessor, self).__init__(**kw)
        
    def load_data(self):
        if self.version is None:
            # load malicious IPs from the source
            _URL = 'http://cinsscore.com/list/ci-badguys.txt'  # contains confirmed malicious IPs
            _N_BENIGN = 200000
            
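            res = requests.get(_URL)
            res.raise_for_status()  # fail early instead of parsing an HTTP error page as IPs
            # The list holds one IP per line, so the default separator reads it as a single column;
            # newer pandas versions reject sep='\n'.
            ip_mal = pd.read_csv(StringIO(res.text), names=['ip'], header=None)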
            ip_mal['is_ip_malign'] = 1
            
            # add fake IPs as benign
            ip_ben = pd.DataFrame({
                'ip': [fake.ipv4() for _ in range(_N_BENIGN)],  # loop variable is unused
                'is_ip_malign': 0
            })
            
            self.df = pd.concat([ip_mal, ip_ben], axis=0, ignore_index=True)
        else:

            _VERSIONED_DATA_PATH = f'versioned_datasets/{self.key}/{self.version}.zip'
            data = pkgutil.get_data(__name__, _VERSIONED_DATA_PATH)
            with zipfile.ZipFile(BytesIO(data)) as f:
                self.train = pd.read_csv(f.open('train.csv'))
                self.test = pd.read_csv(f.open('test.csv'))
                self.test_labels = pd.read_csv(f.open('test_labels.csv'))

    def add_dummy_col(self):
        self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4())
    
    def train_test_split(self):
        if self.version is None:
            super(IPBlocklistPreProcessor, self).train_test_split()
        
    def preprocess(self):
        if self.version is None:
            super(IPBlocklistPreProcessor, self).preprocess()
            self.add_dummy_col()      
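

# --- Illustrative sketch (not part of the original module) ---
# The unversioned branch of IPBlocklistPreProcessor.load_data labels every IP in
# the downloaded blocklist as malicious (1) and pads the data with synthetic IPs
# labelled benign (0). The helper below mirrors that idea using only the standard
# library. Assumptions: `_sketch_build_ip_rows` is a hypothetical name, `random`
# and `ipaddress` stand in for Faker, and a list of dicts stands in for the
# pandas DataFrame. As in load_data, a synthetic IP is not checked against the
# blocklist, so a collision with a malicious IP is possible.
import ipaddress
import random


def _sketch_build_ip_rows(blocklist_text, n_benign, seed=0):
    rng = random.Random(seed)
    # One IP per non-empty line, as in a plain-text blocklist.
    rows = [{'ip': line.strip(), 'is_ip_malign': 1}
            for line in blocklist_text.splitlines() if line.strip()]
    # Random 32-bit integers rendered as dotted-quad IPv4 addresses.
    rows += [{'ip': str(ipaddress.IPv4Address(rng.getrandbits(32))), 'is_ip_malign': 0}
             for _ in range(n_benign)]
    return rows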


================================================
FILE: src/fdb/preprocessing_objects.py
================================================
from fdb.preprocessing import *


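def load_data(key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na):
    """
    Build the preprocessor object for the dataset identified by `key`.

    Supported keys: 'fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud',
    'fraudecom', 'sparknov', 'twitterbot', 'ipblock'. The remaining arguments are
    forwarded unchanged to the preprocessor. Raises ValueError for any other key.
    """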
    common_kw = {
        "key": key,
        "load_pre_downloaded": load_pre_downloaded,
        "delete_downloaded": delete_downloaded,
        "add_random_values_if_real_na": add_random_values_if_real_na
    }

    if key == 'fakejob':
        obj = FakejobPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col = 'fraudulent',
            event_id_col = 'job_id',
            **common_kw
        )

    elif key == 'vehicleloan':
        obj = VehicleloanPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col = 'loan_default',
            event_id_col = 'uniqueid',
            features_to_drop = ['disbursal_date'],
            **common_kw
        )

    elif key == 'malurl':
        obj = MalurlPreProcessor(
            train_percentage = 0.9,
            timestamp_col = None,
            label_col = 'type',
            event_id_col = None,
            **common_kw
        )

    elif key == 'ieeecis':
        obj = IEEEPreProcessor(
            train_percentage = 0.95,
            timestamp_col = 'transactiondt',
            label_col = 'isfraud',
            event_id_col = None,
            entity_id_col = None,  # manually created in code
            **common_kw
        )

    elif key == 'ccfraud':
        obj = CCFraudPreProcessor(
            train_percentage = 0.8,
            timestamp_col = 'time',
            label_col = 'class',
            event_id_col = None,
            **common_kw
        )

    elif key == 'fraudecom':
        obj = FraudecomPreProcessor(
            train_percentage = 0.8,
            timestamp_col = 'purchase_time',
            signup_time_col = 'signup_time',
            label_col = 'class',
            event_id_col = 'user_id',
            entity_id_col = 'device_id',
            ip_address_col = 'ip_address',
            features_to_drop = ['signup_time', 'sex'],
            **common_kw
        )

    elif key == 'sparknov':
        obj = SparknovPreProcessor(
            timestamp_col = 'trans_date_trans_time',
            label_col = 'is_fraud',
            event_id_col = 'trans_num',
            entity_id_col = 'merchant',
            features_to_drop = ['unix_time', 'unnamed: 0'],
            **common_kw
        )

    elif key == 'twitterbot':
        obj = TwitterbotPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col = 'account_type',
            event_id_col = 'id',
            **common_kw
        )

    elif key == 'ipblock':
        obj = IPBlocklistPreProcessor(
            label_col = 'is_ip_malign',
            version = '20220607',
            **common_kw
        )

    else:
        raise ValueError(f'Invalid key: {key!r}')

    return obj

================================================
FILE: src/fdb/versioned_datasets/__init__.py
================================================


================================================
FILE: src/fdb/versioned_datasets/ipblock/__init__.py
================================================