Repository: amazon-science/fraud-dataset-benchmark
Branch: main
Commit: f100cb829599
Files: 39
Total size: 623.7 KB

Directory structure:
gitextract_sn16q5ml/

├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── scripts/
│   ├── examples/
│   │   └── Test_FDB_Loader.ipynb
│   └── reproducibility/
│       ├── afd/
│       │   ├── README.md
│       │   ├── configs/
│       │   │   ├── CreditCardFraudDetection.json
│       │   │   ├── FakeJobPostingPrediction.json
│       │   │   ├── Fraudecommerce.json
│       │   │   ├── IEEECISFraudDetection.json
│       │   │   ├── IPBlocklist.json
│       │   │   ├── MaliciousURL.json
│       │   │   ├── SimulatedCreditCardTransactionsSparkov.json
│       │   │   ├── TwitterBotAccounts.json
│       │   │   └── VehicleLoanDefaultPrediction.json
│       │   ├── create_afd_resources.py
│       │   └── score_afd_model.py
│       ├── autogluon/
│       │   ├── README.md
│       │   ├── benchmark_ag.py
│       │   └── example-ag-ieeecis.ipynb
│       ├── autosklearn/
│       │   ├── README.md
│       │   └── benchmark_autosklearn.py
│       ├── benchmark_utils.py
│       ├── h2o/
│       │   ├── README.md
│       │   ├── benchmark_h2o.py
│       │   └── example-h2o-ieeecis.ipynb
│       └── label-noise/
│           ├── benchmark_experiments.ipynb
│           ├── feature_dict.py
│           ├── load_fdb_datasets.py
│           └── micro_models.py
├── setup.py
└── src/
    ├── __init__.py
    └── fdb/
        ├── __init__.py
        ├── datasets.py
        ├── kaggle_configs.py
        ├── preprocessing.py
        ├── preprocessing_objects.py
        └── versioned_datasets/
            ├── __init__.py
            └── ipblock/
                └── __init__.py

================================================
FILE CONTENTS
================================================

================================================
FILE: CODE_OF_CONDUCT.md
================================================
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.


## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2021-2022 Prince Grover
Copyright (c) 2021-2022 Zheng Li
Copyright (c) 2022 Jianbo Liu
Copyright (c) 2022 Jakub Zablocki
Copyright (c) 2022 Jianbo Liu
Copyright (c) 2022 Hao Zhou
Copyright (c) 2022 Julia Xu
Copyright (c) 2022 Anqi Cheng

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# FDB: Fraud Dataset Benchmark

*By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin Tittelfitz](jtittelfitz), Anqi Cheng, [Jakub Zablocki](qbaza), Jianbo Liu, and [Hao Zhou](haozhouamzn)*


[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 


The **Fraud Dataset Benchmark (FDB)** is a compilation of publicly available datasets relevant to **fraud detection** ([arXiv Link](https://arxiv.org/abs/2208.14417)). The FDB aims to cover a wide variety of fraud detection tasks, ranging from card-not-present transaction fraud, bot attacks, malicious traffic and loan risk to content moderation. The Python-based data loaders from FDB provide dataset loading, standardized train-test splits and performance evaluation metrics. The goal of our work is to provide researchers working in the field of fraud and abuse detection with a standardized set of benchmarking datasets and evaluation tools for their experiments. Using FDB tools, we demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment and semi-supervised learning.


## Datasets used in FDB
A brief summary of the datasets used in FDB. Each dataset is described in detail in the [data sources section](#data-sources).

| **#** | **Dataset name**                                           | **Dataset key** | **Fraud category**                  | **#Train** | **#Test** | **Class ratio (train)** | **#Feats** | **#Cat** | **#Num** | **#Text** | **#Enrichable** |
|-------|------------------------------------------------------------|-----------------|-------------------------------------|------------|-----------|-------------------------|------------|----------|----------|-----------|-----------------|
| 1     | IEEE-CIS Fraud Detection                                   | ieeecis         | Card Not Present Transactions Fraud | 561,013    | 28,527    | 3.50%                   | 67         | 6        | 61       | 0         | 0               |
| 2     | Credit Card Fraud Detection                                | ccfraud         | Card Not Present Transactions Fraud | 227,845    | 56,962    | 0.18%                   | 28         | 0        | 28       | 0         | 0               |
| 3     | Fraud ecommerce                                            | fraudecom       | Card Not Present Transactions Fraud | 120,889    | 30,223    | 10.60%                  | 6          | 2        | 3        | 0         | 1               |
| 4     | Simulated Credit Card Transactions generated using Sparkov | sparknov        | Card Not Present Transactions Fraud | 1,296,675  | 20,000    | 5.70%                   | 17         | 10       | 6        | 1         | 0               |
| 5     | Twitter Bots Accounts                                      | twitterbot      | Bot Attacks                         | 29,950     | 7,488     | 33.10%                  | 16         | 6        | 6        | 4         | 0               |
| 6     | Malicious URLs dataset                                     | malurl          | Malicious Traffic                  | 586,072   | 65,119    | 34.20%                  | 2          | 0        | 1        | 1         | 0               |
| 7     | Fake Job Posting Prediction                                | fakejob         | Content Moderation                  | 14,304     | 3,576     | 4.70%                   | 16         | 10       | 1        | 5         | 0               |
| 8     | Vehicle Loan Default Prediction                            | vehicleloan    | Credit Risk                         | 186,523    | 46,631    | 21.60%                  | 38         | 13       | 22       | 3         | 0               |
| 9     | IP Blocklist                                               | ipblock         | Malicious Traffic                   | 172,000    | 43,000    | 7%                      | 1          | 0        | 0        | 0         | 1               |


## Installation

### Requirements
- Kaggle account
    - **Important**: The `ieeecis` dataset requires you to [**join the IEEE-CIS competition**](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call the fdb API. Otherwise you will get <span style="color:red">ApiException: (403)</span>.
- AWS account
- Python 3.7+ 

- Python requirements
```
autogluon==0.4.2
h2o==3.36.1.2
boto3==1.20.21
click==8.0.3
click-plugins==1.1.1
Faker==4.14.2
joblib==1.0.0
kaggle==1.5.12
numpy==1.19.5
pandas==1.1.2
regex==2020.7.14
scikit-learn==0.22.1
scipy==1.5.4
auto-sklearn==0.14.7
dask==2022.8.1
```

### Step 1: Setup Kaggle CLI
The `FraudDatasetBenchmark` object loads datasets from the source (which in most cases is Kaggle), modifies and standardizes them on the fly, and provides train-test splits. So, the first step is to set up the Kaggle CLI on the machine used to run Python.

Use the instructions from the [How to Use Kaggle](https://www.kaggle.com/docs/api) guide. The steps include:

Remember to download the authentication token from "My Account" on Kaggle, and save the token at `~/.kaggle/kaggle.json` on Linux and OSX, or at `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised. Hence, once you've downloaded the token, you should move it from your Downloads folder to this folder.
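
Before calling the loaders, it can help to confirm that the token file is where the Kaggle CLI expects it. The following is a minimal sketch, not part of FDB: the helper name `check_kaggle_token` is hypothetical, and it assumes the token is a JSON file with `username` and `key` entries stored at the default location described above.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(token_path=None):
    """Return a list of problems found with the Kaggle API token file.

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these basic checks.
    """
    # Default location described in the setup step above.
    path = Path(token_path) if token_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"token file not found at {path}"]

    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumption: the token holds a "username" and a "key" entry.
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing '{field}' in {path}")

    # On Linux/macOS, flag tokens that other users on the machine can read.
    if os.name == "posix" and stat.S_IMODE(path.stat().st_mode) & 0o077:
        problems.append(f"{path} is readable by other users; consider chmod 600")
    return problems
```

Calling `check_kaggle_token()` with no argument inspects the default location; passing a path is useful when the token is kept elsewhere.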
  
    
#### Step 1.2. [Join the IEEE-CIS competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call `fdb.datasets` with `ieeecis`. Otherwise you will get <span style="color:red">ApiException: (403)</span>.
  
  
### Step 2: Clone Repo
Once the Kaggle CLI is installed and set up, clone the GitHub repo using `git clone https://github.com/amazon-research/fraud-dataset-benchmark.git` if using HTTPS, or `git clone git@github.com:amazon-research/fraud-dataset-benchmark.git` if using SSH.

### Step 3: Install
Once repo is cloned, from your terminal, `cd` to the repo and type `pip install .`, which will install the required classes and methods.


## FraudDatasetBenchmark Usage
Usage is straightforward: create a `dataset` object of the `FraudDatasetBenchmark` class, then extract useful goodies like train/test splits and eval_metrics.

**Important note**: If you are running multiple experiments that reload dataframes multiple times, the default setting of downloading from Kaggle before loading into a dataframe may exceed the account-level API limits. In that case, persist the downloaded dataset and then load from the persisted data. During the first call of `FraudDatasetBenchmark()`, use `load_pre_downloaded=False, delete_downloaded=False`, and for subsequent calls, use `load_pre_downloaded=True, delete_downloaded=False`. The default setting is
`load_pre_downloaded=False, delete_downloaded=True`.
```
# display() is built into notebooks; import it explicitly so the snippet also runs as a script
from IPython.display import display
from fdb.datasets import FraudDatasetBenchmark

# all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'twitterbot', 'ipblock'] 
key = 'ipblock'

obj = FraudDatasetBenchmark(
    key=key,
    load_pre_downloaded=False,  # default
    delete_downloaded=True,  # default
    add_random_values_if_real_na = { 
        "EVENT_TIMESTAMP": True, 
        "LABEL_TIMESTAMP": True,
        "ENTITY_ID": True,
        "ENTITY_TYPE": True,
        "EVENT_ID": True
        } # default
    )
print(obj.key)

# Training split: features plus EVENT_LABEL
print('Train set: ')
display(obj.train.head())
print(len(obj.train.columns))
print(obj.train.shape)

# Test split
print('Test set: ')
display(obj.test.head())
print(obj.test.shape)

# Ground-truth labels for the test split, and the class balance of both splits
print('Test labels: ')
display(obj.test_labels.head())
print(obj.test_labels['EVENT_LABEL'].value_counts())
print(obj.train['EVENT_LABEL'].value_counts(normalize=True))
print('=========')

``` 
A notebook template for loading datasets with the FDB data loader is available at [scripts/examples/Test_FDB_Loader.ipynb](scripts/examples/Test_FDB_Loader.ipynb).
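
The persistence pattern from the important note above (download once, then reuse the persisted copy) can be sketched as a small wrapper. This is an illustrative sketch only: `load_repeatedly` is a hypothetical helper, and the loader class is passed in as an argument so the pattern can be tried without Kaggle access. Only the `key`, `load_pre_downloaded` and `delete_downloaded` parameters come from the README.

```python
def load_repeatedly(loader_cls, key, n_runs):
    """Build the same dataset n_runs times, downloading only on the first run.

    loader_cls is expected to accept the keyword arguments documented for
    FraudDatasetBenchmark: key, load_pre_downloaded, delete_downloaded.
    """
    datasets = []
    for run in range(n_runs):
        datasets.append(
            loader_cls(
                key=key,
                # First run downloads from the source; later runs reuse the files.
                load_pre_downloaded=(run > 0),
                # Never delete, so the persisted copy stays available.
                delete_downloaded=False,
            )
        )
    return datasets


# Intended usage (requires fdb to be installed and Kaggle credentials set up):
# from fdb.datasets import FraudDatasetBenchmark
# runs = load_repeatedly(FraudDatasetBenchmark, 'ipblock', n_runs=3)
```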

## Reproducibility
Reproducibility scripts are available at [scripts/reproducibility/](scripts/reproducibility/) in respective folders for [afd](scripts/reproducibility/afd), [autogluon](scripts/reproducibility/autogluon) and [h2o](scripts/reproducibility/h2o). Each folder also has a README with steps to reproduce.


## Benchmark Results

<!-- | **Dataset key** | **AUC-ROC** |             |               |                  |                  | **Recall at 1% FPR** |             |               |                  |                  |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|:--------------------:|:-----------:|:-------------:|:----------------:|:----------------:|
|                 | **AFD OFI** | **AFD TFI** | **AutoGluon** |      **H2O**     | **Auto-sklearn** |      **AFD OFI**     | **AFD TFI** | **AutoGluon** |      **H2O**     | **Auto-sklearn** |
|     ccfraud     |    0.985    |     0.99    |      0.99     |     **0.992**    |       0.988      |         0.88         |     0.88    |      0.88     |       0.853      |       0.88       |
|     fakejob     |    0.987    |      -      |   **0.998**   |       0.99       |       0.983      |         0.786        |      -      |     0.925     |       0.781      |       0.781      |
|    fraudecom    |    0.519    |  **0.636**  |     0.522     |       0.518      |       0.515      |         0.011        |    0.099    |     0.012     |       0.009      |       0.012      |
|     ieeecis     |    0.938    |   **0.94**  |     0.855     |       0.89       |       0.932      |         0.587        |     0.56    |     0.425     |       0.442      |       0.569      |
|      malurl     |    0.985    |      -      |   **0.998**   | Training failure |        0.5       |         0.868        |      -      |     0.976     | Training failure |       0.01       |
|     sparknov    |  **0.998**  |      -      |     0.997     |       0.997      |       0.995      |           1          |      -      |     0.927     |       0.896      |       0.868      |
|    twitterbot   |    0.934    |      -      |   **0.943**   |       0.938      |       0.936      |         0.518        |      -      |     0.419     |       0.382      |       0.369      |
|   vehicleloan   |  **0.673**  |      -      |     0.669     |       0.67       |       0.664      |         0.036        |      -      |      0.04     |       0.037      |       0.035      |
|     ipblock     |  **0.937**  |      -      |     0.804     | Training failure |        0.5       |         0.466        |      -      |      0.32     | Training failure |       0.01       | -->

| **Dataset key** | **AUC-ROC** |             |               |                  |                  |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|
|                 | **AFD OFI** | **AFD TFI** | **AutoGluon** |      **H2O**     | **Auto-sklearn** |
|     ccfraud     |    0.985    |     0.99    |      0.99     |     **0.992**    |       0.988      |
|     fakejob     |    0.987    |      -      |   **0.998**   |       0.99       |       0.983      |
|    fraudecom    |    0.519    |  **0.636**  |     0.522     |       0.518      |       0.515      |
|     ieeecis     |    0.938    |   **0.94**  |     0.855     |       0.89       |       0.932      |
|      malurl     |    0.985    |      -      |   **0.998**   | Training failure |        0.5       |
|     sparknov    |  **0.998**  |      -      |     0.997     |       0.997      |       0.995      |
|    twitterbot   |    0.934    |      -      |   **0.943**   |       0.938      |       0.936      |
|   vehicleloan   |  **0.673**  |      -      |     0.669     |       0.67       |       0.664      |
|     ipblock     |  **0.937**  |      -      |     0.804     | Training failure |        0.5       |
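
For readers less familiar with the metric in the table, AUC-ROC can be read as the probability that a randomly chosen fraudulent event receives a higher score than a randomly chosen legitimate one. The sketch below is a dependency-free illustration of that definition, not the evaluation code used for the benchmark; the function name `auc_roc` is an assumption, and in practice a library routine such as scikit-learn's `roc_auc_score` is the more efficient choice.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 values (1 = fraud); scores: model scores.
    Ties between a positive and a negative score count as half a win.
    Runs in O(n_pos * n_neg), which is fine for small illustrations only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auc_roc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5 (constant scores)
```

A model that outputs the same score for every event scores 0.5 under this definition, which is the chance-level baseline.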

### ROC Curves

The numbers in the legend represent AUC-ROC from different models from our baseline evaluations on AutoML.  
![roc curves](images/all_fdb.png)


## Data Sources


1. **IEEE-CIS Fraud Detection**
    - Source URL: https://www.kaggle.com/c/ieee-fraud-detection/overview
    - Source license: https://www.kaggle.com/competitions/ieee-fraud-detection/rules
    - Variables: Anonymized product, card, address, email domain, device, transaction date information. Numeric columns with name prefixes as V, C, D and M, and meaning hidden from public.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Vesta Corporation](https://www.vesta.io/)
    - Release date: 2019-10-03
    - Description: Prepared by the IEEE Computational Intelligence Society, this card-not-present transaction fraud dataset was launched during the IEEE-CIS Fraud Detection Kaggle competition and was provided by Vesta Corporation. The original dataset contains 393 features, which are reduced to 67 features in the benchmark. Feature selection was performed based on highly voted Kaggle kernels. The fraud rate in the training segment of the source dataset is 3.5%. We only used the training files (train transaction and train identity), containing 590,540 transactions, in the benchmark, and split them into train (95%) and test (5%) segments based on time. Based on the insights from a Kaggle kernel written by the competition winner, we added a UUID (called ENTITY_ID) that represents a fingerprint and was created using card, address, time and D1 features.

2. **Credit Card Fraud Detection**
    - Source URL: https://www.kaggle.com/mlg-ulb/creditcardfraud/
    - Source license: https://opendatacommons.org/licenses/dbcl/1-0/
    - Variables: PCA transformed features, time, amount (highly imbalanced)
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Machine Learning Group - ULB](https://mlg.ulb.ac.be/)
    - Release date: 2018-03-23
    - Description: This dataset contains anonymized credit card transactions by European cardholders in September 2013. The dataset contains 492 frauds out of 284,807 transactions over 2 days. Data only contains numerical features that are the result of a PCA transformation, plus non transformed time and amount.

3. **Fraud ecommerce**
    - Source URL: https://www.kaggle.com/vbinh002/fraud-ecommerce
    - Source license: None
    - Variables: The features include sign up time, purchase time, purchase value, device id, user id, browser, and IP address. We added a new feature that measured the time difference between sign up and purchase, as the age of an account is often an important variable in fraud detection.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Binh Vu](https://www.kaggle.com/vbinh002) 
    - Release date: 2018-12-09
    - Description: This dataset contains ~150k e-commerce transactions.

4. **Simulated Credit Card Transactions generated using Sparkov**
    - Source URL: https://www.kaggle.com/kartik2112/fraud-detection
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Transaction date, credit card number, merchant, category, amount, name, street, gender. All variables are synthetically generated using the Sparkov tool.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Kartik Shenoy](https://www.kaggle.com/kartik2112)
    - Release date: 2020-08-05
    - Description: This is a simulated credit card transaction dataset. The dataset was generated using the Sparkov Data Generation tool, and we modified a version of the dataset created for Kaggle. It covers transactions of 1000 customers with a pool of 800 merchants over 6 months. We used both train and test segments directly from the source and randomly downsampled the test segment.

5. **Twitter Bots Accounts**
    - Source URL: https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data?select=twitter_human_bots_dataset.csv
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Features like account creation date, follower and following counts, profile description, account age, meta data about profile picture and account activity, and a label indicating whether the account is human or bot.
    - Fraud category: Bot Attacks
    - Provider: [David Martín Gutiérrez](https://www.kaggle.com/davidmartngutirrez)
    - Release date: 2020-08-20
    - Description: The dataset consists of 37,438 rows corresponding to different user accounts from Twitter.

6. **Malicious URLs dataset**
    - Source URL: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: The Kaggle dataset is curated using five different sources, and contains url and type. Even though original dataset has multiclass label (type), we converted it into binary label.
    - Fraud category: Malicious Traffic
    - Provider: [Manu Siddhartha](https://www.kaggle.com/sid321axn) 
    - Release date: 2021-07-23
    - Description: The Kaggle dataset is curated using five different sources, and contains url and type. Even though original dataset has multiclass label (type), we converted it into binary label. There is no timestamp information from the source. Therefore, we generate a dummy timestamp column for consistency.

7. **Real / Fake Job Posting Prediction**
    - Source URL: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Title, location, department, company, salary range, requirements, description, benefits, telecommuting. Most of the variables are categorical and free form text in nature.
    - Fraud category: Content Moderation
    - Provider: [Shivam Bansal](https://www.kaggle.com/shivamb) 
    - Release date: 2020-02-29
    - Description: This Kaggle dataset contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The task is to train a classification model to detect which job posts are fraudulent.

8. **Vehicle Loan Default Prediction**
    - Source URL: https://www.kaggle.com/avikpaul4u/vehicle-loan-default-prediction
    - Source license: Unknown
    - Variables: Loanee information, loan information, credit bureau data, and history.
    - Fraud category: Credit Risk
    - Provider: [Avik Paul](https://www.kaggle.com/avikpaul4u) 
    - Release date: 2019-11-12
    - Description: The task in this dataset is to determine the probability of vehicle loan default, particularly the risk of default on the first monthly installments. It contains data for 233k loans with 21.7% default rate.
    
9. **IP Blocklist**
    - Source URL: http://cinsscore.com/list/ci-badguys.txt
    - Source license: Unknown
    - Variables: The dataset contains an IP address and a label indicating whether the address is malicious or fake (randomly generated). A dummy categorical variable that has no relation to the label is added.
    - Fraud category: Malicious Traffic
    - Provider: [CINSscore.com](http://cinsscore.com)
    - Release date: 2017-09-25
    - Description: This dataset is made up of malicious IP addresses from cinsscore.com. To the list of malicious IP addresses, we added randomly generated IP addresses, created using Faker and labeled as benign.
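
Two of the preparation steps described above, the time-based 95%/5% split for IEEE-CIS and the sign-up-to-purchase time difference for Fraud ecommerce, can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the FDB preprocessing code: the function names, the dictionary-based rows and the timestamp format are all hypothetical.

```python
from datetime import datetime


def time_based_split(rows, timestamp_key, train_fraction=0.95):
    """Split rows so that the most recent events form the test segment.

    rows: list of dicts; timestamp_key: name of a sortable timestamp field.
    """
    ordered = sorted(rows, key=lambda row: row[timestamp_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def seconds_since_signup(signup_time, purchase_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Account age at purchase in seconds (the timestamp format is an assumption)."""
    delta = datetime.strptime(purchase_time, fmt) - datetime.strptime(signup_time, fmt)
    return delta.total_seconds()


# Toy data: 100 events with increasing timestamps.
events = [{"ts": i, "amount": 10.0 * i} for i in range(100)]
train, test = time_based_split(events, "ts")
print(len(train), len(test))  # 95 5
print(seconds_since_signup("2015-01-01 00:00:00", "2015-01-01 00:01:00"))  # 60.0
```

Splitting on time rather than at random keeps future events out of the training segment, which mirrors how a fraud model is used in production.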
    

## Citation
```
@misc{grover2023fraud,
      title={Fraud Dataset Benchmark and Applications}, 
      author={Prince Grover and Julia Xu and Justin Tittelfitz and Anqi Cheng and Zheng Li and Jakub Zablocki and Jianbo Liu and Hao Zhou},
      year={2023},
      eprint={2208.14417},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

## License
This project is licensed under the MIT-0 License.


## Acknowledgement
We thank the creators of all datasets used in the benchmark and the organizations that have helped in hosting the datasets and making them widely available for research purposes.







================================================
FILE: scripts/examples/Test_FDB_Loader.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('../../src/')\n",
    "from fdb.datasets import FraudDatasetBenchmark\n",
    "from fdb.kaggle_configs import KAGGLE_CONFIGS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>.container { width:90% }</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Notebook setups\n",
    "\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from io import StringIO\n",
    "\n",
    "from IPython.display import display, HTML\n",
    "from IPython.display import clear_output\n",
    "display(HTML(\"<style>.container { width:90% }</style>\"))\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_colwidth', 200)\n",
    "pd.set_option('display.max_rows', 500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil\n",
    "\n",
    "if os.path.exists('tmp'):\n",
    "    shutil.rmtree('tmp')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNCOMMENT IF YOU NEED TO UPLOAD DATA TO AN S3 BUCKET IN YOUR ACCOUNT\n",
    "\n",
    "# import boto3\n",
    "# BUCKET='<ADD S3 BUCKET NAME IF YOU WANT TO UPLOAD DATA TO YOUR ACCOUNT>'\n",
    "\n",
    "# def _s3_upload(df):\n",
    "#     csv_memory=StringIO()\n",
    "#     df.to_csv(csv_memory, index=False)\n",
    "#     content = csv_memory.getvalue()\n",
    "#     s3_client.put_object(\n",
    "#         Body=content,\n",
    "#         Bucket=BUCKET,\n",
    "#         Key=KEY,\n",
    "#        ACL='bucket-owner-full-control')\n",
    "# s3_client = boto3.client('s3')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# All options for keys\n",
    "all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud','fraudecom', 'twitterbot', 'ipblock']\n",
    "# all_keys = ['ipblock']\n",
    "# all_keys = ['twitterbot']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Default setting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default setting pulls data from the source onto your system, modifies the data, and adds random values for columns that are missing if the add_random_values_if_real_na flags are True.\n",
    "\n",
    "Default parameters: \n",
    "- load_pre_downloaded: False\n",
    "- delete_downloaded: True\n",
    "- add_random_values_if_real_na = ```\n",
    "{\n",
    "\"EVENT_TIMESTAMP\": True,\n",
    "\"LABEL_TIMESTAMP\": True,\n",
    "\"ENTITY_ID\": True,\n",
    "\"ENTITY_TYPE\": True,\n",
    "\"EVENT_ID\": True\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "fakejob\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>title</th>\n",
       "      <th>location</th>\n",
       "      <th>department</th>\n",
       "      <th>salary_range</th>\n",
       "      <th>company_profile</th>\n",
       "      <th>description</th>\n",
       "      <th>requirements</th>\n",
       "      <th>benefits</th>\n",
       "      <th>telecommuting</th>\n",
       "      <th>has_company_logo</th>\n",
       "      <th>has_questions</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>required_experience</th>\n",
       "      <th>required_education</th>\n",
       "      <th>industry</th>\n",
       "      <th>function</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5736</th>\n",
       "      <td>5737</td>\n",
       "      <td>Jr. Business Analyst &amp; Quality Analyst (entry level)</td>\n",
       "      <td>US, NJ, PISCATAWAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial &amp;amp; Health care clients.Candidate should have knowledge or experience in ...</td>\n",
       "      <td>What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Entry level</td>\n",
       "      <td>Master's Degree</td>\n",
       "      <td>Financial Services</td>\n",
       "      <td>Finance</td>\n",
       "      <td>0</td>\n",
       "      <td>382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b</td>\n",
       "      <td>2022-12-13T13:05:21Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7106</th>\n",
       "      <td>7107</td>\n",
       "      <td>English Teacher Abroad</td>\n",
       "      <td>US, PA, Scranton</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>We help teachers get safe &amp;amp; secure jobs abroad :)</td>\n",
       "      <td>Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr...</td>\n",
       "      <td>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only</td>\n",
       "      <td>See job description</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Contract</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Education Management</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>deadb697-08d2-4dca-83ec-a15d5e501a5b</td>\n",
       "      <td>2022-07-26T01:40:53Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11978</th>\n",
       "      <td>11979</td>\n",
       "      <td>SQL Server Database Developer Job opportunity at Barrington, IL</td>\n",
       "      <td>US, IL, Barrington</td>\n",
       "      <td>NaN</td>\n",
       "      <td>90000-100000</td>\n",
       "      <td>We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc...</td>\n",
       "      <td>Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...</td>\n",
       "      <td>Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...</td>\n",
       "      <td>Benefits - FullBonus Eligible - Yes</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Mid-Senior level</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Information Technology and Services</td>\n",
       "      <td>Information Technology</td>\n",
       "      <td>0</td>\n",
       "      <td>f5fcea87-6798-4529-a6c7-205d893b9b24</td>\n",
       "      <td>2023-03-09T13:06:59Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9374</th>\n",
       "      <td>9375</td>\n",
       "      <td>Legal Analyst - 12 Month FTC</td>\n",
       "      <td>GB, LND, London</td>\n",
       "      <td>Legal</td>\n",
       "      <td>NaN</td>\n",
       "      <td>MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo...</td>\n",
       "      <td>DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. We are seeking to bring much-needed innovation to the banki...</td>\n",
       "      <td>Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col...</td>\n",
       "      <td>Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible.</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Associate</td>\n",
       "      <td>Professional</td>\n",
       "      <td>Financial Services</td>\n",
       "      <td>Legal</td>\n",
       "      <td>0</td>\n",
       "      <td>114fbd01-0573-42cf-9365-78729264e1aa</td>\n",
       "      <td>2022-12-09T08:17:07Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1300</th>\n",
       "      <td>1301</td>\n",
       "      <td>Part-Time Finance Assistant</td>\n",
       "      <td>GB, LND,</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a...</td>\n",
       "      <td>Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec...</td>\n",
       "      <td>Salary:£9 - £10 per hour</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Part-time</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Accounting</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>05a5dbdb-9778-4e4a-b967-7850dd483a54</td>\n",
       "      <td>2022-08-28T17:32:28Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      EVENT_ID  \\\n",
       "5736      5737   \n",
       "7106      7107   \n",
       "11978    11979   \n",
       "9374      9375   \n",
       "1300      1301   \n",
       "\n",
       "                                                                 title  \\\n",
       "5736              Jr. Business Analyst & Quality Analyst (entry level)   \n",
       "7106                                           English Teacher Abroad    \n",
       "11978  SQL Server Database Developer Job opportunity at Barrington, IL   \n",
       "9374                                      Legal Analyst - 12 Month FTC   \n",
       "1300                                       Part-Time Finance Assistant   \n",
       "\n",
       "                 location department  salary_range  \\\n",
       "5736   US, NJ, PISCATAWAY        NaN           NaN   \n",
       "7106    US, PA, Scranton         NaN           NaN   \n",
       "11978  US, IL, Barrington        NaN  90000-100000   \n",
       "9374      GB, LND, London      Legal           NaN   \n",
       "1300            GB, LND,         NaN           NaN   \n",
       "\n",
       "                                                                                                                                                                                               company_profile  \\\n",
       "5736                                                                                                                                                                                                       NaN   \n",
       "7106                                                                                                                                                     We help teachers get safe &amp; secure jobs abroad :)   \n",
       "11978  We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc...   \n",
       "9374   MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo...   \n",
       "1300                                                                                                                                                                                                       NaN   \n",
       "\n",
       "                                                                                                                                                                                                   description  \\\n",
       "5736   Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial &amp; Health care clients.Candidate should have knowledge or experience in ...   \n",
       "7106   Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr...   \n",
       "11978  Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...   \n",
       "9374   DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. We are seeking to bring much-needed innovation to the banki...   \n",
       "1300   Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a...   \n",
       "\n",
       "                                                                                                                                                                                                  requirements  \\\n",
       "5736   What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad...   \n",
       "7106                                                                        University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only   \n",
       "11978  Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...   \n",
       "9374   Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col...   \n",
       "1300   Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec...   \n",
       "\n",
       "                                                                                                                                                                                                benefits  \\\n",
       "5736                                                                                                                                                                                                 NaN   \n",
       "7106                                                                                                                                                                                 See job description   \n",
       "11978                                                                                                                                                                Benefits - FullBonus Eligible - Yes   \n",
       "9374   Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible.    \n",
       "1300                                                                                                                                                                           Salary:£9 - £10 per hour    \n",
       "\n",
       "      telecommuting has_company_logo has_questions employment_type  \\\n",
       "5736              0                0             0       Full-time   \n",
       "7106              0                1             1        Contract   \n",
       "11978             0                0             0       Full-time   \n",
       "9374              0                1             0       Full-time   \n",
       "1300              0                0             0       Part-time   \n",
       "\n",
       "      required_experience required_education  \\\n",
       "5736          Entry level    Master's Degree   \n",
       "7106                  NaN  Bachelor's Degree   \n",
       "11978    Mid-Senior level  Bachelor's Degree   \n",
       "9374            Associate       Professional   \n",
       "1300                  NaN                NaN   \n",
       "\n",
       "                                  industry                function  \\\n",
       "5736                    Financial Services                 Finance   \n",
       "7106                  Education Management                     NaN   \n",
       "11978  Information Technology and Services  Information Technology   \n",
       "9374                    Financial Services                   Legal   \n",
       "1300                            Accounting                     NaN   \n",
       "\n",
       "       EVENT_LABEL                             ENTITY_ID  \\\n",
       "5736             0  382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b   \n",
       "7106             0  deadb697-08d2-4dca-83ec-a15d5e501a5b   \n",
       "11978            0  f5fcea87-6798-4529-a6c7-205d893b9b24   \n",
       "9374             0  114fbd01-0573-42cf-9365-78729264e1aa   \n",
       "1300             0  05a5dbdb-9778-4e4a-b967-7850dd483a54   \n",
       "\n",
       "            EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "5736   2022-12-13T13:05:21Z  2023-05-05T08:46:09Z        user  \n",
       "7106   2022-07-26T01:40:53Z  2023-05-05T08:46:09Z        user  \n",
       "11978  2023-03-09T13:06:59Z  2023-05-05T08:46:09Z        user  \n",
       "9374   2022-12-09T08:17:07Z  2023-05-05T08:46:09Z        user  \n",
       "1300   2022-08-28T17:32:28Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "22\n",
      "(14304, 22)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>title</th>\n",
       "      <th>location</th>\n",
       "      <th>department</th>\n",
       "      <th>salary_range</th>\n",
       "      <th>company_profile</th>\n",
       "      <th>description</th>\n",
       "      <th>requirements</th>\n",
       "      <th>benefits</th>\n",
       "      <th>telecommuting</th>\n",
       "      <th>has_company_logo</th>\n",
       "      <th>has_questions</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>required_experience</th>\n",
       "      <th>required_education</th>\n",
       "      <th>industry</th>\n",
       "      <th>function</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>Customer Service Associate - Part Time</td>\n",
       "      <td>US, AZ, Phoenix</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr...</td>\n",
       "      <td>The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai...</td>\n",
       "      <td>Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Part-time</td>\n",
       "      <td>Entry level</td>\n",
       "      <td>High School or equivalent</td>\n",
       "      <td>Financial Services</td>\n",
       "      <td>Customer Service</td>\n",
       "      <td>1743dd4b-f989-4227-8480-cbafa760b4de</td>\n",
       "      <td>2022-12-31T18:14:06Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15</td>\n",
       "      <td>Account Executive - Sydney</td>\n",
       "      <td>AU, NSW, Sydney</td>\n",
       "      <td>Sales</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ...</td>\n",
       "      <td>Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th...</td>\n",
       "      <td>You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication...</td>\n",
       "      <td>In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Associate</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Internet</td>\n",
       "      <td>Sales</td>\n",
       "      <td>d5a82588-fcff-495b-aeda-20a8de0737d0</td>\n",
       "      <td>2022-06-20T15:25:47Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>16</td>\n",
       "      <td>VP of Sales - Vault Dragon</td>\n",
       "      <td>SG, 01, Singapore</td>\n",
       "      <td>Sales</td>\n",
       "      <td>120000-150000</td>\n",
       "      <td>Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ...</td>\n",
       "      <td>About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count...</td>\n",
       "      <td>Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ...</td>\n",
       "      <td>Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Full-time</td>\n",
       "      <td>Executive</td>\n",
       "      <td>Bachelor's Degree</td>\n",
       "      <td>Facilities Services</td>\n",
       "      <td>Sales</td>\n",
       "      <td>298d3508-76bb-4362-9ad4-f843fa3f99fa</td>\n",
       "      <td>2022-10-30T20:49:56Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>19</td>\n",
       "      <td>Visual Designer</td>\n",
       "      <td>US, NY, New York</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer...</td>\n",
       "      <td>Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>cad2f705-4b22-4110-bb06-b34a47c62a6d</td>\n",
       "      <td>2022-05-30T19:30:26Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>21</td>\n",
       "      <td>Marketing Assistant</td>\n",
       "      <td>US, TX, Austin</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ...</td>\n",
       "      <td>IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential...</td>\n",
       "      <td>Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Marketing</td>\n",
       "      <td>24c31ad9-95a9-479c-87c5-de6af06ddef6</td>\n",
       "      <td>2022-12-05T07:48:39Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  EVENT_ID                                    title           location  \\\n",
       "0       10  Customer Service Associate - Part Time     US, AZ, Phoenix   \n",
       "1       15               Account Executive - Sydney    AU, NSW, Sydney   \n",
       "2       16               VP of Sales - Vault Dragon  SG, 01, Singapore   \n",
       "3       19                          Visual Designer   US, NY, New York   \n",
       "4       21                      Marketing Assistant     US, TX, Austin   \n",
       "\n",
       "  department   salary_range  \\\n",
       "0        NaN            NaN   \n",
       "1      Sales            NaN   \n",
       "2      Sales  120000-150000   \n",
       "3        NaN            NaN   \n",
       "4        NaN            NaN   \n",
       "\n",
       "                                                                                                                                                                                           company_profile  \\\n",
       "0  Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr...   \n",
       "1  Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ...   \n",
       "2  Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ...   \n",
       "3  Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer...   \n",
       "4  IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ...   \n",
       "\n",
       "                                                                                                                                                                                               description  \\\n",
       "0  The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai...   \n",
       "1  Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th...   \n",
       "2  About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count...   \n",
       "3  Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ...   \n",
       "4  IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential...   \n",
       "\n",
       "                                                                                                                                                                                              requirements  \\\n",
       "0  Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre...   \n",
       "1  You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication...   \n",
       "2  Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ...   \n",
       "3                                                                                                                                                                                                      NaN   \n",
       "4  Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website...   \n",
       "\n",
       "                                                                                                                                                                                                  benefits  \\\n",
       "0                                                                                                                                                                                                      NaN   \n",
       "1  In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ...   \n",
       "2  Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa...   \n",
       "3                                                                                                                                                                                                      NaN   \n",
       "4                                                                                                                                                                                                      NaN   \n",
       "\n",
       "  telecommuting has_company_logo has_questions employment_type  \\\n",
       "0             0                1             0       Part-time   \n",
       "1             0                1             0       Full-time   \n",
       "2             0                1             1       Full-time   \n",
       "3             0                1             0             NaN   \n",
       "4             0                1             0             NaN   \n",
       "\n",
       "  required_experience         required_education             industry  \\\n",
       "0         Entry level  High School or equivalent   Financial Services   \n",
       "1           Associate          Bachelor's Degree             Internet   \n",
       "2           Executive          Bachelor's Degree  Facilities Services   \n",
       "3                 NaN                        NaN                  NaN   \n",
       "4                 NaN                        NaN                  NaN   \n",
       "\n",
       "           function                             ENTITY_ID  \\\n",
       "0  Customer Service  1743dd4b-f989-4227-8480-cbafa760b4de   \n",
       "1             Sales  d5a82588-fcff-495b-aeda-20a8de0737d0   \n",
       "2             Sales  298d3508-76bb-4362-9ad4-f843fa3f99fa   \n",
       "3               NaN  cad2f705-4b22-4110-bb06-b34a47c62a6d   \n",
       "4         Marketing  24c31ad9-95a9-479c-87c5-de6af06ddef6   \n",
       "\n",
       "        EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "0  2022-12-31T18:14:06Z        user  \n",
       "1  2022-06-20T15:25:47Z        user  \n",
       "2  2022-10-30T20:49:56Z        user  \n",
       "3  2022-05-30T19:30:26Z        user  \n",
       "4  2022-12-05T07:48:39Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3576, 20)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL\n",
       "0            0\n",
       "1            0\n",
       "2            0\n",
       "3            0\n",
       "4            0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    3389\n",
      "1     187\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.952531\n",
      "1    0.047469\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "vehicleloan\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>disbursed_amount</th>\n",
       "      <th>asset_cost</th>\n",
       "      <th>ltv</th>\n",
       "      <th>branch_id</th>\n",
       "      <th>supplier_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>current_pincode_id</th>\n",
       "      <th>date_of_birth</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>state_id</th>\n",
       "      <th>employee_code_id</th>\n",
       "      <th>mobileno_avl_flag</th>\n",
       "      <th>aadhar_flag</th>\n",
       "      <th>pan_flag</th>\n",
       "      <th>voterid_flag</th>\n",
       "      <th>driving_flag</th>\n",
       "      <th>passport_flag</th>\n",
       "      <th>perform_cns_score</th>\n",
       "      <th>perform_cns_score_description</th>\n",
       "      <th>pri_no_of_accts</th>\n",
       "      <th>pri_active_accts</th>\n",
       "      <th>pri_overdue_accts</th>\n",
       "      <th>pri_current_balance</th>\n",
       "      <th>pri_sanctioned_amount</th>\n",
       "      <th>pri_disbursed_amount</th>\n",
       "      <th>sec_no_of_accts</th>\n",
       "      <th>sec_active_accts</th>\n",
       "      <th>sec_overdue_accts</th>\n",
       "      <th>sec_current_balance</th>\n",
       "      <th>sec_sanctioned_amount</th>\n",
       "      <th>sec_disbursed_amount</th>\n",
       "      <th>primary_instal_amt</th>\n",
       "      <th>sec_instal_amt</th>\n",
       "      <th>new_accts_in_last_six_months</th>\n",
       "      <th>delinquent_accts_in_last_six_months</th>\n",
       "      <th>average_acct_age</th>\n",
       "      <th>credit_history_length</th>\n",
       "      <th>no_of_inquiries</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>8976</th>\n",
       "      <td>462711</td>\n",
       "      <td>33484</td>\n",
       "      <td>62644</td>\n",
       "      <td>55.23</td>\n",
       "      <td>67</td>\n",
       "      <td>22727</td>\n",
       "      <td>45</td>\n",
       "      <td>1511</td>\n",
       "      <td>16-06-1991</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1201</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>743</td>\n",
       "      <td>C-Very Low Risk</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>160423</td>\n",
       "      <td>230489</td>\n",
       "      <td>194538</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9149</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 7mon</td>\n",
       "      <td>1yrs 4mon</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>27b9d5e1-69de-47f2-a559-cfba34dffb5f</td>\n",
       "      <td>2022-09-20T06:58:09Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>76007</th>\n",
       "      <td>558674</td>\n",
       "      <td>66882</td>\n",
       "      <td>81187</td>\n",
       "      <td>84.37</td>\n",
       "      <td>2</td>\n",
       "      <td>23508</td>\n",
       "      <td>86</td>\n",
       "      <td>1708</td>\n",
       "      <td>15-09-1994</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>4</td>\n",
       "      <td>1060</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1c58aced-df31-4170-8f85-e0dd95d1ff21</td>\n",
       "      <td>2022-08-25T18:27:59Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>77677</th>\n",
       "      <td>528251</td>\n",
       "      <td>59113</td>\n",
       "      <td>71757</td>\n",
       "      <td>84.87</td>\n",
       "      <td>48</td>\n",
       "      <td>21478</td>\n",
       "      <td>86</td>\n",
       "      <td>6322</td>\n",
       "      <td>01-01-1995</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>5</td>\n",
       "      <td>1189</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>738</td>\n",
       "      <td>C-Very Low Risk</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>45828</td>\n",
       "      <td>58582</td>\n",
       "      <td>58582</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4240</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 2mon</td>\n",
       "      <td>0yrs 4mon</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>fa383d19-de52-4a71-8222-77e328fcf387</td>\n",
       "      <td>2022-10-13T07:51:51Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>209438</th>\n",
       "      <td>633950</td>\n",
       "      <td>56059</td>\n",
       "      <td>71307</td>\n",
       "      <td>81.34</td>\n",
       "      <td>146</td>\n",
       "      <td>18317</td>\n",
       "      <td>86</td>\n",
       "      <td>2989</td>\n",
       "      <td>01-01-1971</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>14</td>\n",
       "      <td>2964</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37</td>\n",
       "      <td>2022-08-09T09:25:01Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143261</th>\n",
       "      <td>476747</td>\n",
       "      <td>56759</td>\n",
       "      <td>67100</td>\n",
       "      <td>85.69</td>\n",
       "      <td>136</td>\n",
       "      <td>17783</td>\n",
       "      <td>86</td>\n",
       "      <td>3793</td>\n",
       "      <td>03-12-1975</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>8</td>\n",
       "      <td>1295</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>e00bb721-ce37-4d32-99e8-84f8a46cf82f</td>\n",
       "      <td>2022-06-27T20:32:23Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       EVENT_ID disbursed_amount asset_cost    ltv branch_id supplier_id  \\\n",
       "8976     462711            33484      62644  55.23        67       22727   \n",
       "76007    558674            66882      81187  84.37         2       23508   \n",
       "77677    528251            59113      71757  84.87        48       21478   \n",
       "209438   633950            56059      71307  81.34       146       18317   \n",
       "143261   476747            56759      67100  85.69       136       17783   \n",
       "\n",
       "       manufacturer_id current_pincode_id date_of_birth employment_type  \\\n",
       "8976                45               1511    16-06-1991        Salaried   \n",
       "76007               86               1708    15-09-1994        Salaried   \n",
       "77677               86               6322    01-01-1995   Self employed   \n",
       "209438              86               2989    01-01-1971        Salaried   \n",
       "143261              86               3793    03-12-1975   Self employed   \n",
       "\n",
       "       state_id employee_code_id mobileno_avl_flag aadhar_flag pan_flag  \\\n",
       "8976          6             1201                 1           1        0   \n",
       "76007         4             1060                 1           1        0   \n",
       "77677         5             1189                 1           1        0   \n",
       "209438       14             2964                 1           1        0   \n",
       "143261        8             1295                 1           1        0   \n",
       "\n",
       "       voterid_flag driving_flag passport_flag perform_cns_score  \\\n",
       "8976              0            0             0               743   \n",
       "76007             0            0             0                 0   \n",
       "77677             0            0             0               738   \n",
       "209438            0            0             0                 0   \n",
       "143261            0            0             0                 0   \n",
       "\n",
       "       perform_cns_score_description pri_no_of_accts pri_active_accts  \\\n",
       "8976                 C-Very Low Risk               9                5   \n",
       "76007    No Bureau History Available               0                0   \n",
       "77677                C-Very Low Risk               3                3   \n",
       "209438   No Bureau History Available               0                0   \n",
       "143261   No Bureau History Available               0                0   \n",
       "\n",
       "       pri_overdue_accts pri_current_balance pri_sanctioned_amount  \\\n",
       "8976                   0              160423                230489   \n",
       "76007                  0                   0                     0   \n",
       "77677                  0               45828                 58582   \n",
       "209438                 0                   0                     0   \n",
       "143261                 0                   0                     0   \n",
       "\n",
       "       pri_disbursed_amount sec_no_of_accts sec_active_accts  \\\n",
       "8976                 194538               0                0   \n",
       "76007                     0               0                0   \n",
       "77677                 58582               0                0   \n",
       "209438                    0               0                0   \n",
       "143261                    0               0                0   \n",
       "\n",
       "       sec_overdue_accts sec_current_balance sec_sanctioned_amount  \\\n",
       "8976                   0                   0                     0   \n",
       "76007                  0                   0                     0   \n",
       "77677                  0                   0                     0   \n",
       "209438                 0                   0                     0   \n",
       "143261                 0                   0                     0   \n",
       "\n",
       "       sec_disbursed_amount primary_instal_amt sec_instal_amt  \\\n",
       "8976                      0               9149              0   \n",
       "76007                     0                  0              0   \n",
       "77677                     0               4240              0   \n",
       "209438                    0                  0              0   \n",
       "143261                    0                  0              0   \n",
       "\n",
       "       new_accts_in_last_six_months delinquent_accts_in_last_six_months  \\\n",
       "8976                              4                                   0   \n",
       "76007                             0                                   0   \n",
       "77677                             3                                   0   \n",
       "209438                            0                                   0   \n",
       "143261                            0                                   0   \n",
       "\n",
       "       average_acct_age credit_history_length no_of_inquiries  EVENT_LABEL  \\\n",
       "8976          0yrs 7mon             1yrs 4mon               1            0   \n",
       "76007         0yrs 0mon             0yrs 0mon               0            0   \n",
       "77677         0yrs 2mon             0yrs 4mon               0            1   \n",
       "209438        0yrs 0mon             0yrs 0mon               0            1   \n",
       "143261        0yrs 0mon             0yrs 0mon               0            0   \n",
       "\n",
       "                                   ENTITY_ID       EVENT_TIMESTAMP  \\\n",
       "8976    27b9d5e1-69de-47f2-a559-cfba34dffb5f  2022-09-20T06:58:09Z   \n",
       "76007   1c58aced-df31-4170-8f85-e0dd95d1ff21  2022-08-25T18:27:59Z   \n",
       "77677   fa383d19-de52-4a71-8222-77e328fcf387  2022-10-13T07:51:51Z   \n",
       "209438  6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37  2022-08-09T09:25:01Z   \n",
       "143261  e00bb721-ce37-4d32-99e8-84f8a46cf82f  2022-06-27T20:32:23Z   \n",
       "\n",
       "             LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "8976    2023-05-05T08:46:09Z        user  \n",
       "76007   2023-05-05T08:46:09Z        user  \n",
       "77677   2023-05-05T08:46:09Z        user  \n",
       "209438  2023-05-05T08:46:09Z        user  \n",
       "143261  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "44\n",
      "(186523, 44)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>disbursed_amount</th>\n",
       "      <th>asset_cost</th>\n",
       "      <th>ltv</th>\n",
       "      <th>branch_id</th>\n",
       "      <th>supplier_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>current_pincode_id</th>\n",
       "      <th>date_of_birth</th>\n",
       "      <th>employment_type</th>\n",
       "      <th>state_id</th>\n",
       "      <th>employee_code_id</th>\n",
       "      <th>mobileno_avl_flag</th>\n",
       "      <th>aadhar_flag</th>\n",
       "      <th>pan_flag</th>\n",
       "      <th>voterid_flag</th>\n",
       "      <th>driving_flag</th>\n",
       "      <th>passport_flag</th>\n",
       "      <th>perform_cns_score</th>\n",
       "      <th>perform_cns_score_description</th>\n",
       "      <th>pri_no_of_accts</th>\n",
       "      <th>pri_active_accts</th>\n",
       "      <th>pri_overdue_accts</th>\n",
       "      <th>pri_current_balance</th>\n",
       "      <th>pri_sanctioned_amount</th>\n",
       "      <th>pri_disbursed_amount</th>\n",
       "      <th>sec_no_of_accts</th>\n",
       "      <th>sec_active_accts</th>\n",
       "      <th>sec_overdue_accts</th>\n",
       "      <th>sec_current_balance</th>\n",
       "      <th>sec_sanctioned_amount</th>\n",
       "      <th>sec_disbursed_amount</th>\n",
       "      <th>primary_instal_amt</th>\n",
       "      <th>sec_instal_amt</th>\n",
       "      <th>new_accts_in_last_six_months</th>\n",
       "      <th>delinquent_accts_in_last_six_months</th>\n",
       "      <th>average_acct_age</th>\n",
       "      <th>credit_history_length</th>\n",
       "      <th>no_of_inquiries</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>420825</td>\n",
       "      <td>50578</td>\n",
       "      <td>58400</td>\n",
       "      <td>89.55</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1441</td>\n",
       "      <td>01-01-1984</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No Bureau History Available</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>03cf53e2-5c0b-4809-8333-04560101987b</td>\n",
       "      <td>2022-12-29T10:25:40Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>518279</td>\n",
       "      <td>54513</td>\n",
       "      <td>61900</td>\n",
       "      <td>89.66</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1501</td>\n",
       "      <td>08-09-1990</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>825</td>\n",
       "      <td>A-Very Low Risk</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1347</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1yrs 9mon</td>\n",
       "      <td>2yrs 0mon</td>\n",
       "      <td>0</td>\n",
       "      <td>03166b12-ee18-4144-aa73-10a3d2ac999a</td>\n",
       "      <td>2022-08-07T20:17:18Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>510278</td>\n",
       "      <td>43894</td>\n",
       "      <td>61900</td>\n",
       "      <td>71.89</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1501</td>\n",
       "      <td>04-10-1989</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>17</td>\n",
       "      <td>Not Scored: Not Enough Info available on the customer</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>72879</td>\n",
       "      <td>74500</td>\n",
       "      <td>74500</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0yrs 2mon</td>\n",
       "      <td>0yrs 2mon</td>\n",
       "      <td>0</td>\n",
       "      <td>ff0fc8f9-c524-45cc-99b4-139dd726d7cd</td>\n",
       "      <td>2022-11-03T09:35:54Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>510980</td>\n",
       "      <td>52603</td>\n",
       "      <td>61300</td>\n",
       "      <td>86.95</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1492</td>\n",
       "      <td>01-06-1968</td>\n",
       "      <td>Salaried</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>818</td>\n",
       "      <td>A-Very Low Risk</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2608</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1yrs 7mon</td>\n",
       "      <td>1yrs 7mon</td>\n",
       "      <td>0</td>\n",
       "      <td>8955bac7-5812-4e5f-b3ae-22738ee5e701</td>\n",
       "      <td>2023-02-19T06:55:03Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>513916</td>\n",
       "      <td>57713</td>\n",
       "      <td>65750</td>\n",
       "      <td>89.28</td>\n",
       "      <td>67</td>\n",
       "      <td>22807</td>\n",
       "      <td>45</td>\n",
       "      <td>1440</td>\n",
       "      <td>01-06-1976</td>\n",
       "      <td>Self employed</td>\n",
       "      <td>6</td>\n",
       "      <td>1998</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>300</td>\n",
       "      <td>M-Very High Risk</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>29069</td>\n",
       "      <td>1067200</td>\n",
       "      <td>1067200</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>47100</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2yrs 6mon</td>\n",
       "      <td>5yrs 6mon</td>\n",
       "      <td>0</td>\n",
       "      <td>a8154baa-1407-493a-bbc2-4bc1fd30d1f9</td>\n",
       "      <td>2022-08-14T11:20:39Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  EVENT_ID disbursed_amount asset_cost    ltv branch_id supplier_id  \\\n",
       "0   420825            50578      58400  89.55        67       22807   \n",
       "1   518279            54513      61900  89.66        67       22807   \n",
       "2   510278            43894      61900  71.89        67       22807   \n",
       "3   510980            52603      61300  86.95        67       22807   \n",
       "4   513916            57713      65750  89.28        67       22807   \n",
       "\n",
       "  manufacturer_id current_pincode_id date_of_birth employment_type state_id  \\\n",
       "0              45               1441    01-01-1984        Salaried        6   \n",
       "1              45               1501    08-09-1990   Self employed        6   \n",
       "2              45               1501    04-10-1989        Salaried        6   \n",
       "3              45               1492    01-06-1968        Salaried        6   \n",
       "4              45               1440    01-06-1976   Self employed        6   \n",
       "\n",
       "  employee_code_id mobileno_avl_flag aadhar_flag pan_flag voterid_flag  \\\n",
       "0             1998                 1           1        0            0   \n",
       "1             1998                 1           1        0            0   \n",
       "2             1998                 1           1        0            0   \n",
       "3             1998                 1           0        0            1   \n",
       "4             1998                 1           1        0            0   \n",
       "\n",
       "  driving_flag passport_flag perform_cns_score  \\\n",
       "0            0             0                 0   \n",
       "1            0             0               825   \n",
       "2            0             0                17   \n",
       "3            0             0               818   \n",
       "4            0             0               300   \n",
       "\n",
       "                           perform_cns_score_description pri_no_of_accts  \\\n",
       "0                            No Bureau History Available               0   \n",
       "1                                        A-Very Low Risk               2   \n",
       "2  Not Scored: Not Enough Info available on the customer               1   \n",
       "3                                        A-Very Low Risk               1   \n",
       "4                                       M-Very High Risk               6   \n",
       "\n",
       "  pri_active_accts pri_overdue_accts pri_current_balance  \\\n",
       "0                0                 0                   0   \n",
       "1                0                 0                   0   \n",
       "2                1                 0               72879   \n",
       "3                0                 0                   0   \n",
       "4                4                 2               29069   \n",
       "\n",
       "  pri_sanctioned_amount pri_disbursed_amount sec_no_of_accts sec_active_accts  \\\n",
       "0                     0                    0               0                0   \n",
       "1                     0                    0               0                0   \n",
       "2                 74500                74500               0                0   \n",
       "3                     0                    0               0                0   \n",
       "4               1067200              1067200               0                0   \n",
       "\n",
       "  sec_overdue_accts sec_current_balance sec_sanctioned_amount  \\\n",
       "0                 0                   0                     0   \n",
       "1                 0                   0                     0   \n",
       "2                 0                   0                     0   \n",
       "3                 0                   0                     0   \n",
       "4                 0                   0                     0   \n",
       "\n",
       "  sec_disbursed_amount primary_instal_amt sec_instal_amt  \\\n",
       "0                    0                  0              0   \n",
       "1                    0               1347              0   \n",
       "2                    0                  0              0   \n",
       "3                    0               2608              0   \n",
       "4                    0              47100              0   \n",
       "\n",
       "  new_accts_in_last_six_months delinquent_accts_in_last_six_months  \\\n",
       "0                            0                                   0   \n",
       "1                            0                                   0   \n",
       "2                            0                                   0   \n",
       "3                            0                                   0   \n",
       "4                            1                                   1   \n",
       "\n",
       "  average_acct_age credit_history_length no_of_inquiries  \\\n",
       "0        0yrs 0mon             0yrs 0mon               0   \n",
       "1        1yrs 9mon             2yrs 0mon               0   \n",
       "2        0yrs 2mon             0yrs 2mon               0   \n",
       "3        1yrs 7mon             1yrs 7mon               0   \n",
       "4        2yrs 6mon             5yrs 6mon               0   \n",
       "\n",
       "                              ENTITY_ID       EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "0  03cf53e2-5c0b-4809-8333-04560101987b  2022-12-29T10:25:40Z        user  \n",
       "1  03166b12-ee18-4144-aa73-10a3d2ac999a  2022-08-07T20:17:18Z        user  \n",
       "2  ff0fc8f9-c524-45cc-99b4-139dd726d7cd  2022-11-03T09:35:54Z        user  \n",
       "3  8955bac7-5812-4e5f-b3ae-22738ee5e701  2023-02-19T06:55:03Z        user  \n",
       "4  a8154baa-1407-493a-bbc2-4bc1fd30d1f9  2022-08-14T11:20:39Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(46631, 42)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL\n",
       "0            0\n",
       "1            0\n",
       "2            0\n",
       "3            0\n",
       "4            0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    36323\n",
      "1    10308\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.783925\n",
      "1    0.216075\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
      "malurl\n",
      "Train set: \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>dummy_cat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>167113</th>\n",
       "      <td>apolloduck.co.za/</td>\n",
       "      <td>0</td>\n",
       "      <td>d16773dd-0077-4129-a39d-f935464bd07f</td>\n",
       "      <td>5e694594-fcfa-418e-8417-21c5e99b8d8a</td>\n",
       "      <td>2022-05-15T15:36:37Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>87edb1a6-7936-4afa-b7be-4c35b7f1a5c6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>387680</th>\n",
       "      <td>acronyms.thefreedictionary.com/WDOM</td>\n",
       "      <td>0</td>\n",
       "      <td>b40b1f9e-9218-4a65-8b8e-870d45feb368</td>\n",
       "      <td>8d1aea20-97bb-46c4-bf56-3dc935f5c116</td>\n",
       "      <td>2022-06-28T06:32:21Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>864a0704-ab05-49c3-8a0c-5b0b23b3eeef</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>528900</th>\n",
       "      <td>https://nepan.org.np/Alibaba/Alibaba.com/Login.htm</td>\n",
       "      <td>1</td>\n",
       "      <td>86c52fda-2f6f-41ee-aa15-a7b682138cc9</td>\n",
       "      <td>fce90a90-3ce2-475c-ac7d-a0d6c8fa784a</td>\n",
       "      <td>2022-06-11T21:40:20Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>7ef071fc-a143-4d52-bd88-2a21f2b16c56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>251286</th>\n",
       "      <td>soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes</td>\n",
       "      <td>0</td>\n",
       "      <td>447529b9-923c-43e0-afed-c570e037f1aa</td>\n",
       "      <td>c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546</td>\n",
       "      <td>2022-08-15T12:11:14Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>2709ea1a-f5a7-4ecc-8dbe-767910778226</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>433650</th>\n",
       "      <td>ottawakiosk.com/hill_cam.html</td>\n",
       "      <td>0</td>\n",
       "      <td>976080b6-500f-4de3-95c4-a4c2679e672b</td>\n",
       "      <td>21497a05-52ce-4a25-a4d4-361b8298dbc1</td>\n",
       "      <td>2022-08-19T15:47:51Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "      <td>752bff63-ad3b-4845-b975-7f6f7302402c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                        url  \\\n",
       "167113                                                    apolloduck.co.za/   \n",
       "387680                                  acronyms.thefreedictionary.com/WDOM   \n",
       "528900                   https://nepan.org.np/Alibaba/Alibaba.com/Login.htm   \n",
       "251286  soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes   \n",
       "433650                                        ottawakiosk.com/hill_cam.html   \n",
       "\n",
       "        EVENT_LABEL                              EVENT_ID  \\\n",
       "167113            0  d16773dd-0077-4129-a39d-f935464bd07f   \n",
       "387680            0  b40b1f9e-9218-4a65-8b8e-870d45feb368   \n",
       "528900            1  86c52fda-2f6f-41ee-aa15-a7b682138cc9   \n",
       "251286            0  447529b9-923c-43e0-afed-c570e037f1aa   \n",
       "433650            0  976080b6-500f-4de3-95c4-a4c2679e672b   \n",
       "\n",
       "                                   ENTITY_ID       EVENT_TIMESTAMP  \\\n",
       "167113  5e694594-fcfa-418e-8417-21c5e99b8d8a  2022-05-15T15:36:37Z   \n",
       "387680  8d1aea20-97bb-46c4-bf56-3dc935f5c116  2022-06-28T06:32:21Z   \n",
       "528900  fce90a90-3ce2-475c-ac7d-a0d6c8fa784a  2022-06-11T21:40:20Z   \n",
       "251286  c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546  2022-08-15T12:11:14Z   \n",
       "433650  21497a05-52ce-4a25-a4d4-361b8298dbc1  2022-08-19T15:47:51Z   \n",
       "\n",
       "             LABEL_TIMESTAMP ENTITY_TYPE                             dummy_cat  \n",
       "167113  2023-05-05T08:46:09Z        user  87edb1a6-7936-4afa-b7be-4c35b7f1a5c6  \n",
       "387680  2023-05-05T08:46:09Z        user  864a0704-ab05-49c3-8a0c-5b0b23b3eeef  \n",
       "528900  2023-05-05T08:46:09Z        user  7ef071fc-a143-4d52-bd88-2a21f2b16c56  \n",
       "251286  2023-05-05T08:46:09Z        user  2709ea1a-f5a7-4ecc-8dbe-767910778226  \n",
       "433650  2023-05-05T08:46:09Z        user  752bff63-ad3b-4845-b975-7f6f7302402c  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8\n",
      "(586072, 8)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>url</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "      <th>dummy_cat</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html</td>\n",
       "      <td>b4233390-3167-401d-a85f-27331078ff27</td>\n",
       "      <td>3fd82c9f-b26a-44dc-ac26-4a635690938c</td>\n",
       "      <td>2022-11-20T12:29:18Z</td>\n",
       "      <td>user</td>\n",
       "      <td>f45a2001-81b6-4b29-bba9-e376cc9a4ca9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>cyndislist.com/us/pa/counties</td>\n",
       "      <td>77d73435-251f-43fa-a82c-cc6ab4dbce6b</td>\n",
       "      <td>7ac20b7a-ee66-46ce-83da-703e095e9c87</td>\n",
       "      <td>2022-12-26T07:01:46Z</td>\n",
       "      <td>user</td>\n",
       "      <td>a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ</td>\n",
       "      <td>87a47093-0039-445f-8002-87b6af3e709d</td>\n",
       "      <td>eaea621e-895d-43cf-8bbb-93acac029c47</td>\n",
       "      <td>2022-06-25T00:29:41Z</td>\n",
       "      <td>user</td>\n",
       "      <td>20e00a79-d5fc-49d1-b563-173e69f09434</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery</td>\n",
       "      <td>3143022e-ce02-441b-8ad0-5ebbf3c1c829</td>\n",
       "      <td>ba97f126-6159-4655-9c11-807c99807059</td>\n",
       "      <td>2023-03-07T14:27:10Z</td>\n",
       "      <td>user</td>\n",
       "      <td>5398bd49-ce09-4438-bfc3-24fce419c612</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>kitsapsun.com/photos/2011/feb/25/177999/</td>\n",
       "      <td>8885745c-4494-4f04-92a0-bb57006fe7aa</td>\n",
       "      <td>b51cdf46-1467-45f0-9c9c-62233be01d0e</td>\n",
       "      <td>2022-12-07T01:31:11Z</td>\n",
       "      <td>user</td>\n",
       "      <td>0ac04255-86df-47bc-8990-557f4c65fe0d</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                      url  \\\n",
       "0  http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html   \n",
       "1                                                                                           cyndislist.com/us/pa/counties   \n",
       "2                                 https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ   \n",
       "3                     articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery   \n",
       "4                                                                                kitsapsun.com/photos/2011/feb/25/177999/   \n",
       "\n",
       "                               EVENT_ID                             ENTITY_ID  \\\n",
       "0  b4233390-3167-401d-a85f-27331078ff27  3fd82c9f-b26a-44dc-ac26-4a635690938c   \n",
       "1  77d73435-251f-43fa-a82c-cc6ab4dbce6b  7ac20b7a-ee66-46ce-83da-703e095e9c87   \n",
       "2  87a47093-0039-445f-8002-87b6af3e709d  eaea621e-895d-43cf-8bbb-93acac029c47   \n",
       "3  3143022e-ce02-441b-8ad0-5ebbf3c1c829  ba97f126-6159-4655-9c11-807c99807059   \n",
       "4  8885745c-4494-4f04-92a0-bb57006fe7aa  b51cdf46-1467-45f0-9c9c-62233be01d0e   \n",
       "\n",
       "        EVENT_TIMESTAMP ENTITY_TYPE                             dummy_cat  \n",
       "0  2022-11-20T12:29:18Z        user  f45a2001-81b6-4b29-bba9-e376cc9a4ca9  \n",
       "1  2022-12-26T07:01:46Z        user  a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2  \n",
       "2  2022-06-25T00:29:41Z        user  20e00a79-d5fc-49d1-b563-173e69f09434  \n",
       "3  2023-03-07T14:27:10Z        user  5398bd49-ce09-4438-bfc3-24fce419c612  \n",
       "4  2022-12-07T01:31:11Z        user  0ac04255-86df-47bc-8990-557f4c65fe0d  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(65119, 6)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>b4233390-3167-401d-a85f-27331078ff27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>77d73435-251f-43fa-a82c-cc6ab4dbce6b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>87a47093-0039-445f-8002-87b6af3e709d</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>3143022e-ce02-441b-8ad0-5ebbf3c1c829</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>8885745c-4494-4f04-92a0-bb57006fe7aa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   EVENT_LABEL                              EVENT_ID\n",
       "0            0  b4233390-3167-401d-a85f-27331078ff27\n",
       "1            0  77d73435-251f-43fa-a82c-cc6ab4dbce6b\n",
       "2            1  87a47093-0039-445f-8002-87b6af3e709d\n",
       "3            0  3143022e-ce02-441b-8ad0-5ebbf3c1c829\n",
       "4            0  8885745c-4494-4f04-92a0-bb57006fe7aa"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    42695\n",
      "1    22424\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.657612\n",
      "1    0.342388\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ieeecis\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>dist1</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>v62</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v82</th>\n",
       "      <th>v91</th>\n",
       "      <th>v127</th>\n",
       "      <th>v130</th>\n",
       "      <th>v139</th>\n",
       "      <th>v160</th>\n",
       "      <th>v165</th>\n",
       "      <th>v187</th>\n",
       "      <th>v203</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v221</th>\n",
       "      <th>v234</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v283</th>\n",
       "      <th>v285</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TransactionID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2987000.0</th>\n",
       "      <td>0</td>\n",
       "      <td>68.5</td>\n",
       "      <td>W</td>\n",
       "      <td>13926.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>150.0</td>\n",
       "      <td>142.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>315.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>117.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7</td>\n",
       "      <td>13926.0_315.0_-13.0</td>\n",
       "      <td>2021-01-02T00:00:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987001.0</th>\n",
       "      <td>0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>W</td>\n",
       "      <td>2755.0</td>\n",
       "      <td>404.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>325.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9aa1d670-7446-4979-8c09-87f02311d2ca</td>\n",
       "      <td>2755.0_325.0_1.0</td>\n",
       "      <td>2021-01-02T00:00:01Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987002.0</th>\n",
       "      <td>0</td>\n",
       "      <td>59.0</td>\n",
       "      <td>W</td>\n",
       "      <td>4663.0</td>\n",
       "      <td>490.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>debit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>287.0</td>\n",
       "      <td>outlook.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3</td>\n",
       "      <td>4663.0_330.0_1.0</td>\n",
       "      <td>2021-01-02T00:01:09Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987003.0</th>\n",
       "      <td>0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>W</td>\n",
       "      <td>18132.0</td>\n",
       "      <td>567.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>117.0</td>\n",
       "      <td>debit</td>\n",
       "      <td>476.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1758.0</td>\n",
       "      <td>354.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>d3e3803c-b1a3-4dfd-841d-30b8d2611364</td>\n",
       "      <td>18132.0_476.0_-111.0</td>\n",
       "      <td>2021-01-02T00:01:39Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2987004.0</th>\n",
       "      <td>0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>H</td>\n",
       "      <td>4497.0</td>\n",
       "      <td>514.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>420.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>gmail.com</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>169690.796875</td>\n",
       "      <td>5155.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>70787.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>166.0</td>\n",
       "      <td>542.0</td>\n",
       "      <td>144.0</td>\n",
       "      <td>mobile</td>\n",
       "      <td>SAMSUNG SM-G892A Build/NRD90M</td>\n",
       "      <td>2c013afb-7779-45db-a330-a5808d531372</td>\n",
       "      <td>4497.0_420.0_1.0</td>\n",
       "      <td>2021-01-02T00:01:46Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               EVENT_LABEL  transactionamt productcd    card1  card2  card3  \\\n",
       "TransactionID                                                                 \n",
       "2987000.0                0            68.5         W  13926.0    NaN  150.0   \n",
       "2987001.0                0            29.0         W   2755.0  404.0  150.0   \n",
       "2987002.0                0            59.0         W   4663.0  490.0  150.0   \n",
       "2987003.0                0            50.0         W  18132.0  567.0  150.0   \n",
       "2987004.0                0            50.0         H   4497.0  514.0  150.0   \n",
       "\n",
       "               card5   card6  addr1  dist1 p_emaildomain r_emaildomain   c1  \\\n",
       "TransactionID                                                                 \n",
       "2987000.0      142.0  credit  315.0   19.0           NaN           NaN  1.0   \n",
       "2987001.0      102.0  credit  325.0    NaN     gmail.com           NaN  1.0   \n",
       "2987002.0      166.0   debit  330.0  287.0   outlook.com           NaN  1.0   \n",
       "2987003.0      117.0   debit  476.0    NaN     yahoo.com           NaN  2.0   \n",
       "2987004.0      102.0  credit  420.0    NaN     gmail.com           NaN  1.0   \n",
       "\n",
       "                c2   c4   c5   c6   c7   c8   c9  c10  c11  c12   c13  c14  \\\n",
       "TransactionID                                                                \n",
       "2987000.0      1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  2.0  0.0   1.0  1.0   \n",
       "2987001.0      1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0   1.0  1.0   \n",
       "2987002.0      1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0   1.0  1.0   \n",
       "2987003.0      5.0  0.0  0.0  4.0  0.0  0.0  1.0  0.0  1.0  0.0  25.0  1.0   \n",
       "2987004.0      1.0  0.0  0.0  1.0  0.0  1.0  0.0  1.0  1.0  0.0   1.0  1.0   \n",
       "\n",
       "               v62  v70  v76  v78  v82  v91    v127   v130  v139  \\\n",
       "TransactionID                                                      \n",
       "2987000.0      1.0  0.0  1.0  1.0  0.0  0.0   117.0    0.0   NaN   \n",
       "2987001.0      1.0  0.0  0.0  1.0  1.0  0.0     0.0    0.0   NaN   \n",
       "2987002.0      1.0  0.0  1.0  1.0  1.0  0.0     0.0    0.0   NaN   \n",
       "2987003.0      1.0  0.0  1.0  1.0  1.0  0.0  1758.0  354.0   NaN   \n",
       "2987004.0      NaN  NaN  NaN  NaN  NaN  NaN     0.0    0.0   0.0   \n",
       "\n",
       "                        v160    v165  v187  v203  v207  v209  v210  v221  \\\n",
       "TransactionID                                                              \n",
       "2987000.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987001.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987002.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987003.0                NaN     NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987004.0      169690.796875  5155.0   1.0   0.0   0.0   0.0   0.0   1.0   \n",
       "\n",
       "               v234  v257  v258  v261  v264  v266  v267  v271  v274  v277  \\\n",
       "TransactionID                                                               \n",
       "2987000.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987001.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987002.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987003.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   \n",
       "2987004.0       0.0   1.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   \n",
       "\n",
       "               v283  v285  v289  v291  v294  id_01    id_02  id_05  id_06  \\\n",
       "TransactionID                                                               \n",
       "2987000.0       1.0   0.0   0.0   1.0   1.0    NaN      NaN    NaN    NaN   \n",
       "2987001.0       1.0   0.0   0.0   1.0   0.0    NaN      NaN    NaN    NaN   \n",
       "2987002.0       1.0   0.0   0.0   1.0   0.0    NaN      NaN    NaN    NaN   \n",
       "2987003.0       0.0  10.0   0.0   1.0  38.0    NaN      NaN    NaN    NaN   \n",
       "2987004.0       1.0   0.0   0.0   1.0   0.0    0.0  70787.0    NaN    NaN   \n",
       "\n",
       "               id_09  id_13  id_17  id_19  id_20 devicetype  \\\n",
       "TransactionID                                                 \n",
       "2987000.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987001.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987002.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987003.0        NaN    NaN    NaN    NaN    NaN        NaN   \n",
       "2987004.0        NaN    NaN  166.0  542.0  144.0     mobile   \n",
       "\n",
       "                                  deviceinfo  \\\n",
       "TransactionID                                  \n",
       "2987000.0                                NaN   \n",
       "2987001.0                                NaN   \n",
       "2987002.0                                NaN   \n",
       "2987003.0                                NaN   \n",
       "2987004.0      SAMSUNG SM-G892A Build/NRD90M   \n",
       "\n",
       "                                           EVENT_ID             ENTITY_ID  \\\n",
       "TransactionID                                                               \n",
       "2987000.0      c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7   13926.0_315.0_-13.0   \n",
       "2987001.0      9aa1d670-7446-4979-8c09-87f02311d2ca      2755.0_325.0_1.0   \n",
       "2987002.0      4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3      4663.0_330.0_1.0   \n",
       "2987003.0      d3e3803c-b1a3-4dfd-841d-30b8d2611364  18132.0_476.0_-111.0   \n",
       "2987004.0      2c013afb-7779-45db-a330-a5808d531372      4497.0_420.0_1.0   \n",
       "\n",
       "                    EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "TransactionID                                                          \n",
       "2987000.0      2021-01-02T00:00:00Z  2023-05-05T08:46:09Z        user  \n",
       "2987001.0      2021-01-02T00:00:01Z  2023-05-05T08:46:09Z        user  \n",
       "2987002.0      2021-01-02T00:01:09Z  2023-05-05T08:46:09Z        user  \n",
       "2987003.0      2021-01-02T00:01:39Z  2023-05-05T08:46:09Z        user  \n",
       "2987004.0      2021-01-02T00:01:46Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "73\n",
      "(561013, 73)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>transactionamt</th>\n",
       "      <th>productcd</th>\n",
       "      <th>card1</th>\n",
       "      <th>card2</th>\n",
       "      <th>card3</th>\n",
       "      <th>card5</th>\n",
       "      <th>card6</th>\n",
       "      <th>addr1</th>\n",
       "      <th>dist1</th>\n",
       "      <th>p_emaildomain</th>\n",
       "      <th>r_emaildomain</th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "      <th>c4</th>\n",
       "      <th>c5</th>\n",
       "      <th>c6</th>\n",
       "      <th>c7</th>\n",
       "      <th>c8</th>\n",
       "      <th>c9</th>\n",
       "      <th>c10</th>\n",
       "      <th>c11</th>\n",
       "      <th>c12</th>\n",
       "      <th>c13</th>\n",
       "      <th>c14</th>\n",
       "      <th>v62</th>\n",
       "      <th>v70</th>\n",
       "      <th>v76</th>\n",
       "      <th>v78</th>\n",
       "      <th>v82</th>\n",
       "      <th>v91</th>\n",
       "      <th>v127</th>\n",
       "      <th>v130</th>\n",
       "      <th>v139</th>\n",
       "      <th>v160</th>\n",
       "      <th>v165</th>\n",
       "      <th>v187</th>\n",
       "      <th>v203</th>\n",
       "      <th>v207</th>\n",
       "      <th>v209</th>\n",
       "      <th>v210</th>\n",
       "      <th>v221</th>\n",
       "      <th>v234</th>\n",
       "      <th>v257</th>\n",
       "      <th>v258</th>\n",
       "      <th>v261</th>\n",
       "      <th>v264</th>\n",
       "      <th>v266</th>\n",
       "      <th>v267</th>\n",
       "      <th>v271</th>\n",
       "      <th>v274</th>\n",
       "      <th>v277</th>\n",
       "      <th>v283</th>\n",
       "      <th>v285</th>\n",
       "      <th>v289</th>\n",
       "      <th>v291</th>\n",
       "      <th>v294</th>\n",
       "      <th>id_01</th>\n",
       "      <th>id_02</th>\n",
       "      <th>id_05</th>\n",
       "      <th>id_06</th>\n",
       "      <th>id_09</th>\n",
       "      <th>id_13</th>\n",
       "      <th>id_17</th>\n",
       "      <th>id_19</th>\n",
       "      <th>id_20</th>\n",
       "      <th>devicetype</th>\n",
       "      <th>deviceinfo</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TransactionID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3548013.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109411.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66104.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103183.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>926.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>1411.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>569c4257-3d62-466d-a806-e3b456b2b372</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:11:15Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548014.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109536.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66229.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103308.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>927.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>693.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>e951afe6-b895-42b8-adff-df0f812e9ee8</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:11:29Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548015.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109661.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66354.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103433.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>928.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>1116.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>cd69e301-8c15-42b3-9839-cc4c8b9d89db</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:11:45Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548016.0</th>\n",
       "      <td>125.000000</td>\n",
       "      <td>S</td>\n",
       "      <td>15775.0</td>\n",
       "      <td>481.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>102.0</td>\n",
       "      <td>credit</td>\n",
       "      <td>330.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>yahoo.com</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>109786.000000</td>\n",
       "      <td>2301.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>66479.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>103558.0</td>\n",
       "      <td>877.0</td>\n",
       "      <td>1961.0</td>\n",
       "      <td>465.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>929.0</td>\n",
       "      <td>-10.0</td>\n",
       "      <td>1589.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>166.0</td>\n",
       "      <td>633.0</td>\n",
       "      <td>533.0</td>\n",
       "      <td>desktop</td>\n",
       "      <td>Windows</td>\n",
       "      <td>71431bc1-19ec-49b6-a00f-4e8c7d121b02</td>\n",
       "      <td>15775.0_330.0_129.0</td>\n",
       "      <td>2021-06-21T23:12:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548017.0</th>\n",
       "      <td>31.950001</td>\n",
       "      <td>W</td>\n",
       "      <td>9500.0</td>\n",
       "      <td>321.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>226.0</td>\n",
       "      <td>debit</td>\n",
       "      <td>204.0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>27.950001</td>\n",
       "      <td>27.950001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>de297b4c-d372-4fd3-8c66-ab6ff0c19e16</td>\n",
       "      <td>9500.0_204.0_150.0</td>\n",
       "      <td>2021-06-21T23:12:11Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               transactionamt productcd    card1  card2  card3  card5   card6  \\\n",
       "TransactionID                                                                   \n",
       "3548013.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548014.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548015.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548016.0          125.000000         S  15775.0  481.0  150.0  102.0  credit   \n",
       "3548017.0           31.950001         W   9500.0  321.0  150.0  226.0   debit   \n",
       "\n",
       "               addr1  dist1 p_emaildomain r_emaildomain   c1   c2   c4   c5  \\\n",
       "TransactionID                                                                 \n",
       "3548013.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548014.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548015.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548016.0      330.0    NaN           NaN     yahoo.com  5.0  3.0  3.0  0.0   \n",
       "3548017.0      204.0   74.0           NaN           NaN  3.0  3.0  0.0  1.0   \n",
       "\n",
       "                c6   c7   c8   c9  c10  c11  c12   c13  c14  v62  v70  v76  \\\n",
       "TransactionID                                                                \n",
       "3548013.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548014.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548015.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548016.0      0.0  0.0  8.0  0.0  3.0  5.0  0.0  61.0  5.0  0.0  0.0  NaN   \n",
       "3548017.0      1.0  0.0  0.0  1.0  0.0  1.0  0.0   6.0  3.0  1.0  1.0  1.0   \n",
       "\n",
       "               v78  v82  v91           v127         v130  v139    v160  \\\n",
       "TransactionID                                                            \n",
       "3548013.0      NaN  NaN  NaN  109411.000000  2301.000000   0.0  2401.0   \n",
       "3548014.0      NaN  NaN  NaN  109536.000000  2301.000000   0.0  2401.0   \n",
       "3548015.0      NaN  NaN  NaN  109661.000000  2301.000000   0.0  2401.0   \n",
       "3548016.0      NaN  NaN  NaN  109786.000000  2301.000000   0.0  2401.0   \n",
       "3548017.0      2.0  1.0  1.0      27.950001    27.950001   NaN     NaN   \n",
       "\n",
       "                  v165  v187      v203   v207    v209   v210  v221  v234  \\\n",
       "TransactionID                                                              \n",
       "3548013.0      66104.0   1.0  103183.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548014.0      66229.0   1.0  103308.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548015.0      66354.0   1.0  103433.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548016.0      66479.0   1.0  103558.0  877.0  1961.0  465.0   0.0  73.0   \n",
       "3548017.0          NaN   NaN       NaN    NaN     NaN    NaN   NaN   NaN   \n",
       "\n",
       "               v257  v258  v261  v264  v266  v267  v271  v274  v277  v283  \\\n",
       "TransactionID                                                               \n",
       "3548013.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548014.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548015.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548016.0       NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   NaN   1.0   \n",
       "3548017.0       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   1.0   \n",
       "\n",
       "               v285  v289  v291   v294  id_01   id_02  id_05  id_06  id_09  \\\n",
       "TransactionID                                                                \n",
       "3548013.0      26.0   1.0   2.0  926.0  -10.0  1411.0    6.0    0.0    0.0   \n",
       "3548014.0      26.0   1.0   2.0  927.0  -10.0   693.0    6.0    0.0    0.0   \n",
       "3548015.0      26.0   1.0   2.0  928.0  -10.0  1116.0    6.0    0.0    0.0   \n",
       "3548016.0      26.0   1.0   2.0  929.0  -10.0  1589.0    6.0    0.0    0.0   \n",
       "3548017.0       1.0   1.0   1.0    0.0    NaN     NaN    NaN    NaN    NaN   \n",
       "\n",
       "               id_13  id_17  id_19  id_20 devicetype deviceinfo  \\\n",
       "TransactionID                                                     \n",
       "3548013.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548014.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548015.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548016.0       52.0  166.0  633.0  533.0    desktop    Windows   \n",
       "3548017.0        NaN    NaN    NaN    NaN        NaN        NaN   \n",
       "\n",
       "                                           EVENT_ID            ENTITY_ID  \\\n",
       "TransactionID                                                              \n",
       "3548013.0      569c4257-3d62-466d-a806-e3b456b2b372  15775.0_330.0_129.0   \n",
       "3548014.0      e951afe6-b895-42b8-adff-df0f812e9ee8  15775.0_330.0_129.0   \n",
       "3548015.0      cd69e301-8c15-42b3-9839-cc4c8b9d89db  15775.0_330.0_129.0   \n",
       "3548016.0      71431bc1-19ec-49b6-a00f-4e8c7d121b02  15775.0_330.0_129.0   \n",
       "3548017.0      de297b4c-d372-4fd3-8c66-ab6ff0c19e16   9500.0_204.0_150.0   \n",
       "\n",
       "                    EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "TransactionID                                    \n",
       "3548013.0      2021-06-21T23:11:15Z        user  \n",
       "3548014.0      2021-06-21T23:11:29Z        user  \n",
       "3548015.0      2021-06-21T23:11:45Z        user  \n",
       "3548016.0      2021-06-21T23:12:00Z        user  \n",
       "3548017.0      2021-06-21T23:12:11Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(29527, 71)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TransactionID</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3548013.0</th>\n",
       "      <td>0</td>\n",
       "      <td>569c4257-3d62-466d-a806-e3b456b2b372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548014.0</th>\n",
       "      <td>0</td>\n",
       "      <td>e951afe6-b895-42b8-adff-df0f812e9ee8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548015.0</th>\n",
       "      <td>0</td>\n",
       "      <td>cd69e301-8c15-42b3-9839-cc4c8b9d89db</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548016.0</th>\n",
       "      <td>0</td>\n",
       "      <td>71431bc1-19ec-49b6-a00f-4e8c7d121b02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3548017.0</th>\n",
       "      <td>0</td>\n",
       "      <td>de297b4c-d372-4fd3-8c66-ab6ff0c19e16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               EVENT_LABEL                              EVENT_ID\n",
       "TransactionID                                                   \n",
       "3548013.0                0  569c4257-3d62-466d-a806-e3b456b2b372\n",
       "3548014.0                0  e951afe6-b895-42b8-adff-df0f812e9ee8\n",
       "3548015.0                0  cd69e301-8c15-42b3-9839-cc4c8b9d89db\n",
       "3548016.0                0  71431bc1-19ec-49b6-a00f-4e8c7d121b02\n",
       "3548017.0                0  de297b4c-d372-4fd3-8c66-ab6ff0c19e16"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    28358\n",
      "1     1169\n",
      "Name: EVENT_LABEL, dtype: int64\n",
      "0    0.965252\n",
      "1    0.034748\n",
      "Name: EVENT_LABEL, dtype: float64\n",
      "========= \n",
      "\n",
      "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ccfraud\n",
      "Train set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v1</th>\n",
       "      <th>v2</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v5</th>\n",
       "      <th>v6</th>\n",
       "      <th>v7</th>\n",
       "      <th>v8</th>\n",
       "      <th>v9</th>\n",
       "      <th>v10</th>\n",
       "      <th>v11</th>\n",
       "      <th>v12</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v15</th>\n",
       "      <th>v16</th>\n",
       "      <th>v17</th>\n",
       "      <th>v18</th>\n",
       "      <th>v19</th>\n",
       "      <th>v20</th>\n",
       "      <th>v21</th>\n",
       "      <th>v22</th>\n",
       "      <th>v23</th>\n",
       "      <th>v24</th>\n",
       "      <th>v25</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v28</th>\n",
       "      <th>amount</th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>LABEL_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-1.3598071336738</td>\n",
       "      <td>-0.0727811733098497</td>\n",
       "      <td>2.53634673796914</td>\n",
       "      <td>1.37815522427443</td>\n",
       "      <td>-0.338320769942518</td>\n",
       "      <td>0.462387777762292</td>\n",
       "      <td>0.239598554061257</td>\n",
       "      <td>0.0986979012610507</td>\n",
       "      <td>0.363786969611213</td>\n",
       "      <td>0.0907941719789316</td>\n",
       "      <td>-0.551599533260813</td>\n",
       "      <td>-0.617800855762348</td>\n",
       "      <td>-0.991389847235408</td>\n",
       "      <td>-0.311169353699879</td>\n",
       "      <td>1.46817697209427</td>\n",
       "      <td>-0.470400525259478</td>\n",
       "      <td>0.207971241929242</td>\n",
       "      <td>0.0257905801985591</td>\n",
       "      <td>0.403992960255733</td>\n",
       "      <td>0.251412098239705</td>\n",
       "      <td>-0.018306777944153</td>\n",
       "      <td>0.277837575558899</td>\n",
       "      <td>-0.110473910188767</td>\n",
       "      <td>0.0669280749146731</td>\n",
       "      <td>0.128539358273528</td>\n",
       "      <td>-0.189114843888824</td>\n",
       "      <td>0.133558376740387</td>\n",
       "      <td>-0.0210530534538215</td>\n",
       "      <td>149.62</td>\n",
       "      <td>0</td>\n",
       "      <td>f8e77dc0-44ef-490c-b0de-8b4054b5a031</td>\n",
       "      <td>266103ff-71f2-4057-981d-a54821367237</td>\n",
       "      <td>2021-09-01T00:00:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.19185711131486</td>\n",
       "      <td>0.26615071205963</td>\n",
       "      <td>0.16648011335321</td>\n",
       "      <td>0.448154078460911</td>\n",
       "      <td>0.0600176492822243</td>\n",
       "      <td>-0.0823608088155687</td>\n",
       "      <td>-0.0788029833323113</td>\n",
       "      <td>0.0851016549148104</td>\n",
       "      <td>-0.255425128109186</td>\n",
       "      <td>-0.166974414004614</td>\n",
       "      <td>1.61272666105479</td>\n",
       "      <td>1.06523531137287</td>\n",
       "      <td>0.48909501589608</td>\n",
       "      <td>-0.143772296441519</td>\n",
       "      <td>0.635558093258208</td>\n",
       "      <td>0.463917041022171</td>\n",
       "      <td>-0.114804663102346</td>\n",
       "      <td>-0.183361270123994</td>\n",
       "      <td>-0.145783041325259</td>\n",
       "      <td>-0.0690831352230203</td>\n",
       "      <td>-0.225775248033138</td>\n",
       "      <td>-0.638671952771851</td>\n",
       "      <td>0.101288021253234</td>\n",
       "      <td>-0.339846475529127</td>\n",
       "      <td>0.167170404418143</td>\n",
       "      <td>0.125894532368176</td>\n",
       "      <td>-0.00898309914322813</td>\n",
       "      <td>0.0147241691924927</td>\n",
       "      <td>2.69</td>\n",
       "      <td>0</td>\n",
       "      <td>b557449e-6b35-4be0-991e-337f764f5e21</td>\n",
       "      <td>f85083b2-d31f-4b9e-9d49-eb85c0476f6e</td>\n",
       "      <td>2021-09-01T00:00:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.35835406159823</td>\n",
       "      <td>-1.34016307473609</td>\n",
       "      <td>1.77320934263119</td>\n",
       "      <td>0.379779593034328</td>\n",
       "      <td>-0.503198133318193</td>\n",
       "      <td>1.80049938079263</td>\n",
       "      <td>0.791460956450422</td>\n",
       "      <td>0.247675786588991</td>\n",
       "      <td>-1.51465432260583</td>\n",
       "      <td>0.207642865216696</td>\n",
       "      <td>0.624501459424895</td>\n",
       "      <td>0.066083685268831</td>\n",
       "      <td>0.717292731410831</td>\n",
       "      <td>-0.165945922763554</td>\n",
       "      <td>2.34586494901581</td>\n",
       "      <td>-2.89008319444231</td>\n",
       "      <td>1.10996937869599</td>\n",
       "      <td>-0.121359313195888</td>\n",
       "      <td>-2.26185709530414</td>\n",
       "      <td>0.524979725224404</td>\n",
       "      <td>0.247998153469754</td>\n",
       "      <td>0.771679401917229</td>\n",
       "      <td>0.909412262347719</td>\n",
       "      <td>-0.689280956490685</td>\n",
       "      <td>-0.327641833735251</td>\n",
       "      <td>-0.139096571514147</td>\n",
       "      <td>-0.0553527940384261</td>\n",
       "      <td>-0.0597518405929204</td>\n",
       "      <td>378.66</td>\n",
       "      <td>0</td>\n",
       "      <td>d78d879c-eb7c-455d-8fde-6b1205080a4a</td>\n",
       "      <td>237ca488-c695-402c-b30f-0544554ea96c</td>\n",
       "      <td>2021-09-01T00:01:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.966271711572087</td>\n",
       "      <td>-0.185226008082898</td>\n",
       "      <td>1.79299333957872</td>\n",
       "      <td>-0.863291275036453</td>\n",
       "      <td>-0.0103088796030823</td>\n",
       "      <td>1.24720316752486</td>\n",
       "      <td>0.23760893977178</td>\n",
       "      <td>0.377435874652262</td>\n",
       "      <td>-1.38702406270197</td>\n",
       "      <td>-0.0549519224713749</td>\n",
       "      <td>-0.226487263835401</td>\n",
       "      <td>0.178228225877303</td>\n",
       "      <td>0.507756869957169</td>\n",
       "      <td>-0.28792374549456</td>\n",
       "      <td>-0.631418117709045</td>\n",
       "      <td>-1.0596472454325</td>\n",
       "      <td>-0.684092786345479</td>\n",
       "      <td>1.96577500349538</td>\n",
       "      <td>-1.2326219700892</td>\n",
       "      <td>-0.208037781160366</td>\n",
       "      <td>-0.108300452035545</td>\n",
       "      <td>0.00527359678253453</td>\n",
       "      <td>-0.190320518742841</td>\n",
       "      <td>-1.17557533186321</td>\n",
       "      <td>0.647376034602038</td>\n",
       "      <td>-0.221928844458407</td>\n",
       "      <td>0.0627228487293033</td>\n",
       "      <td>0.0614576285006353</td>\n",
       "      <td>123.5</td>\n",
       "      <td>0</td>\n",
       "      <td>ef448a36-2763-449c-a54a-a9e05af20967</td>\n",
       "      <td>9964b305-b591-4ed0-bff1-8adca81d0194</td>\n",
       "      <td>2021-09-01T00:01:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.15823309349523</td>\n",
       "      <td>0.877736754848451</td>\n",
       "      <td>1.548717846511</td>\n",
       "      <td>0.403033933955121</td>\n",
       "      <td>-0.407193377311653</td>\n",
       "      <td>0.0959214624684256</td>\n",
       "      <td>0.592940745385545</td>\n",
       "      <td>-0.270532677192282</td>\n",
       "      <td>0.817739308235294</td>\n",
       "      <td>0.753074431976354</td>\n",
       "      <td>-0.822842877946363</td>\n",
       "      <td>0.53819555014995</td>\n",
       "      <td>1.3458515932154</td>\n",
       "      <td>-1.11966983471731</td>\n",
       "      <td>0.175121130008994</td>\n",
       "      <td>-0.451449182813529</td>\n",
       "      <td>-0.237033239362776</td>\n",
       "      <td>-0.0381947870352842</td>\n",
       "      <td>0.803486924960175</td>\n",
       "      <td>0.408542360392758</td>\n",
       "      <td>-0.00943069713232919</td>\n",
       "      <td>0.79827849458971</td>\n",
       "      <td>-0.137458079619063</td>\n",
       "      <td>0.141266983824769</td>\n",
       "      <td>-0.206009587619756</td>\n",
       "      <td>0.502292224181569</td>\n",
       "      <td>0.219422229513348</td>\n",
       "      <td>0.215153147499206</td>\n",
       "      <td>69.99</td>\n",
       "      <td>0</td>\n",
       "      <td>e333b3c0-83ae-42dc-a865-178496653029</td>\n",
       "      <td>87b2fbf2-5b7d-479c-85f5-d989bd701f36</td>\n",
       "      <td>2021-09-01T00:02:00Z</td>\n",
       "      <td>2023-05-05T08:46:09Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   v1                   v2                v3  \\\n",
       "0    -1.3598071336738  -0.0727811733098497  2.53634673796914   \n",
       "1    1.19185711131486     0.26615071205963  0.16648011335321   \n",
       "2   -1.35835406159823    -1.34016307473609  1.77320934263119   \n",
       "3  -0.966271711572087   -0.185226008082898  1.79299333957872   \n",
       "4   -1.15823309349523    0.877736754848451    1.548717846511   \n",
       "\n",
       "                   v4                   v5                   v6  \\\n",
       "0    1.37815522427443   -0.338320769942518    0.462387777762292   \n",
       "1   0.448154078460911   0.0600176492822243  -0.0823608088155687   \n",
       "2   0.379779593034328   -0.503198133318193     1.80049938079263   \n",
       "3  -0.863291275036453  -0.0103088796030823     1.24720316752486   \n",
       "4   0.403033933955121   -0.407193377311653   0.0959214624684256   \n",
       "\n",
       "                    v7                  v8                  v9  \\\n",
       "0    0.239598554061257  0.0986979012610507   0.363786969611213   \n",
       "1  -0.0788029833323113  0.0851016549148104  -0.255425128109186   \n",
       "2    0.791460956450422   0.247675786588991   -1.51465432260583   \n",
       "3     0.23760893977178   0.377435874652262   -1.38702406270197   \n",
       "4    0.592940745385545  -0.270532677192282   0.817739308235294   \n",
       "\n",
       "                   v10                 v11                 v12  \\\n",
       "0   0.0907941719789316  -0.551599533260813  -0.617800855762348   \n",
       "1   -0.166974414004614    1.61272666105479    1.06523531137287   \n",
       "2    0.207642865216696   0.624501459424895   0.066083685268831   \n",
       "3  -0.0549519224713749  -0.226487263835401   0.178228225877303   \n",
       "4    0.753074431976354  -0.822842877946363    0.53819555014995   \n",
       "\n",
       "                  v13                 v14                 v15  \\\n",
       "0  -0.991389847235408  -0.311169353699879    1.46817697209427   \n",
       "1    0.48909501589608  -0.143772296441519   0.635558093258208   \n",
       "2   0.717292731410831  -0.165945922763554    2.34586494901581   \n",
       "3   0.507756869957169   -0.28792374549456  -0.631418117709045   \n",
       "4     1.3458515932154   -1.11966983471731   0.175121130008994   \n",
       "\n",
       "                  v16                 v17                  v18  \\\n",
       "0  -0.470400525259478   0.207971241929242   0.0257905801985591   \n",
       "1   0.463917041022171  -0.114804663102346   -0.183361270123994   \n",
       "2   -2.89008319444231    1.10996937869599   -0.121359313195888   \n",
       "3    -1.0596472454325  -0.684092786345479     1.96577500349538   \n",
       "4  -0.451449182813529  -0.237033239362776  -0.0381947870352842   \n",
       "\n",
       "                  v19                  v20                   v21  \\\n",
       "0   0.403992960255733    0.251412098239705    -0.018306777944153   \n",
       "1  -0.145783041325259  -0.0690831352230203    -0.225775248033138   \n",
       "2   -2.26185709530414    0.524979725224404     0.247998153469754   \n",
       "3    -1.2326219700892   -0.208037781160366    -0.108300452035545   \n",
       "4   0.803486924960175    0.408542360392758  -0.00943069713232919   \n",
       "\n",
       "                   v22                 v23                 v24  \\\n",
       "0    0.277837575558899  -0.110473910188767  0.0669280749146731   \n",
       "1   -0.638671952771851   0.101288021253234  -0.339846475529127   \n",
       "2    0.771679401917229   0.909412262347719  -0.689280956490685   \n",
       "3  0.00527359678253453  -0.190320518742841   -1.17557533186321   \n",
       "4     0.79827849458971  -0.137458079619063   0.141266983824769   \n",
       "\n",
       "                  v25                 v26                   v27  \\\n",
       "0   0.128539358273528  -0.189114843888824     0.133558376740387   \n",
       "1   0.167170404418143   0.125894532368176  -0.00898309914322813   \n",
       "2  -0.327641833735251  -0.139096571514147   -0.0553527940384261   \n",
       "3   0.647376034602038  -0.221928844458407    0.0627228487293033   \n",
       "4  -0.206009587619756   0.502292224181569     0.219422229513348   \n",
       "\n",
       "                   v28  amount  EVENT_LABEL  \\\n",
       "0  -0.0210530534538215  149.62            0   \n",
       "1   0.0147241691924927    2.69            0   \n",
       "2  -0.0597518405929204  378.66            0   \n",
       "3   0.0614576285006353   123.5            0   \n",
       "4    0.215153147499206   69.99            0   \n",
       "\n",
       "                               EVENT_ID                             ENTITY_ID  \\\n",
       "0  f8e77dc0-44ef-490c-b0de-8b4054b5a031  266103ff-71f2-4057-981d-a54821367237   \n",
       "1  b557449e-6b35-4be0-991e-337f764f5e21  f85083b2-d31f-4b9e-9d49-eb85c0476f6e   \n",
       "2  d78d879c-eb7c-455d-8fde-6b1205080a4a  237ca488-c695-402c-b30f-0544554ea96c   \n",
       "3  ef448a36-2763-449c-a54a-a9e05af20967  9964b305-b591-4ed0-bff1-8adca81d0194   \n",
       "4  e333b3c0-83ae-42dc-a865-178496653029  87b2fbf2-5b7d-479c-85f5-d989bd701f36   \n",
       "\n",
       "        EVENT_TIMESTAMP       LABEL_TIMESTAMP ENTITY_TYPE  \n",
       "0  2021-09-01T00:00:00Z  2023-05-05T08:46:09Z        user  \n",
       "1  2021-09-01T00:00:00Z  2023-05-05T08:46:09Z        user  \n",
       "2  2021-09-01T00:01:00Z  2023-05-05T08:46:09Z        user  \n",
       "3  2021-09-01T00:01:00Z  2023-05-05T08:46:09Z        user  \n",
       "4  2021-09-01T00:02:00Z  2023-05-05T08:46:09Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "35\n",
      "(227845, 35)\n",
      "Test set: \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v1</th>\n",
       "      <th>v2</th>\n",
       "      <th>v3</th>\n",
       "      <th>v4</th>\n",
       "      <th>v5</th>\n",
       "      <th>v6</th>\n",
       "      <th>v7</th>\n",
       "      <th>v8</th>\n",
       "      <th>v9</th>\n",
       "      <th>v10</th>\n",
       "      <th>v11</th>\n",
       "      <th>v12</th>\n",
       "      <th>v13</th>\n",
       "      <th>v14</th>\n",
       "      <th>v15</th>\n",
       "      <th>v16</th>\n",
       "      <th>v17</th>\n",
       "      <th>v18</th>\n",
       "      <th>v19</th>\n",
       "      <th>v20</th>\n",
       "      <th>v21</th>\n",
       "      <th>v22</th>\n",
       "      <th>v23</th>\n",
       "      <th>v24</th>\n",
       "      <th>v25</th>\n",
       "      <th>v26</th>\n",
       "      <th>v27</th>\n",
       "      <th>v28</th>\n",
       "      <th>amount</th>\n",
       "      <th>EVENT_ID</th>\n",
       "      <th>ENTITY_ID</th>\n",
       "      <th>EVENT_TIMESTAMP</th>\n",
       "      <th>ENTITY_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>227845</th>\n",
       "      <td>1.91402682161454</td>\n",
       "      <td>-0.490067987909997</td>\n",
       "      <td>-0.326111312515118</td>\n",
       "      <td>0.604710739174721</td>\n",
       "      <td>-0.8501359998436</td>\n",
       "      <td>-0.736318677031096</td>\n",
       "      <td>-0.524057962475328</td>\n",
       "      <td>-0.0886141066361987</td>\n",
       "      <td>1.09112510472248</td>\n",
       "      <td>0.093484357816225</td>\n",
       "      <td>-0.892304625856107</td>\n",
       "      <td>0.0272205159068718</td>\n",
       "      <td>-0.243790209618721</td>\n",
       "      <td>0.0317740067189187</td>\n",
       "      <td>0.900623897113791</td>\n",
       "      <td>0.536032161644219</td>\n",
       "      <td>-0.648408094097169</td>\n",
       "      <td>0.183072340001028</td>\n",
       "      <td>-0.48632249422331</td>\n",
       "      <td>-0.13957876335222</td>\n",
       "      <td>0.210958428878652</td>\n",
       "      <td>0.639337879054097</td>\n",
       "      <td>0.147522551988298</td>\n",
       "      <td>0.0736542664022496</td>\n",
       "      <td>-0.318378246601246</td>\n",
       "      <td>0.350612262707235</td>\n",
       "      <td>-0.0238434747433154</td>\n",
       "      <td>-0.0371393315055126</td>\n",
       "      <td>50</td>\n",
       "      <td>bd64c6f1-1c1d-49ea-8561-6cc56bd2a173</td>\n",
       "      <td>ee6232a9-6ba4-4654-b406-72e582f01031</td>\n",
       "      <td>2021-12-10T20:48:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227846</th>\n",
       "      <td>2.15269624649984</td>\n",
       "      <td>-0.036160786158066</td>\n",
       "      <td>-2.23181098049803</td>\n",
       "      <td>0.0917658435583919</td>\n",
       "      <td>0.537612206488446</td>\n",
       "      <td>-1.36810250972644</td>\n",
       "      <td>0.613326738349479</td>\n",
       "      <td>-0.455251954849699</td>\n",
       "      <td>0.29181359004335</td>\n",
       "      <td>0.253161344559488</td>\n",
       "      <td>-1.50188197076942</td>\n",
       "      <td>-0.870607641524177</td>\n",
       "      <td>-1.44173756499372</td>\n",
       "      <td>0.988756626201074</td>\n",
       "      <td>0.496349234837293</td>\n",
       "      <td>-0.0686989613348823</td>\n",
       "      <td>-0.454073497932566</td>\n",
       "      <td>-0.299095262736551</td>\n",
       "      <td>0.267443131415241</td>\n",
       "      <td>-0.275777914750361</td>\n",
       "      <td>0.0171533555339963</td>\n",
       "      <td>0.0632416225359206</td>\n",
       "      <td>-0.0345611249491173</td>\n",
       "      <td>-0.626866212626912</td>\n",
       "      <td>0.249213129413917</td>\n",
       "      <td>0.773930519516097</td>\n",
       "      <td>-0.137114784582898</td>\n",
       "      <td>-0.0906106088420727</td>\n",
       "      <td>14.95</td>\n",
       "      <td>6728a9b7-ab9c-404e-93a8-fcf76baf7e8e</td>\n",
       "      <td>3dc93b80-f110-4355-b516-5174a0cd214d</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227847</th>\n",
       "      <td>-4.03479516717275</td>\n",
       "      <td>2.30507905571504</td>\n",
       "      <td>-1.46169292457709</td>\n",
       "      <td>-0.729887055238227</td>\n",
       "      <td>-1.5287503399573</td>\n",
       "      <td>-1.22567909778369</td>\n",
       "      <td>-0.893353679497868</td>\n",
       "      <td>1.62252199369554</td>\n",
       "      <td>1.29199841774415</td>\n",
       "      <td>-0.0409558359937061</td>\n",
       "      <td>-0.971425287697512</td>\n",
       "      <td>0.574743695630458</td>\n",
       "      <td>0.155656078919204</td>\n",
       "      <td>-0.729054997889385</td>\n",
       "      <td>0.477438947999659</td>\n",
       "      <td>1.06171851569252</td>\n",
       "      <td>0.93469475367536</td>\n",
       "      <td>0.403768792198479</td>\n",
       "      <td>-0.494929851777981</td>\n",
       "      <td>-0.0810925858921718</td>\n",
       "      <td>-0.392556502541116</td>\n",
       "      <td>-0.78759906251576</td>\n",
       "      <td>0.343467795972994</td>\n",
       "      <td>-0.0903313999840935</td>\n",
       "      <td>0.248286972151669</td>\n",
       "      <td>-0.238523845342424</td>\n",
       "      <td>0.26648354183946</td>\n",
       "      <td>-0.0622361634691654</td>\n",
       "      <td>7.7</td>\n",
       "      <td>1f4a3cae-3a95-48b7-8cc9-dd2258689f37</td>\n",
       "      <td>58879cd9-4053-4e16-9144-3b04c276f74e</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227848</th>\n",
       "      <td>-1.66874106862583</td>\n",
       "      <td>1.16805471760364</td>\n",
       "      <td>0.249642461553748</td>\n",
       "      <td>-1.26849748925032</td>\n",
       "      <td>0.785922573014156</td>\n",
       "      <td>-0.663958562166729</td>\n",
       "      <td>0.859432973616895</td>\n",
       "      <td>0.0681106263347446</td>\n",
       "      <td>-0.144183044927318</td>\n",
       "      <td>0.0432880841287975</td>\n",
       "      <td>0.542013736060061</td>\n",
       "      <td>1.00202450469061</td>\n",
       "      <td>0.400759595743433</td>\n",
       "      <td>0.136412487776037</td>\n",
       "      <td>-1.28964902448879</td>\n",
       "      <td>0.276827961550432</td>\n",
       "      <td>-0.868491702025561</td>\n",
       "      <td>-0.366839507131127</td>\n",
       "      <td>-0.187391599008302</td>\n",
       "      <td>-0.0335233340620367</td>\n",
       "      <td>-0.247543775399679</td>\n",
       "      <td>-0.592536769878023</td>\n",
       "      <td>-0.286693549546811</td>\n",
       "      <td>-0.378855664973759</td>\n",
       "      <td>-0.0774289041638705</td>\n",
       "      <td>0.0676084004301294</td>\n",
       "      <td>-0.27896200360197</td>\n",
       "      <td>-0.0641926690992577</td>\n",
       "      <td>6.99</td>\n",
       "      <td>930cd5cb-b226-4af5-8dda-574340d05a12</td>\n",
       "      <td>bb616582-e509-4c77-9154-755ca81039c4</td>\n",
       "      <td>2021-12-10T20:49:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227849</th>\n",
       "      <td>-0.550678353341949</td>\n",
       "      <td>-0.429004102182237</td>\n",
       "      <td>-1.29189255347072</td>\n",
       "      <td>-0.414409226593379</td>\n",
       "      <td>-0.292228538671312</td>\n",
       "      <td>0.071842939235058</td>\n",
       "      <td>2.42606795091335</td>\n",
       "      <td>-0.212729758223082</td>\n",
       "      <td>0.412374372851086</td>\n",
       "      <td>-1.93996940549555</td>\n",
       "      <td>-1.81011838293809</td>\n",
       "      <td>-1.22351031687552</td>\n",
       "      <td>-1.32491464932768</td>\n",
       "      <td>-1.46239178995552</td>\n",
       "      <td>-0.31164055759838</td>\n",
       "      <td>0.506707760378257</td>\n",
       "      <td>0.739932584638577</td>\n",
       "      <td>0.892422017204659</td>\n",
       "      <td>0.195042529037103</td>\n",
       "      <td>0.791126747715284</td>\n",
       "      <td>0.00303193944814891</td>\n",
       "      <td>-0.645782978858753</td>\n",
       "      <td>0.877016475964068</td>\n",
       "      <td>-1.22852893747944</td>\n",
       "      <td>-0.0362812174160739</td>\n",
       "      <td>-0.110609895882901</td>\n",
       "      <td>-0.0983803135271981</td>\n",
       "      <td>0.0959849443846813</td>\n",
       "      <td>460.71</td>\n",
       "      <td>2e909126-def3-4d82-9485-03798817c942</td>\n",
       "      <td>88ea4bc9-29fd-4302-913d-e6788cb7e6ab</td>\n",
       "      <td>2021-12-10T20:50:00Z</td>\n",
       "      <td>user</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        v1                  v2                  v3  \\\n",
       "227845    1.91402682161454  -0.490067987909997  -0.326111312515118   \n",
       "227846    2.15269624649984  -0.036160786158066   -2.23181098049803   \n",
       "227847   -4.03479516717275    2.30507905571504   -1.46169292457709   \n",
       "227848   -1.66874106862583    1.16805471760364   0.249642461553748   \n",
       "227849  -0.550678353341949  -0.429004102182237   -1.29189255347072   \n",
       "\n",
       "                        v4                  v5                  v6  \\\n",
       "227845   0.604710739174721    -0.8501359998436  -0.736318677031096   \n",
       "227846  0.0917658435583919   0.537612206488446   -1.36810250972644   \n",
       "227847  -0.729887055238227    -1.5287503399573   -1.22567909778369   \n",
       "227848   -1.26849748925032   0.785922573014156  -0.663958562166729   \n",
       "227849  -0.414409226593379  -0.292228538671312   0.071842939235058   \n",
       "\n",
       "                        v7                   v8                  v9  \\\n",
       "227845  -0.524057962475328  -0.0886141066361987    1.09112510472248   \n",
       "227846   0.613326738349479   -0.455251954849699    0.29181359004335   \n",
       "227847  -0.893353679497868     1.62252199369554    1.29199841774415   \n",
       "227848   0.859432973616895   0.0681106263347446  -0.144183044927318   \n",
       "227849    2.42606795091335   -0.212729758223082   0.412374372851086   \n",
       "\n",
       "                        v10                 v11                 v12  \\\n",
       "227845    0.093484357816225  -0.892304625856107  0.0272205159068718   \n",
       "227846    0.253161344559488   -1.50188197076942  -0.870607641524177   \n",
       "227847  -0.0409558359937061  -0.971425287697512   0.574743695630458   \n",
       "227848   0.0432880841287975   0.542013736060061    1.00202450469061   \n",
       "227849    -1.93996940549555   -1.81011838293809   -1.22351031687552   \n",
       "\n",
       "                       v13                 v14                v15  \\\n",
       "227845  -0.243790209618721  0.0317740067189187  0.900623897113791   \n",
       "227846   -1.44173756499372   0.988756626201074  0.496349234837293   \n",
       "227847   0.155656078919204  -0.729054997889385  0.477438947999659   \n",
       "227848   0.400759595743433   0.136412487776037  -1.28964902448879   \n",
       "227849   -1.32491464932768   -1.46239178995552  -0.31164055759838   \n",
       "\n",
       "                        v16                 v17                 v18  \\\n",
       "227845    0.536032161644219  -0.648408094097169   0.183072340001028   \n",
       "227846  -0.0686989613348823  -0.454073497932566  -0.299095262736551   \n",
       "227847     1.06171851569252    0.93469475367536   0.403768792198479   \n",
       "227848    0.276827961550432  -0.868491702025561  -0.366839507131127   \n",
       "227849    0.506707760378257   0.739932584638577   0.892422017204659   \n",
       "\n",
       "                       v19                  v20                  v21  \\\n",
       "227845   -0.48632249422331    -0.13957876335222    0.210958428878652   \n",
       "227846   0.267443131415241   -0.275777914750361   0.0171533555339963   \n",
       "227847  -0.494929851777981  -0.0810925858921718   -0.392556502541116   \n",
       "227848  -0.187391599008302  -0.0335233340620367   -0.247543775399679   \n",
       "227849   0.195042529037103    0.791126747715284  0.00303193944814891   \n",
       "\n",
       "                       v22                  v23                  v24  \\\n",
       "227845   0.639337879054097    0.147522551988298   0.0736542664022496   \n",
       "227846  0.0632416225359206  -0.0345611249491173   -0.626866212626912   \n",
       "227847   -0.78759906251576    0.343467795972994  -0.0903313999840935   \n",
       "227848  -0.592536769878023   -0.286693549546811   -0.378855664973759   \n",
       "227849  -0.645782978858753    0.877016475964068    -1.22852893747944   \n",
       "\n",
       "                        v25                 v26                  v27  \\\n",
       "227845   -0.318378246601246   0.350612262707235  -0.0238434747433154   \n",
       "227846    0.249213129413917   0.773930519516097   -0.137114784582898   \n",
       "227847    0.248286972151669  -0.238523845342424     0.26648354183946   \n",
       "227848  -0.0774289041638705  0.0676084004301294    -0.27896200360197   \n",
       "227849  -0.0362812174160739  -0.110609895882901  -0.0983803135271981   \n",
       "\n",
       "                        v28  amount                              EVENT_ID  \\\n",
       "227845  -0.0371393315055126      50  bd64c6f1-1c1d-49ea-8561-6cc56bd2a173   \n",
       "227846  -0.0906106088420727   14.95  6728a9b7-ab9c-404e-93a8-fcf76baf7e8e   \n",
       "227847  -0.0622361634691654     7.7  1f4a3cae-3a95-48b7-8cc9-dd2258689f37   \n",
       "227848  -0.0641926690992577    6.99  930cd5cb-b226-4af5-8dda-574340d05a12   \n",
       "227849   0.0959849443846813  460.71  2e909126-def3-4d82-9485-03798817c942   \n",
       "\n",
       "                                   ENTITY_ID       EVENT_TIMESTAMP ENTITY_TYPE  \n",
       "227845  ee6232a9-6ba4-4654-b406-72e582f01031  2021-12-10T20:48:00Z        user  \n",
       "227846  3dc93b80-f110-4355-b516-5174a0cd214d  2021-12-10T20:49:00Z        user  \n",
       "227847  58879cd9-4053-4e16-9144-3b04c276f74e  2021-12-10T20:49:00Z        user  \n",
       "227848  bb616582-e509-4c77-9154-755ca81039c4  2021-12-10T20:49:00Z        user  \n",
       "227849  88ea4bc9-29fd-4302-913d-e6788cb7e6ab  2021-12-10T20:50:00Z        user  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(56962, 33)\n",
      "Test scores\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>EVENT_LABEL</th>\n",
       "      <th>EVENT_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>227845</th>\n",
       "      <td>0</td>\n",
       "      <td>bd64c6f1-1c1d-49ea-8561-6cc56bd2a173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227846</th>\n",
       "      <td>0</td>\n",
       "      <td>6728a9b7-ab9c-404e-93a8-fcf76baf7e8e</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227847</th>\n",
       "      <td>0</td>\n",
       "      <td>1f4a3cae-3a95-48b7-8cc9-dd2258689f37</td>\n",
       "    </tr>\n
SYMBOL INDEX (125 symbols across 11 files)

FILE: scripts/reproducibility/afd/create_afd_resources.py
  function afd_train_model_demo (line 32) | def afd_train_model_demo(config):

FILE: scripts/reproducibility/afd/score_afd_model.py
  function create_outcomes (line 31) | def create_outcomes(outcomes):
  function create_rules (line 40) | def create_rules(score_cuts, outcomes):
  function ast_with_nan (line 88) | def ast_with_nan(x):
  function afd_train_model_demo (line 95) | def afd_train_model_demo():

FILE: scripts/reproducibility/autogluon/benchmark_ag.py
  function run_ag (line 29) | def run_ag(dataset, base_path, time_limit=3600, presets=None, hyperparam...

FILE: scripts/reproducibility/autosklearn/benchmark_autosklearn.py
  function load_data (line 77) | def load_data(dataset_path):
  function get_recall (line 101) | def get_recall(fpr, tpr, fpr_target=0.01):
  function run_autosklearn (line 105) | def run_autosklearn(dataset_path):

FILE: scripts/reproducibility/benchmark_utils.py
  function load_data (line 22) | def load_data(dataset, base_path):
  function get_recall (line 45) | def get_recall(fpr, tpr, fpr_target=0.01):

FILE: scripts/reproducibility/h2o/benchmark_h2o.py
  function run_h2o (line 29) | def run_h2o(dataset, base_path, connect_url=None, time_limit=None, inclu...

FILE: scripts/reproducibility/label-noise/load_fdb_datasets.py
  function noise_amount (line 20) | def noise_amount(df):
  function noise_rate (line 23) | def noise_rate(df):
  function type_1_noise_amount (line 29) | def type_1_noise_amount(df):
  function type_2_noise_amount (line 34) | def type_2_noise_amount(df):
  function actual_legit_amount (line 39) | def actual_legit_amount(df):
  function observed_legit_amount (line 42) | def observed_legit_amount(df):
  function actual_fraud_amount (line 45) | def actual_fraud_amount(df):
  function observed_fraud_amount (line 48) | def observed_fraud_amount(df):
  function actual_fraud_rate (line 51) | def actual_fraud_rate(df):
  function observed_fraud_rate (line 57) | def observed_fraud_rate(df):
  function type_1_noise_rate (line 63) | def type_1_noise_rate(df):
  function type_2_noise_rate (line 69) | def type_2_noise_rate(df):
  function prepare_data_fdb (line 75) | def prepare_data_fdb(key, drop_text_enr_features=True):
  function add_noise (line 212) | def add_noise(df, noise_type, noise_amount, *, time_index=None, features...
  function train_valid_split (line 273) | def train_valid_split(df, split=0.7, shuffle=True, sort_key='creation_da...
  function prepare_noisy_dataset (line 285) | def prepare_noisy_dataset(key, noise_type, noise_amount, split=0.7, shuf...
  function dataset_stats (line 345) | def dataset_stats(dataset):

FILE: scripts/reproducibility/label-noise/micro_models.py
  class MicroModelError (line 6) | class MicroModelError(Exception):
    method __init__ (line 10) | def __init__(self, error_message):
  class MicroModel (line 14) | class MicroModel:
    method __init__ (line 20) | def __init__(self, ModelClass, *args, **kwargs):
    method set_thresh (line 28) | def set_thresh(self, thresh):
    method fit (line 32) | def fit(self, x, y, *args, **kwargs):
    method predict_proba (line 36) | def predict_proba(self, x, *args, **kwargs):
    method predict (line 43) | def predict(self, x):
  class MicroModelEnsemble (line 54) | class MicroModelEnsemble:
    method __init__ (line 59) | def __init__(self, ModelClass, num_clfs=16, score_type='preds_avg', *a...
    method fit (line 85) | def fit(self, x, y, *args, **kwargs):
    method predict_proba (line 103) | def predict_proba(self, x, *args, **kwargs):
    method predict (line 117) | def predict(self, x, threshold=0.5, *args, **kwargs):
    method filter_noise (line 123) | def filter_noise(self, x, y, pulearning=True, threshold=0.5):
    method clean_noise (line 136) | def clean_noise(self, x, y, pulearning=True, threshold=0.5):
  class MicroModelCleaner (line 155) | class MicroModelCleaner:
    method __init__ (line 161) | def __init__(self, ModelClass, strategy='filter', pulearning=True, num...
    method fit (line 181) | def fit(self, x, y, *args, **kwargs):
    method predict (line 192) | def predict(self, x, *args, **kwargs):
    method predict_proba (line 195) | def predict_proba(self, x, *args, **kwargs):

FILE: src/fdb/datasets.py
  class FraudDatasetBenchmark (line 6) | class FraudDatasetBenchmark(ABC):
    method __init__ (line 7) | def __init__(
    method train (line 23) | def train(self):
    method test (line 27) | def test(self):
    method test_labels (line 31) | def test_labels(self):
    method eval (line 34) | def eval(self, y_pred):

FILE: src/fdb/preprocessing.py
  class BasePreProcessor (line 51) | class BasePreProcessor(ABC):
    method __init__ (line 52) | def __init__(
    method _download_kaggle_data_from_competetions (line 93) | def _download_kaggle_data_from_competetions(self):
    method _download_kaggle_data_from_datasets_with_given_filename (line 101) | def _download_kaggle_data_from_datasets_with_given_filename(self):
    method _download_kaggle_data_from_datasets_containing_single_file (line 114) | def _download_kaggle_data_from_datasets_containing_single_file(self):
    method download_kaggle_data (line 122) | def download_kaggle_data(self):
    method load_data (line 150) | def load_data(self):
    method timestamp_col (line 156) | def timestamp_col(self):
    method label_col (line 160) | def label_col(self):
    method event_id_col (line 167) | def event_id_col(self):
    method entity_id_col (line 171) | def entity_id_col(self):
    method standardize_timestamp_col (line 174) | def standardize_timestamp_col(self):
    method standardize_label_col (line 191) | def standardize_label_col(self):
    method standardize_event_id_col (line 195) | def standardize_event_id_col(self):
    method standardize_entity_id_col (line 204) | def standardize_entity_id_col(self):
    method rename_features (line 211) | def rename_features(self):
    method subset_features (line 215) | def subset_features(self):
    method drop_features (line 219) | def drop_features(self):
    method add_meta_data (line 222) | def add_meta_data(self):
    method sort_by_timestamp (line 226) | def sort_by_timestamp(self):
    method lower_case_col_names (line 229) | def lower_case_col_names(self):
    method preprocess (line 232) | def preprocess(self):
    method train_test_split (line 245) | def train_test_split(self):
  class FakejobPreProcessor (line 264) | class FakejobPreProcessor(BasePreProcessor):
    method __init__ (line 265) | def __init__(self, **kw):
  class VehicleloanPreProcessor (line 269) | class VehicleloanPreProcessor(BasePreProcessor):
    method __init__ (line 270) | def __init__(self, **kw):
  class MalurlPreProcessor (line 274) | class MalurlPreProcessor(BasePreProcessor):
    method __init__ (line 280) | def __init__(self, **kw):
    method standardize_label_col (line 283) | def standardize_label_col(self):
    method add_dummy_col (line 294) | def add_dummy_col(self):
    method preprocess (line 297) | def preprocess(self):
  class IEEEPreProcessor (line 301) | class IEEEPreProcessor(BasePreProcessor):
    method __init__ (line 312) | def __init__(self, **kw):
    method _dtypes_cols (line 316) | def _dtypes_cols():
    method load_data (line 372) | def load_data(self):
    method normalization (line 396) | def normalization(self):
    method standardize_entity_id_col (line 402) | def standardize_entity_id_col(self):
    method _add_seconds (line 412) | def _add_seconds(x):
    method standardize_timestamp_col (line 419) | def standardize_timestamp_col(self):
    method subset_features (line 425) | def subset_features(self):
    method preprocess (line 436) | def preprocess(self):
  class CCFraudPreProcessor (line 450) | class CCFraudPreProcessor(BasePreProcessor):
    method __init__ (line 451) | def __init__(self, **kw):
    method _add_minutes (line 455) | def _add_minutes(x):
    method standardize_timestamp_col (line 461) | def standardize_timestamp_col(self):
  class FraudecomPreProcessor (line 467) | class FraudecomPreProcessor(BasePreProcessor):
    method __init__ (line 468) | def __init__(self, ip_address_col, signup_time_col, **kw):
    method _add_years (line 474) | def _add_years(init_time):
    method standardize_timestamp_col (line 481) | def standardize_timestamp_col(self):
    method process_ip (line 490) | def process_ip(self):
    method create_time_since_signup (line 497) | def create_time_since_signup(self):
    method preprocess (line 502) | def preprocess(self):
  class SparknovPreProcessor (line 517) | class SparknovPreProcessor(BasePreProcessor):
    method __init__ (line 518) | def __init__(self, **kw):
    method load_data (line 521) | def load_data(self):
    method _add_months (line 538) | def _add_months(x):
    method standardize_timestamp_col (line 545) | def standardize_timestamp_col(self):
    method standardize_entity_id_col (line 551) | def standardize_entity_id_col(self):
    method train_test_split (line 558) | def train_test_split(self):
  class TwitterbotPreProcessor (line 574) | class TwitterbotPreProcessor(BasePreProcessor):
    method __init__ (line 575) | def __init__(self, **kw):
    method standardize_label_col (line 578) | def standardize_label_col(self):
  class IPBlocklistPreProcessor (line 588) | class IPBlocklistPreProcessor(BasePreProcessor):
    method __init__ (line 598) | def __init__(self, version, **kw):
    method load_data (line 602) | def load_data(self):
    method add_dummy_col (line 628) | def add_dummy_col(self):
    method train_test_split (line 631) | def train_test_split(self):
    method preprocess (line 635) | def preprocess(self):

FILE: src/fdb/preprocessing_objects.py
  function load_data (line 4) | def load_data(key, load_pre_downloaded, delete_downloaded, add_random_va...
Condensed preview: 39 files, each showing path, character count, and a content snippet.
[
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 309,
    "preview": "## Code of Conduct\nThis project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-condu"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 3160,
    "preview": "# Contributing Guidelines\n\nThank you for your interest in contributing to our project. Whether it's a bug report, new fe"
  },
  {
    "path": "LICENSE",
    "chars": 1288,
    "preview": "MIT License\n\nCopyright (c) 2021-2022 Prince Grover\nCopyright (c) 2021-2022 Zheng Li\nCopyright (c) 2022 Jianbo Liu\nCopyri"
  },
  {
    "path": "README.md",
    "chars": 20016,
    "preview": "# FDB: Fraud Dataset Benchmark\n\n*By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin T"
  },
  {
    "path": "scripts/examples/Test_FDB_Loader.ipynb",
    "chars": 287022,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "scripts/reproducibility/afd/README.md",
    "chars": 2082,
    "preview": "## Steps to reproduce AFD models\nAmazon Fraud Detector (AFD) models can be either run via AWS Console or using API calls"
  },
  {
    "path": "scripts/reproducibility/afd/configs/CreditCardFraudDetection.json",
    "chars": 3967,
    "preview": "{\n    \"dataset\": \"Credit Card Fraud Detection\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"v1\",\n"
  },
  {
    "path": "scripts/reproducibility/afd/configs/FakeJobPostingPrediction.json",
    "chars": 2517,
    "preview": "{\n    \"dataset\": \"Fake Job Posting Prediction\", \n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"titl"
  },
  {
    "path": "scripts/reproducibility/afd/configs/Fraudecommerce.json",
    "chars": 1023,
    "preview": "{\n    \"dataset\": \"Fraud ecommerce\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"purchase_value\",\n"
  },
  {
    "path": "scripts/reproducibility/afd/configs/IEEECISFraudDetection.json",
    "chars": 9041,
    "preview": "{\n    \"dataset\": \"IEEE-CIS Fraud Detection\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"transact"
  },
  {
    "path": "scripts/reproducibility/afd/configs/IPBlocklist.json",
    "chars": 462,
    "preview": "{\n    \"dataset\": \"IP-BlockList\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"ip\",\n            \"va"
  },
  {
    "path": "scripts/reproducibility/afd/configs/MaliciousURL.json",
    "chars": 490,
    "preview": "{\n    \"dataset\": \"Malicious URLs Dataset\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"url\",\n    "
  },
  {
    "path": "scripts/reproducibility/afd/configs/SimulatedCreditCardTransactionsSparkov.json",
    "chars": 2551,
    "preview": "{\n    \"dataset\": \"Simulated Credit Card Transactions generated using Sparkov\",\n    \"variable_mappings\": [\n        {\n    "
  },
  {
    "path": "scripts/reproducibility/afd/configs/TwitterBotAccounts.json",
    "chars": 2527,
    "preview": "{\n    \"dataset\": \"Twitter Bots Accounts\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"default_pro"
  },
  {
    "path": "scripts/reproducibility/afd/configs/VehicleLoanDefaultPrediction.json",
    "chars": 5738,
    "preview": "{\n    \"dataset\": \"Vehicle Loan Default Prediction\",\n    \"variable_mappings\": [\n        {\n            \"variable_name\": \"d"
  },
  {
    "path": "scripts/reproducibility/afd/create_afd_resources.py",
    "chars": 7544,
    "preview": "# TO BE UPDATED BY USER\nIAM_ROLE = \"<IAM ROLE with access to S3 bucket containing the data and access to Amazon Fraud D"
  },
  {
    "path": "scripts/reproducibility/afd/score_afd_model.py",
    "chars": 9111,
    "preview": "# TO BE UPDATED BY USER\nIAM_ROLE = \"<IAM ROLE with access to S3 bucket containing the data and access to Amazon Fraud D"
  },
  {
    "path": "scripts/reproducibility/autogluon/README.md",
    "chars": 304,
    "preview": " - benchmark_ag.py: a script for autogluon benchmarking\n - example-ag-ieeecis.ipynb: an example notebook using benchmark"
  },
  {
    "path": "scripts/reproducibility/autogluon/benchmark_ag.py",
    "chars": 2971,
    "preview": "import pandas as pd\nimport os\nimport gc\nimport joblib\nimport datetime\n\nimport matplotlib as mpl\nfrom sklearn.metrics imp"
  },
  {
    "path": "scripts/reproducibility/autogluon/example-ag-ieeecis.ipynb",
    "chars": 97746,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"7d350d0d\",\n   \"metadata\": {},\n   \"outputs\":"
  },
  {
    "path": "scripts/reproducibility/autosklearn/README.md",
    "chars": 879,
    "preview": "## Steps to reproduce Auto-sklearn models\n\n\n1. Load and save the datasets locally using [FDB Loader](../../examples/Test"
  },
  {
    "path": "scripts/reproducibility/autosklearn/benchmark_autosklearn.py",
    "chars": 5336,
    "preview": "\nimport json\nimport joblib\nimport datetime\nimport numpy as np\nimport pandas as pd\nimport os, sys, shutil\n\nfrom autosklea"
  },
  {
    "path": "scripts/reproducibility/benchmark_utils.py",
    "chars": 1448,
    "preview": "import numpy as np\nimport pandas as pd\nimport os\n\nimport matplotlib as mpl\n\nmpl.rcParams['figure.dpi'] = 150\npd.set_opti"
  },
  {
    "path": "scripts/reproducibility/h2o/README.md",
    "chars": 292,
    "preview": "- benchmark_h2o.py: a script for h2o benchmarking\n- example-h2o-ieeecis.ipynb: an example notebook using benchmark_h2o.p"
  },
  {
    "path": "scripts/reproducibility/h2o/benchmark_h2o.py",
    "chars": 3597,
    "preview": "import pandas as pd\nimport os\nimport gc\nimport joblib\n\nimport matplotlib as mpl\nfrom sklearn.metrics import roc_auc_scor"
  },
  {
    "path": "scripts/reproducibility/h2o/example-h2o-ieeecis.ipynb",
    "chars": 85367,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"afc2eecf\",\n   \"metadata\": {},\n   \"outputs\":"
  },
  {
    "path": "scripts/reproducibility/label-noise/benchmark_experiments.ipynb",
    "chars": 20437,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"c77e5eb5\",\n   \"metadata\": {},\n   \"output"
  },
  {
    "path": "scripts/reproducibility/label-noise/feature_dict.py",
    "chars": 5505,
    "preview": "feature_dict = {\n  'ieeecis': {\n    'transactionamt': 'numeric',\n    'productcd': 'categorical',\n    'card1': 'numeric',"
  },
  {
    "path": "scripts/reproducibility/label-noise/load_fdb_datasets.py",
    "chars": 14703,
    "preview": "import os\nimport re\nimport json\nimport pandas as pd\nimport numpy as np\nimport warnings\nfrom datetime import datetime\n\nfr"
  },
  {
    "path": "scripts/reproducibility/label-noise/micro_models.py",
    "chars": 8233,
    "preview": "import logging\nimport pandas as pd\nimport numpy as np\n\n\nclass MicroModelError(Exception):\n    \"\"\"\n    basic exception ty"
  },
  {
    "path": "setup.py",
    "chars": 655,
    "preview": "import os\nfrom glob import glob\n\nfrom setuptools import find_packages, setup\n\n\nsetup(\n    name='fraud_dataset_benchmark'"
  },
  {
    "path": "src/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/fdb/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/fdb/datasets.py",
    "chars": 1300,
    "preview": "from abc import abstractmethod, ABC\nfrom fdb.preprocessing import *\nfrom fdb.preprocessing_objects import load_data\nfrom"
  },
  {
    "path": "src/fdb/kaggle_configs.py",
    "chars": 1861,
    "preview": "KAGGLE_CONFIGS = {\n\n    \"fakejob\":\n    {\n        \"owner\": \"shivamb\",\n        \"dataset\": \"real-or-fake-fake-jobposting-pr"
  },
  {
    "path": "src/fdb/preprocessing.py",
    "chars": 26223,
    "preview": "\n\nimport os\nimport re\nimport shutil\nimport kaggle\nimport pkgutil\nimport requests\nimport zipfile\nimport numpy as np\nfrom "
  },
  {
    "path": "src/fdb/preprocessing_objects.py",
    "chars": 2949,
    "preview": "from fdb.preprocessing import *\n\n\ndef load_data(key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_n"
  },
  {
    "path": "src/fdb/versioned_datasets/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/fdb/versioned_datasets/ipblock/__init__.py",
    "chars": 0,
    "preview": ""
  }
]

About this extraction

This page contains the full source code of the amazon-science/fraud-dataset-benchmark GitHub repository, extracted and formatted as plain text. The extraction includes 39 files (623.7 KB), approximately 219.8k tokens, and a symbol index with 125 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
