Repository: amazon-science/fraud-dataset-benchmark
Branch: main
Commit: f100cb829599
Files: 39
Total size: 623.7 KB
Directory structure:
gitextract_sn16q5ml/
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── scripts/
│ ├── examples/
│ │ └── Test_FDB_Loader.ipynb
│ └── reproducibility/
│ ├── afd/
│ │ ├── README.md
│ │ ├── configs/
│ │ │ ├── CreditCardFraudDetection.json
│ │ │ ├── FakeJobPostingPrediction.json
│ │ │ ├── Fraudecommerce.json
│ │ │ ├── IEEECISFraudDetection.json
│ │ │ ├── IPBlocklist.json
│ │ │ ├── MaliciousURL.json
│ │ │ ├── SimulatedCreditCardTransactionsSparkov.json
│ │ │ ├── TwitterBotAccounts.json
│ │ │ └── VehicleLoanDefaultPrediction.json
│ │ ├── create_afd_resources.py
│ │ └── score_afd_model.py
│ ├── autogluon/
│ │ ├── README.md
│ │ ├── benchmark_ag.py
│ │ └── example-ag-ieeecis.ipynb
│ ├── autosklearn/
│ │ ├── README.md
│ │ └── benchmark_autosklearn.py
│ ├── benchmark_utils.py
│ ├── h2o/
│ │ ├── README.md
│ │ ├── benchmark_h2o.py
│ │ └── example-h2o-ieeecis.ipynb
│ └── label-noise/
│ ├── benchmark_experiments.ipynb
│ ├── feature_dict.py
│ ├── load_fdb_datasets.py
│ └── micro_models.py
├── setup.py
└── src/
├── __init__.py
└── fdb/
├── __init__.py
├── datasets.py
├── kaggle_configs.py
├── preprocessing.py
├── preprocessing_objects.py
└── versioned_datasets/
├── __init__.py
└── ipblock/
└── __init__.py
================================================
FILE CONTENTS
================================================
================================================
FILE: CODE_OF_CONDUCT.md
================================================
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing Guidelines
Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.
Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.
## Reporting Bugs/Feature Requests
We welcome you to use the GitHub issue tracker to report bugs or suggest features.
When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment
## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
To send us a pull request, please:
1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.
## Security issue notifications
If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
## Licensing
See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2021-2022 Prince Grover
Copyright (c) 2021-2022 Zheng Li
Copyright (c) 2022 Jianbo Liu
Copyright (c) 2022 Jakub Zablocki
Copyright (c) 2022 Jianbo Liu
Copyright (c) 2022 Hao Zhou
Copyright (c) 2022 Julia Xu
Copyright (c) 2022 Anqi Cheng
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# FDB: Fraud Dataset Benchmark
*By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin Tittelfitz](jtittelfitz), Anqi Cheng, [Jakub Zablocki](qbaza), Jianbo Liu, and [Hao Zhou](haozhouamzn)*
The **Fraud Dataset Benchmark (FDB)** is a compilation of publicly available datasets relevant to **fraud detection** ([arXiv Link](https://arxiv.org/abs/2208.14417)). The FDB aims to cover a wide variety of fraud detection tasks, ranging from card-not-present transaction fraud, bot attacks, malicious traffic and loan risk to content moderation. The Python-based data loaders from FDB provide dataset loading, standardized train-test splits and performance evaluation metrics. The goal of our work is to provide researchers working in the field of fraud and abuse detection with a standardized set of benchmarking datasets and evaluation tools for their experiments. Using FDB tools, we demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment and semi-supervised learning.
## Datasets used in FDB
Brief summary of the datasets used in FDB. Each dataset is described in detail in the [Data Sources section](#data-sources).
| **#** | **Dataset name** | **Dataset key** | **Fraud category** | **#Train** | **#Test** | **Class ratio (train)** | **#Feats** | **#Cat** | **#Num** | **#Text** | **#Enrichable** |
|-------|------------------------------------------------------------|-----------------|-------------------------------------|------------|-----------|-------------------------|------------|----------|----------|-----------|-----------------|
| 1 | IEEE-CIS Fraud Detection | ieeecis | Card Not Present Transactions Fraud | 561,013 | 28,527 | 3.50% | 67 | 6 | 61 | 0 | 0 |
| 2 | Credit Card Fraud Detection | ccfraud | Card Not Present Transactions Fraud | 227,845 | 56,962 | 0.18% | 28 | 0 | 28 | 0 | 0 |
| 3 | Fraud ecommerce | fraudecom | Card Not Present Transactions Fraud | 120,889 | 30,223 | 10.60% | 6 | 2 | 3 | 0 | 1 |
| 4 | Simulated Credit Card Transactions generated using Sparkov | sparknov | Card Not Present Transactions Fraud | 1,296,675 | 20,000 | 5.70% | 17 | 10 | 6 | 1 | 0 |
| 5 | Twitter Bots Accounts | twitterbot | Bot Attacks | 29,950 | 7,488 | 33.10% | 16 | 6 | 6 | 4 | 0 |
| 6 | Malicious URLs dataset | malurl | Malicious Traffic | 586,072 | 65,119 | 34.20% | 2 | 0 | 1 | 1 | 0 |
| 7 | Fake Job Posting Prediction | fakejob | Content Moderation | 14,304 | 3,576 | 4.70% | 16 | 10 | 1 | 5 | 0 |
| 8 | Vehicle Loan Default Prediction | vehicleloan | Credit Risk | 186,523 | 46,631 | 21.60% | 38 | 13 | 22 | 3 | 0 |
| 9 | IP Blocklist | ipblock | Malicious Traffic | 172,000 | 43,000 | 7% | 1 | 0 | 0 | 0 | 1 |
## Installation
### Requirements
- Kaggle account
- **Important**: The `ieeecis` dataset requires you to [**join the IEEE-CIS competition**](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call the FDB API. Otherwise you will get <span style="color:red">ApiException: (403)</span>.
- AWS account
- Python 3.7+
- Python requirements
```
autogluon==0.4.2
h2o==3.36.1.2
boto3==1.20.21
click==8.0.3
click-plugins==1.1.1
Faker==4.14.2
joblib==1.0.0
kaggle==1.5.12
numpy==1.19.5
pandas==1.1.2
regex==2020.7.14
scikit-learn==0.22.1
scipy==1.5.4
auto-sklearn==0.14.7
dask==2022.8.1
```
### Step 1: Setup Kaggle CLI
The `FraudDatasetBenchmark` object loads datasets from the source (which in most cases is Kaggle), modifies/standardizes them on the fly, and provides train-test splits. So, the first step is to set up the Kaggle CLI on the machine being used to run Python.
Follow the instructions in the [How to Use Kaggle](https://www.kaggle.com/docs/api) guide. In particular, remember to download the authentication token from "My Account" on Kaggle, and save it at `~/.kaggle/kaggle.json` on Linux and OSX, or at `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised, so once you have downloaded the token, move it from your Downloads folder to this folder.
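A quick way to confirm the token is in place before calling FDB is sketched below. It is only an illustration: `check_kaggle_token` is a hypothetical helper (not part of FDB or the Kaggle CLI), and it assumes the default token location described above and that the token file is a JSON object with `username` and `key` fields.

```python
import json
from pathlib import Path


def check_kaggle_token(token_path=None):
    """Return (ok, message) describing whether a Kaggle API token looks usable.

    Hypothetical helper for illustration; not part of FDB or the Kaggle CLI.
    """
    # Assumed default location: ~/.kaggle/kaggle.json
    # (Path.home() resolves to C:\Users\<Windows-username> on Windows).
    path = Path(token_path) if token_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return False, f"Token not found at {path}"
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False, f"{path} is not valid JSON"
    # Assumption: credentials are stored under the 'username' and 'key' fields.
    if not isinstance(token, dict):
        return False, f"{path} does not contain a JSON object"
    missing = [field for field in ("username", "key") if not token.get(field)]
    if missing:
        return False, f"{path} is missing fields: {', '.join(missing)}"
    return True, f"Token found at {path}"


if __name__ == "__main__":
    ok, message = check_kaggle_token()
    print(message)
```

The check only inspects the local file; it does not contact Kaggle, so it cannot detect an expired token or a competition you have not joined.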
#### Step 1.2. [Join the IEEE-CIS competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you call `fdb.datasets` with `ieeecis`. Otherwise you will get <span style="color:red">ApiException: (403)</span>.
### Step 2: Clone Repo
Once the Kaggle CLI is installed and set up, clone the GitHub repo using `git clone https://github.com/amazon-research/fraud-dataset-benchmark.git` if using HTTPS, or `git clone git@github.com:amazon-research/fraud-dataset-benchmark.git` if using SSH.
### Step 3: Install
Once the repo is cloned, `cd` to the repo from your terminal and type `pip install .`, which will install the required classes and methods.
## FraudDatasetBenchmark Usage
The usage is straightforward, where you create a `dataset` object of `FraudDatasetBenchmark` class, and extract useful goodies like train/test splits and eval_metrics.
**Important note**: If you are running multiple experiments that re-load dataframes many times, the default behavior of downloading from Kaggle before every load may exceed the account-level API limits. In that case, persist the downloaded dataset and load from the persisted data: during the first call of `FraudDatasetBenchmark()`, use `load_pre_downloaded=False, delete_downloaded=False`, and for subsequent calls, use `load_pre_downloaded=True, delete_downloaded=False`. The default setting is
`load_pre_downloaded=False, delete_downloaded=True`.
```
from fdb.datasets import FraudDatasetBenchmark
# display() is built in to notebooks; imported here so the snippet also runs as a script
from IPython.display import display

# all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'twitterbot', 'ipblock']
key = 'ipblock'

obj = FraudDatasetBenchmark(
    key=key,
    load_pre_downloaded=False,  # default
    delete_downloaded=True,  # default
    add_random_values_if_real_na={
        "EVENT_TIMESTAMP": True,
        "LABEL_TIMESTAMP": True,
        "ENTITY_ID": True,
        "ENTITY_TYPE": True,
        "EVENT_ID": True
    }  # default
)
print(obj.key)

# Train split: preview, number of columns, and shape
print('Train set: ')
display(obj.train.head())
print(len(obj.train.columns))
print(obj.train.shape)

# Test split: preview and shape
print('Test set: ')
display(obj.test.head())
print(obj.test.shape)

# Test labels and class balance
print('Test scores')
display(obj.test_labels.head())
print(obj.test_labels['EVENT_LABEL'].value_counts())
print(obj.train['EVENT_LABEL'].value_counts(normalize=True))
print('=========')
```
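The persisted-download pattern from the important note above can be sketched as follows. This is an illustrative wrapper, not part of FDB: `loader_kwargs` and `load_dataset` are hypothetical names, and the set of already-downloaded keys is tracked by the caller within a single Python session. Only the `key`, `load_pre_downloaded` and `delete_downloaded` parameters come from the README.

```python
def loader_kwargs(key, already_downloaded):
    """Return the download flags recommended in the note above for this key.

    First load of a key: download from the source and keep the files.
    Later loads: reuse the persisted files instead of calling the Kaggle API again.
    """
    return {
        "load_pre_downloaded": key in already_downloaded,
        "delete_downloaded": False,
    }


def load_dataset(key, already_downloaded):
    """Load an FDB dataset, downloading from the source only on the first call."""
    # Imported lazily so loader_kwargs can be used without FDB installed.
    from fdb.datasets import FraudDatasetBenchmark

    obj = FraudDatasetBenchmark(key=key, **loader_kwargs(key, already_downloaded))
    # Record the key only after a successful load, so a failed download is retried.
    already_downloaded.add(key)
    return obj


# Example (requires FDB and a configured Kaggle token):
# downloaded = set()
# first = load_dataset("ipblock", downloaded)   # downloads and persists
# second = load_dataset("ipblock", downloaded)  # loads from persisted files
```

Because the bookkeeping lives in memory, a new session starts with an empty set; if the files from an earlier session are still on disk, passing `load_pre_downloaded=True` directly avoids the extra download.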
Notebook template to load dataset using FDB data-loader is available at [scripts/examples/Test_FDB_Loader.ipynb](scripts/examples/Test_FDB_Loader.ipynb)
## Reproducibility
Reproducibility scripts are available at [scripts/reproducibility/](scripts/reproducibility/) in respective folders for [afd](scripts/reproducibility/afd), [autogluon](scripts/reproducibility/autogluon) and [h2o](scripts/reproducibility/h2o). Each folder also has a README with steps to reproduce.
## Benchmark Results
<!-- | **Dataset key** | **AUC-ROC** | | | | | **Recall at 1% FPR** | | | | |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|:--------------------:|:-----------:|:-------------:|:----------------:|:----------------:|
| | **AFD OFI** | **AFD TFI** | **AutoGluon** | **H2O** | **Auto-sklearn** | **AFD OFI** | **AFD TFI** | **AutoGluon** | **H2O** | **Auto-sklearn** |
| ccfraud | 0.985 | 0.99 | 0.99 | **0.992** | 0.988 | 0.88 | 0.88 | 0.88 | 0.853 | 0.88 |
| fakejob | 0.987 | - | **0.998** | 0.99 | 0.983 | 0.786 | - | 0.925 | 0.781 | 0.781 |
| fraudecom | 0.519 | **0.636** | 0.522 | 0.518 | 0.515 | 0.011 | 0.099 | 0.012 | 0.009 | 0.012 |
| ieeecis | 0.938 | **0.94** | 0.855 | 0.89 | 0.932 | 0.587 | 0.56 | 0.425 | 0.442 | 0.569 |
| malurl | 0.985 | - | **0.998** | Training failure | 0.5 | 0.868 | - | 0.976 | Training failure | 0.01 |
| sparknov | **0.998** | - | 0.997 | 0.997 | 0.995 | 1 | - | 0.927 | 0.896 | 0.868 |
| twitterbot | 0.934 | - | **0.943** | 0.938 | 0.936 | 0.518 | - | 0.419 | 0.382 | 0.369 |
| vehicleloan | **0.673** | - | 0.669 | 0.67 | 0.664 | 0.036 | - | 0.04 | 0.037 | 0.035 |
| ipblock | **0.937** | - | 0.804 | Training failure | 0.5 | 0.466 | - | 0.32 | Training failure | 0.01 | -->
| **Dataset key** | **AUC-ROC** | | | | |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|
| | **AFD OFI** | **AFD TFI** | **AutoGluon** | **H2O** | **Auto-sklearn** |
| ccfraud | 0.985 | 0.99 | 0.99 | **0.992** | 0.988 |
| fakejob | 0.987 | - | **0.998** | 0.99 | 0.983 |
| fraudecom | 0.519 | **0.636** | 0.522 | 0.518 | 0.515 |
| ieeecis | 0.938 | **0.94** | 0.855 | 0.89 | 0.932 |
| malurl | 0.985 | - | **0.998** | Training failure | 0.5 |
| sparknov | **0.998** | - | 0.997 | 0.997 | 0.995 |
| twitterbot | 0.934 | - | **0.943** | 0.938 | 0.936 |
| vehicleloan | **0.673** | - | 0.669 | 0.67 | 0.664 |
| ipblock | **0.937** | - | 0.804 | Training failure | 0.5 |
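For readers unfamiliar with the metric in the table above, AUC-ROC can be read as the probability that a randomly chosen fraudulent event receives a higher score than a randomly chosen legitimate one, with ties counting half. The pure-Python sketch below illustrates that definition only; it is not the evaluation code used to produce the table, and `auc_roc` is a hypothetical name.

```python
def auc_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 values (1 = fraud); scores: model scores, higher = more suspicious.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0  # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
    print(auc_roc([0, 1], [0.5, 0.5]))  # 0.5: identical scores carry no ranking information
```

The pairwise loop is quadratic in the number of events, so for datasets of the sizes listed earlier a library routine such as scikit-learn's `roc_auc_score` is the practical choice.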
### ROC Curves
The numbers in the legend represent AUC-ROC from different models from our baseline evaluations on AutoML.

## Data Sources
1. **IEEE-CIS Fraud Detection**
- Source URL: https://www.kaggle.com/c/ieee-fraud-detection/overview
- Source license: https://www.kaggle.com/competitions/ieee-fraud-detection/rules
- Variables: Anonymized product, card, address, email domain, device, transaction date information. Numeric columns with name prefixes as V, C, D and M, and meaning hidden from public.
- Fraud category: Card Not Present Transaction Fraud
- Provider: [Vesta Corporation](https://www.vesta.io/)
- Release date: 2019-10-03
- Description: Prepared by the IEEE Computational Intelligence Society, this card-not-present transaction fraud dataset was launched during the IEEE-CIS Fraud Detection Kaggle competition, and was provided by Vesta Corporation. The original dataset contains 393 features, which are reduced to 67 features in the benchmark. Feature selection was performed based on highly voted Kaggle kernels. The fraud rate in the training segment of the source dataset is 3.5%. We only used the training files (train transaction and train identity), containing 590,540 transactions, in the benchmark, and split them into train (95%) and test (5%) segments based on time. Based on the insights from a Kaggle kernel written by the competition winner, we added a UUID (named ENTITY_ID) that represents a fingerprint and was created using card, address, time and D1 features.
2. **Credit Card Fraud Detection**
- Source URL: https://www.kaggle.com/mlg-ulb/creditcardfraud/
- Source license: https://opendatacommons.org/licenses/dbcl/1-0/
- Variables: PCA transformed features, time, amount (highly imbalanced)
- Fraud category: Card Not Present Transaction Fraud
- Provider: [Machine Learning Group - ULB](https://mlg.ulb.ac.be/)
- Release date: 2018-03-23
- Description: This dataset contains anonymized credit card transactions by European cardholders in September 2013. The dataset contains 492 frauds out of 284,807 transactions over 2 days. Data only contains numerical features that are the result of a PCA transformation, plus non transformed time and amount.
3. **Fraud ecommerce**
- Source URL: https://www.kaggle.com/vbinh002/fraud-ecommerce
- Source license: None
- Variables: The features include sign up time, purchase time, purchase value, device id, user id, browser, and IP address. We added a new feature that measured the time difference between sign up and purchase, as the age of an account is often an important variable in fraud detection.
- Fraud category: Card Not Present Transaction Fraud
- Provider: [Binh Vu](https://www.kaggle.com/vbinh002)
- Release date: 2018-12-09
- Description: This dataset contains ~150k e-commerce transactions.
4. **Simulated Credit Card Transactions generated using Sparkov**
- Source URL: https://www.kaggle.com/kartik2112/fraud-detection
- Source license: https://creativecommons.org/publicdomain/zero/1.0/
- Variables: Transaction date, credit card number, merchant, category, amount, name, street, gender. All variables are synthetically generated using the Sparkov tool.
- Fraud category: Card Not Present Transaction Fraud
- Provider: [Kartik Shenoy](https://www.kaggle.com/kartik2112)
- Release date: 2020-08-05
- Description: This is a simulated credit card transaction dataset. The dataset was generated using the Sparkov Data Generation tool, and we modified a version of the dataset created for Kaggle. It covers transactions of 1000 customers with a pool of 800 merchants over 6 months. We used both train and test segments directly from the source and randomly downsampled the test segment.
5. **Twitter Bots Accounts**
- Source URL: https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data?select=twitter_human_bots_dataset.csv
- Source license: https://creativecommons.org/publicdomain/zero/1.0/
- Variables: Features like account creation date, follower and following counts, profile description, account age, meta data about profile picture and account activity, and a label indicating whether the account is human or bot.
- Fraud category: Bot Attacks
- Provider: [David Martín Gutiérrez](https://www.kaggle.com/davidmartngutirrez)
- Release date: 2020-08-20
- Description: The dataset consists of 37,438 rows corresponding to different user accounts from Twitter.
6. **Malicious URLs dataset**
- Source URL: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
- Source license: https://creativecommons.org/publicdomain/zero/1.0/
- Variables: The Kaggle dataset is curated using five different sources, and contains url and type. Even though original dataset has multiclass label (type), we converted it into binary label.
- Fraud category: Malicious Traffic
- Provider: [Manu Siddhartha](https://www.kaggle.com/sid321axn)
- Release date: 2021-07-23
- Description: The Kaggle dataset is curated using five different sources, and contains url and type. Even though original dataset has multiclass label (type), we converted it into binary label. There is no timestamp information from the source. Therefore, we generate a dummy timestamp column for consistency.
7. **Real / Fake Job Posting Prediction**
- Source URL: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
- Source license: https://creativecommons.org/publicdomain/zero/1.0/
- Variables: Title, location, department, company, salary range, requirements, description, benefits, telecommuting. Most of the variables are categorical and free form text in nature.
- Fraud category: Content Moderation
- Provider: [Shivam Bansal](https://www.kaggle.com/shivamb)
- Release date: 2020-02-29
- Description: This Kaggle dataset contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The task is to train a classification model to detect which job posts are fraudulent.
8. **Vehicle Loan Default Prediction**
- Source URL: https://www.kaggle.com/avikpaul4u/vehicle-loan-default-prediction
- Source license: Unknown
- Variables: Loanee information, loan information, credit bureau data, and history.
- Fraud category: Credit Risk
- Provider: [Avik Paul](https://www.kaggle.com/avikpaul4u)
- Release date: 2019-11-12
- Description: The task in this dataset is to determine the probability of vehicle loan default, particularly the risk of default on the first monthly installments. It contains data for 233k loans with 21.7% default rate.
9. **IP Blocklist**
- Source URL: http://cinsscore.com/list/ci-badguys.txt
- Source license: Unknown
- Variables: The dataset contains an IP address and a label indicating whether the address is malicious or fake (randomly generated). A dummy categorical variable that has no relation to the label is added.
- Fraud category: Malicious Traffic
- Provider: [CINSscore.com](http://cinsscore.com)
- Release date: 2017-09-25
- Description: This dataset is made up of malicious IP addresses from cinsscore.com. To the list of malicious IP addresses, we added randomly generated IP addresses, created using Faker and labeled as benign.
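The IEEE-CIS entry above describes a train (95%) / test (5%) split based on time. A minimal sketch of such a split is shown below. It is an illustration under stated assumptions, not FDB's actual preprocessing: `time_based_split` is a hypothetical function, rows are plain dictionaries, and timestamps are assumed to be ISO 8601 strings in one consistent format (for example `2022-12-13T13:05:21Z`), so sorting them as strings matches chronological order.

```python
def time_based_split(rows, timestamp_key="EVENT_TIMESTAMP", train_fraction=0.95):
    """Split rows into (train, test) so that no test row is earlier than any train row.

    Hypothetical helper; assumes timestamps are same-format ISO 8601 strings,
    for which lexicographic order equals chronological order.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[timestamp_key])
    cutoff = round(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


if __name__ == "__main__":
    # Twenty synthetic events, one per day, deliberately supplied out of order.
    events = [
        {"EVENT_ID": day, "EVENT_TIMESTAMP": f"2022-01-{day:02d}T00:00:00Z"}
        for day in range(20, 0, -1)
    ]
    train, test = time_based_split(events)
    print(len(train), len(test))
    print(test[0]["EVENT_TIMESTAMP"])
```

Splitting on time rather than at random keeps later events out of the training data, which mirrors how a fraud model is used: it is trained on the past and scored on the future.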
## Citation
```
@misc{grover2023fraud,
title={Fraud Dataset Benchmark and Applications},
author={Prince Grover and Julia Xu and Justin Tittelfitz and Anqi Cheng and Zheng Li and Jakub Zablocki and Jianbo Liu and Hao Zhou},
year={2023},
eprint={2208.14417},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file.
## Acknowledgement
We thank the creators of all datasets used in the benchmark and the organizations that have helped in hosting the datasets and making them widely available for research purposes.
================================================
FILE: scripts/examples/Test_FDB_Loader.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append('../../src/')\n",
"from fdb.datasets import FraudDatasetBenchmark\n",
"from fdb.kaggle_configs import KAGGLE_CONFIGS"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>.container { width:90% }</style>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Notebook setups\n",
"\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"from io import StringIO\n",
"\n",
"from IPython.display import display, HTML\n",
"from IPython.display import clear_output\n",
"display(HTML(\"<style>.container { width:90% }</style>\"))\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_colwidth', 200)\n",
"pd.set_option('display.max_rows', 500)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"\n",
"if os.path.exists('tmp'):\n",
" shutil.rmtree('tmp')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT IF YOU NEED TO UPLOAD DATA TO AN S3 BUCKET IN YOUR ACCOUNT\n",
"\n",
"# import boto3\n",
"# BUCKET='<ADD S3 BUCKET NAME IF YOU WANT TO UPLOAD DATA TO YOUR ACCOUNT>'\n",
"\n",
"# def _s3_upload(df):\n",
"# csv_memory=StringIO()\n",
"# df.to_csv(csv_memory, index=False)\n",
"# content = csv_memory.getvalue()\n",
"# s3_client.put_object(\n",
"# Body=content,\n",
"# Bucket=BUCKET,\n",
"# Key=KEY,\n",
"# ACL='bucket-owner-full-control')\n",
"# s3_client = boto3.client('s3')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# All options for keys\n",
"all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud','fraudecom', 'twitterbot', 'ipblock']\n",
"# all_keys = ['ipblock']\n",
"# all_keys = ['twitterbot']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Default setting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default setting pulls data from the source onto your system, modifies the data, and adds random values for columns that are missing if the add_random_values_if_real_na flags are True.\n",
"\n",
"Default parameters: \n",
"- load_pre_downloaded: False\n",
"- delete_downloaded: True\n",
"- add_random_values_if_real_na = ```\n",
"{\n",
"\"EVENT_TIMESTAMP\": True,\n",
"\"LABEL_TIMESTAMP\": True,\n",
"\"ENTITY_ID\": True,\n",
"\"ENTITY_TYPE\": True,\n",
"\"EVENT_ID\": True\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
"fakejob\n",
"Train set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_ID</th>\n",
" <th>title</th>\n",
" <th>location</th>\n",
" <th>department</th>\n",
" <th>salary_range</th>\n",
" <th>company_profile</th>\n",
" <th>description</th>\n",
" <th>requirements</th>\n",
" <th>benefits</th>\n",
" <th>telecommuting</th>\n",
" <th>has_company_logo</th>\n",
" <th>has_questions</th>\n",
" <th>employment_type</th>\n",
" <th>required_experience</th>\n",
" <th>required_education</th>\n",
" <th>industry</th>\n",
" <th>function</th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>LABEL_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5736</th>\n",
" <td>5737</td>\n",
" <td>Jr. Business Analyst & Quality Analyst (entry level)</td>\n",
" <td>US, NJ, PISCATAWAY</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial &amp; Health care clients.Candidate should have knowledge or experience in ...</td>\n",
" <td>What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad...</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Full-time</td>\n",
" <td>Entry level</td>\n",
" <td>Master's Degree</td>\n",
" <td>Financial Services</td>\n",
" <td>Finance</td>\n",
" <td>0</td>\n",
" <td>382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b</td>\n",
" <td>2022-12-13T13:05:21Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7106</th>\n",
" <td>7107</td>\n",
" <td>English Teacher Abroad</td>\n",
" <td>US, PA, Scranton</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>We help teachers get safe &amp; secure jobs abroad :)</td>\n",
" <td>Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr...</td>\n",
" <td>University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only</td>\n",
" <td>See job description</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Contract</td>\n",
" <td>NaN</td>\n",
" <td>Bachelor's Degree</td>\n",
" <td>Education Management</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>deadb697-08d2-4dca-83ec-a15d5e501a5b</td>\n",
" <td>2022-07-26T01:40:53Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11978</th>\n",
" <td>11979</td>\n",
" <td>SQL Server Database Developer Job opportunity at Barrington, IL</td>\n",
" <td>US, IL, Barrington</td>\n",
" <td>NaN</td>\n",
" <td>90000-100000</td>\n",
" <td>We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc...</td>\n",
" <td>Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...</td>\n",
" <td>Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil...</td>\n",
" <td>Benefits - FullBonus Eligible - Yes</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Full-time</td>\n",
" <td>Mid-Senior level</td>\n",
" <td>Bachelor's Degree</td>\n",
" <td>Information Technology and Services</td>\n",
" <td>Information Technology</td>\n",
" <td>0</td>\n",
" <td>f5fcea87-6798-4529-a6c7-205d893b9b24</td>\n",
" <td>2023-03-09T13:06:59Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9374</th>\n",
" <td>9375</td>\n",
" <td>Legal Analyst - 12 Month FTC</td>\n",
" <td>GB, LND, London</td>\n",
" <td>Legal</td>\n",
" <td>NaN</td>\n",
" <td>MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo...</td>\n",
" <td>DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. We are seeking to bring much-needed innovation to the banki...</td>\n",
" <td>Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col...</td>\n",
" <td>Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible.</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Full-time</td>\n",
" <td>Associate</td>\n",
" <td>Professional</td>\n",
" <td>Financial Services</td>\n",
" <td>Legal</td>\n",
" <td>0</td>\n",
" <td>114fbd01-0573-42cf-9365-78729264e1aa</td>\n",
" <td>2022-12-09T08:17:07Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1300</th>\n",
" <td>1301</td>\n",
" <td>Part-Time Finance Assistant</td>\n",
" <td>GB, LND,</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a...</td>\n",
" <td>Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec...</td>\n",
" <td>Salary:£9 - £10 per hour</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Part-time</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Accounting</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>05a5dbdb-9778-4e4a-b967-7850dd483a54</td>\n",
" <td>2022-08-28T17:32:28Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_ID \\\n",
"5736 5737 \n",
"7106 7107 \n",
"11978 11979 \n",
"9374 9375 \n",
"1300 1301 \n",
"\n",
" title \\\n",
"5736 Jr. Business Analyst & Quality Analyst (entry level) \n",
"7106 English Teacher Abroad \n",
"11978 SQL Server Database Developer Job opportunity at Barrington, IL \n",
"9374 Legal Analyst - 12 Month FTC \n",
"1300 Part-Time Finance Assistant \n",
"\n",
" location department salary_range \\\n",
"5736 US, NJ, PISCATAWAY NaN NaN \n",
"7106 US, PA, Scranton NaN NaN \n",
"11978 US, IL, Barrington NaN 90000-100000 \n",
"9374 GB, LND, London Legal NaN \n",
"1300 GB, LND, NaN NaN \n",
"\n",
" company_profile \\\n",
"5736 NaN \n",
"7106 We help teachers get safe & secure jobs abroad :) \n",
"11978 We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc... \n",
"9374 MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo... \n",
"1300 NaN \n",
"\n",
" description \\\n",
"5736 Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial & Health care clients.Candidate should have knowledge or experience in ... \n",
"7106 Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr... \n",
"11978 Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil... \n",
"9374 DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. We are seeking to bring much-needed innovation to the banki... \n",
"1300 Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a... \n",
"\n",
" requirements \\\n",
"5736 What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad... \n",
"7106 University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only \n",
"11978 Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil... \n",
"9374 Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col... \n",
"1300 Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec... \n",
"\n",
" benefits \\\n",
"5736 NaN \n",
"7106 See job description \n",
"11978 Benefits - FullBonus Eligible - Yes \n",
"9374 Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible. \n",
"1300 Salary:£9 - £10 per hour \n",
"\n",
" telecommuting has_company_logo has_questions employment_type \\\n",
"5736 0 0 0 Full-time \n",
"7106 0 1 1 Contract \n",
"11978 0 0 0 Full-time \n",
"9374 0 1 0 Full-time \n",
"1300 0 0 0 Part-time \n",
"\n",
" required_experience required_education \\\n",
"5736 Entry level Master's Degree \n",
"7106 NaN Bachelor's Degree \n",
"11978 Mid-Senior level Bachelor's Degree \n",
"9374 Associate Professional \n",
"1300 NaN NaN \n",
"\n",
" industry function \\\n",
"5736 Financial Services Finance \n",
"7106 Education Management NaN \n",
"11978 Information Technology and Services Information Technology \n",
"9374 Financial Services Legal \n",
"1300 Accounting NaN \n",
"\n",
" EVENT_LABEL ENTITY_ID \\\n",
"5736 0 382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b \n",
"7106 0 deadb697-08d2-4dca-83ec-a15d5e501a5b \n",
"11978 0 f5fcea87-6798-4529-a6c7-205d893b9b24 \n",
"9374 0 114fbd01-0573-42cf-9365-78729264e1aa \n",
"1300 0 05a5dbdb-9778-4e4a-b967-7850dd483a54 \n",
"\n",
" EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n",
"5736 2022-12-13T13:05:21Z 2023-05-05T08:46:09Z user \n",
"7106 2022-07-26T01:40:53Z 2023-05-05T08:46:09Z user \n",
"11978 2023-03-09T13:06:59Z 2023-05-05T08:46:09Z user \n",
"9374 2022-12-09T08:17:07Z 2023-05-05T08:46:09Z user \n",
"1300 2022-08-28T17:32:28Z 2023-05-05T08:46:09Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"22\n",
"(14304, 22)\n",
"Test set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_ID</th>\n",
" <th>title</th>\n",
" <th>location</th>\n",
" <th>department</th>\n",
" <th>salary_range</th>\n",
" <th>company_profile</th>\n",
" <th>description</th>\n",
" <th>requirements</th>\n",
" <th>benefits</th>\n",
" <th>telecommuting</th>\n",
" <th>has_company_logo</th>\n",
" <th>has_questions</th>\n",
" <th>employment_type</th>\n",
" <th>required_experience</th>\n",
" <th>required_education</th>\n",
" <th>industry</th>\n",
" <th>function</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10</td>\n",
" <td>Customer Service Associate - Part Time</td>\n",
" <td>US, AZ, Phoenix</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr...</td>\n",
" <td>The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai...</td>\n",
" <td>Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre...</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Part-time</td>\n",
" <td>Entry level</td>\n",
" <td>High School or equivalent</td>\n",
" <td>Financial Services</td>\n",
" <td>Customer Service</td>\n",
" <td>1743dd4b-f989-4227-8480-cbafa760b4de</td>\n",
" <td>2022-12-31T18:14:06Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>15</td>\n",
" <td>Account Executive - Sydney</td>\n",
" <td>AU, NSW, Sydney</td>\n",
" <td>Sales</td>\n",
" <td>NaN</td>\n",
" <td>Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ...</td>\n",
" <td>Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th...</td>\n",
" <td>You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication...</td>\n",
" <td>In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Full-time</td>\n",
" <td>Associate</td>\n",
" <td>Bachelor's Degree</td>\n",
" <td>Internet</td>\n",
" <td>Sales</td>\n",
" <td>d5a82588-fcff-495b-aeda-20a8de0737d0</td>\n",
" <td>2022-06-20T15:25:47Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>16</td>\n",
" <td>VP of Sales - Vault Dragon</td>\n",
" <td>SG, 01, Singapore</td>\n",
" <td>Sales</td>\n",
" <td>120000-150000</td>\n",
" <td>Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ...</td>\n",
" <td>About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count...</td>\n",
" <td>Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ...</td>\n",
" <td>Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Full-time</td>\n",
" <td>Executive</td>\n",
" <td>Bachelor's Degree</td>\n",
" <td>Facilities Services</td>\n",
" <td>Sales</td>\n",
" <td>298d3508-76bb-4362-9ad4-f843fa3f99fa</td>\n",
" <td>2022-10-30T20:49:56Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>19</td>\n",
" <td>Visual Designer</td>\n",
" <td>US, NY, New York</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer...</td>\n",
" <td>Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>cad2f705-4b22-4110-bb06-b34a47c62a6d</td>\n",
" <td>2022-05-30T19:30:26Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>21</td>\n",
" <td>Marketing Assistant</td>\n",
" <td>US, TX, Austin</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ...</td>\n",
" <td>IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential...</td>\n",
" <td>Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website...</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Marketing</td>\n",
" <td>24c31ad9-95a9-479c-87c5-de6af06ddef6</td>\n",
" <td>2022-12-05T07:48:39Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_ID title location \\\n",
"0 10 Customer Service Associate - Part Time US, AZ, Phoenix \n",
"1 15 Account Executive - Sydney AU, NSW, Sydney \n",
"2 16 VP of Sales - Vault Dragon SG, 01, Singapore \n",
"3 19 Visual Designer US, NY, New York \n",
"4 21 Marketing Assistant US, TX, Austin \n",
"\n",
" department salary_range \\\n",
"0 NaN NaN \n",
"1 Sales NaN \n",
"2 Sales 120000-150000 \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" company_profile \\\n",
"0 Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr... \n",
"1 Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ... \n",
"2 Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ... \n",
"3 Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer... \n",
"4 IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ... \n",
"\n",
" description \\\n",
"0 The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai... \n",
"1 Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th... \n",
"2 About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count... \n",
"3 Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ... \n",
"4 IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential... \n",
"\n",
" requirements \\\n",
"0 Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre... \n",
"1 You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication... \n",
"2 Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ... \n",
"3 NaN \n",
"4 Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website... \n",
"\n",
" benefits \\\n",
"0 NaN \n",
"1 In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ... \n",
"2 Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa... \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" telecommuting has_company_logo has_questions employment_type \\\n",
"0 0 1 0 Part-time \n",
"1 0 1 0 Full-time \n",
"2 0 1 1 Full-time \n",
"3 0 1 0 NaN \n",
"4 0 1 0 NaN \n",
"\n",
" required_experience required_education industry \\\n",
"0 Entry level High School or equivalent Financial Services \n",
"1 Associate Bachelor's Degree Internet \n",
"2 Executive Bachelor's Degree Facilities Services \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
" function ENTITY_ID \\\n",
"0 Customer Service 1743dd4b-f989-4227-8480-cbafa760b4de \n",
"1 Sales d5a82588-fcff-495b-aeda-20a8de0737d0 \n",
"2 Sales 298d3508-76bb-4362-9ad4-f843fa3f99fa \n",
"3 NaN cad2f705-4b22-4110-bb06-b34a47c62a6d \n",
"4 Marketing 24c31ad9-95a9-479c-87c5-de6af06ddef6 \n",
"\n",
" EVENT_TIMESTAMP ENTITY_TYPE \n",
"0 2022-12-31T18:14:06Z user \n",
"1 2022-06-20T15:25:47Z user \n",
"2 2022-10-30T20:49:56Z user \n",
"3 2022-05-30T19:30:26Z user \n",
"4 2022-12-05T07:48:39Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3576, 20)\n",
"Test scores\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_LABEL</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_LABEL\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 3389\n",
"1 187\n",
"Name: EVENT_LABEL, dtype: int64\n",
"0 0.952531\n",
"1 0.047469\n",
"Name: EVENT_LABEL, dtype: float64\n",
"========= \n",
"\n",
"Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
"vehicleloan\n",
"Train set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_ID</th>\n",
" <th>disbursed_amount</th>\n",
" <th>asset_cost</th>\n",
" <th>ltv</th>\n",
" <th>branch_id</th>\n",
" <th>supplier_id</th>\n",
" <th>manufacturer_id</th>\n",
" <th>current_pincode_id</th>\n",
" <th>date_of_birth</th>\n",
" <th>employment_type</th>\n",
" <th>state_id</th>\n",
" <th>employee_code_id</th>\n",
" <th>mobileno_avl_flag</th>\n",
" <th>aadhar_flag</th>\n",
" <th>pan_flag</th>\n",
" <th>voterid_flag</th>\n",
" <th>driving_flag</th>\n",
" <th>passport_flag</th>\n",
" <th>perform_cns_score</th>\n",
" <th>perform_cns_score_description</th>\n",
" <th>pri_no_of_accts</th>\n",
" <th>pri_active_accts</th>\n",
" <th>pri_overdue_accts</th>\n",
" <th>pri_current_balance</th>\n",
" <th>pri_sanctioned_amount</th>\n",
" <th>pri_disbursed_amount</th>\n",
" <th>sec_no_of_accts</th>\n",
" <th>sec_active_accts</th>\n",
" <th>sec_overdue_accts</th>\n",
" <th>sec_current_balance</th>\n",
" <th>sec_sanctioned_amount</th>\n",
" <th>sec_disbursed_amount</th>\n",
" <th>primary_instal_amt</th>\n",
" <th>sec_instal_amt</th>\n",
" <th>new_accts_in_last_six_months</th>\n",
" <th>delinquent_accts_in_last_six_months</th>\n",
" <th>average_acct_age</th>\n",
" <th>credit_history_length</th>\n",
" <th>no_of_inquiries</th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>LABEL_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8976</th>\n",
" <td>462711</td>\n",
" <td>33484</td>\n",
" <td>62644</td>\n",
" <td>55.23</td>\n",
" <td>67</td>\n",
" <td>22727</td>\n",
" <td>45</td>\n",
" <td>1511</td>\n",
" <td>16-06-1991</td>\n",
" <td>Salaried</td>\n",
" <td>6</td>\n",
" <td>1201</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>743</td>\n",
" <td>C-Very Low Risk</td>\n",
" <td>9</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>160423</td>\n",
" <td>230489</td>\n",
" <td>194538</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>9149</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0yrs 7mon</td>\n",
" <td>1yrs 4mon</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>27b9d5e1-69de-47f2-a559-cfba34dffb5f</td>\n",
" <td>2022-09-20T06:58:09Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76007</th>\n",
" <td>558674</td>\n",
" <td>66882</td>\n",
" <td>81187</td>\n",
" <td>84.37</td>\n",
" <td>2</td>\n",
" <td>23508</td>\n",
" <td>86</td>\n",
" <td>1708</td>\n",
" <td>15-09-1994</td>\n",
" <td>Salaried</td>\n",
" <td>4</td>\n",
" <td>1060</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>No Bureau History Available</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1c58aced-df31-4170-8f85-e0dd95d1ff21</td>\n",
" <td>2022-08-25T18:27:59Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77677</th>\n",
" <td>528251</td>\n",
" <td>59113</td>\n",
" <td>71757</td>\n",
" <td>84.87</td>\n",
" <td>48</td>\n",
" <td>21478</td>\n",
" <td>86</td>\n",
" <td>6322</td>\n",
" <td>01-01-1995</td>\n",
" <td>Self employed</td>\n",
" <td>5</td>\n",
" <td>1189</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>738</td>\n",
" <td>C-Very Low Risk</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>45828</td>\n",
" <td>58582</td>\n",
" <td>58582</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4240</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0yrs 2mon</td>\n",
" <td>0yrs 4mon</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>fa383d19-de52-4a71-8222-77e328fcf387</td>\n",
" <td>2022-10-13T07:51:51Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>209438</th>\n",
" <td>633950</td>\n",
" <td>56059</td>\n",
" <td>71307</td>\n",
" <td>81.34</td>\n",
" <td>146</td>\n",
" <td>18317</td>\n",
" <td>86</td>\n",
" <td>2989</td>\n",
" <td>01-01-1971</td>\n",
" <td>Salaried</td>\n",
" <td>14</td>\n",
" <td>2964</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>No Bureau History Available</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37</td>\n",
" <td>2022-08-09T09:25:01Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>143261</th>\n",
" <td>476747</td>\n",
" <td>56759</td>\n",
" <td>67100</td>\n",
" <td>85.69</td>\n",
" <td>136</td>\n",
" <td>17783</td>\n",
" <td>86</td>\n",
" <td>3793</td>\n",
" <td>03-12-1975</td>\n",
" <td>Self employed</td>\n",
" <td>8</td>\n",
" <td>1295</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>No Bureau History Available</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>e00bb721-ce37-4d32-99e8-84f8a46cf82f</td>\n",
" <td>2022-06-27T20:32:23Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_ID disbursed_amount asset_cost ltv branch_id supplier_id \\\n",
"8976 462711 33484 62644 55.23 67 22727 \n",
"76007 558674 66882 81187 84.37 2 23508 \n",
"77677 528251 59113 71757 84.87 48 21478 \n",
"209438 633950 56059 71307 81.34 146 18317 \n",
"143261 476747 56759 67100 85.69 136 17783 \n",
"\n",
" manufacturer_id current_pincode_id date_of_birth employment_type \\\n",
"8976 45 1511 16-06-1991 Salaried \n",
"76007 86 1708 15-09-1994 Salaried \n",
"77677 86 6322 01-01-1995 Self employed \n",
"209438 86 2989 01-01-1971 Salaried \n",
"143261 86 3793 03-12-1975 Self employed \n",
"\n",
" state_id employee_code_id mobileno_avl_flag aadhar_flag pan_flag \\\n",
"8976 6 1201 1 1 0 \n",
"76007 4 1060 1 1 0 \n",
"77677 5 1189 1 1 0 \n",
"209438 14 2964 1 1 0 \n",
"143261 8 1295 1 1 0 \n",
"\n",
" voterid_flag driving_flag passport_flag perform_cns_score \\\n",
"8976 0 0 0 743 \n",
"76007 0 0 0 0 \n",
"77677 0 0 0 738 \n",
"209438 0 0 0 0 \n",
"143261 0 0 0 0 \n",
"\n",
" perform_cns_score_description pri_no_of_accts pri_active_accts \\\n",
"8976 C-Very Low Risk 9 5 \n",
"76007 No Bureau History Available 0 0 \n",
"77677 C-Very Low Risk 3 3 \n",
"209438 No Bureau History Available 0 0 \n",
"143261 No Bureau History Available 0 0 \n",
"\n",
" pri_overdue_accts pri_current_balance pri_sanctioned_amount \\\n",
"8976 0 160423 230489 \n",
"76007 0 0 0 \n",
"77677 0 45828 58582 \n",
"209438 0 0 0 \n",
"143261 0 0 0 \n",
"\n",
" pri_disbursed_amount sec_no_of_accts sec_active_accts \\\n",
"8976 194538 0 0 \n",
"76007 0 0 0 \n",
"77677 58582 0 0 \n",
"209438 0 0 0 \n",
"143261 0 0 0 \n",
"\n",
" sec_overdue_accts sec_current_balance sec_sanctioned_amount \\\n",
"8976 0 0 0 \n",
"76007 0 0 0 \n",
"77677 0 0 0 \n",
"209438 0 0 0 \n",
"143261 0 0 0 \n",
"\n",
" sec_disbursed_amount primary_instal_amt sec_instal_amt \\\n",
"8976 0 9149 0 \n",
"76007 0 0 0 \n",
"77677 0 4240 0 \n",
"209438 0 0 0 \n",
"143261 0 0 0 \n",
"\n",
" new_accts_in_last_six_months delinquent_accts_in_last_six_months \\\n",
"8976 4 0 \n",
"76007 0 0 \n",
"77677 3 0 \n",
"209438 0 0 \n",
"143261 0 0 \n",
"\n",
" average_acct_age credit_history_length no_of_inquiries EVENT_LABEL \\\n",
"8976 0yrs 7mon 1yrs 4mon 1 0 \n",
"76007 0yrs 0mon 0yrs 0mon 0 0 \n",
"77677 0yrs 2mon 0yrs 4mon 0 1 \n",
"209438 0yrs 0mon 0yrs 0mon 0 1 \n",
"143261 0yrs 0mon 0yrs 0mon 0 0 \n",
"\n",
" ENTITY_ID EVENT_TIMESTAMP \\\n",
"8976 27b9d5e1-69de-47f2-a559-cfba34dffb5f 2022-09-20T06:58:09Z \n",
"76007 1c58aced-df31-4170-8f85-e0dd95d1ff21 2022-08-25T18:27:59Z \n",
"77677 fa383d19-de52-4a71-8222-77e328fcf387 2022-10-13T07:51:51Z \n",
"209438 6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37 2022-08-09T09:25:01Z \n",
"143261 e00bb721-ce37-4d32-99e8-84f8a46cf82f 2022-06-27T20:32:23Z \n",
"\n",
" LABEL_TIMESTAMP ENTITY_TYPE \n",
"8976 2023-05-05T08:46:09Z user \n",
"76007 2023-05-05T08:46:09Z user \n",
"77677 2023-05-05T08:46:09Z user \n",
"209438 2023-05-05T08:46:09Z user \n",
"143261 2023-05-05T08:46:09Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"44\n",
"(186523, 44)\n",
"Test set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_ID</th>\n",
" <th>disbursed_amount</th>\n",
" <th>asset_cost</th>\n",
" <th>ltv</th>\n",
" <th>branch_id</th>\n",
" <th>supplier_id</th>\n",
" <th>manufacturer_id</th>\n",
" <th>current_pincode_id</th>\n",
" <th>date_of_birth</th>\n",
" <th>employment_type</th>\n",
" <th>state_id</th>\n",
" <th>employee_code_id</th>\n",
" <th>mobileno_avl_flag</th>\n",
" <th>aadhar_flag</th>\n",
" <th>pan_flag</th>\n",
" <th>voterid_flag</th>\n",
" <th>driving_flag</th>\n",
" <th>passport_flag</th>\n",
" <th>perform_cns_score</th>\n",
" <th>perform_cns_score_description</th>\n",
" <th>pri_no_of_accts</th>\n",
" <th>pri_active_accts</th>\n",
" <th>pri_overdue_accts</th>\n",
" <th>pri_current_balance</th>\n",
" <th>pri_sanctioned_amount</th>\n",
" <th>pri_disbursed_amount</th>\n",
" <th>sec_no_of_accts</th>\n",
" <th>sec_active_accts</th>\n",
" <th>sec_overdue_accts</th>\n",
" <th>sec_current_balance</th>\n",
" <th>sec_sanctioned_amount</th>\n",
" <th>sec_disbursed_amount</th>\n",
" <th>primary_instal_amt</th>\n",
" <th>sec_instal_amt</th>\n",
" <th>new_accts_in_last_six_months</th>\n",
" <th>delinquent_accts_in_last_six_months</th>\n",
" <th>average_acct_age</th>\n",
" <th>credit_history_length</th>\n",
" <th>no_of_inquiries</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>420825</td>\n",
" <td>50578</td>\n",
" <td>58400</td>\n",
" <td>89.55</td>\n",
" <td>67</td>\n",
" <td>22807</td>\n",
" <td>45</td>\n",
" <td>1441</td>\n",
" <td>01-01-1984</td>\n",
" <td>Salaried</td>\n",
" <td>6</td>\n",
" <td>1998</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>No Bureau History Available</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0yrs 0mon</td>\n",
" <td>0</td>\n",
" <td>03cf53e2-5c0b-4809-8333-04560101987b</td>\n",
" <td>2022-12-29T10:25:40Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>518279</td>\n",
" <td>54513</td>\n",
" <td>61900</td>\n",
" <td>89.66</td>\n",
" <td>67</td>\n",
" <td>22807</td>\n",
" <td>45</td>\n",
" <td>1501</td>\n",
" <td>08-09-1990</td>\n",
" <td>Self employed</td>\n",
" <td>6</td>\n",
" <td>1998</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>825</td>\n",
" <td>A-Very Low Risk</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1347</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1yrs 9mon</td>\n",
" <td>2yrs 0mon</td>\n",
" <td>0</td>\n",
" <td>03166b12-ee18-4144-aa73-10a3d2ac999a</td>\n",
" <td>2022-08-07T20:17:18Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>510278</td>\n",
" <td>43894</td>\n",
" <td>61900</td>\n",
" <td>71.89</td>\n",
" <td>67</td>\n",
" <td>22807</td>\n",
" <td>45</td>\n",
" <td>1501</td>\n",
" <td>04-10-1989</td>\n",
" <td>Salaried</td>\n",
" <td>6</td>\n",
" <td>1998</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17</td>\n",
" <td>Not Scored: Not Enough Info available on the customer</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>72879</td>\n",
" <td>74500</td>\n",
" <td>74500</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0yrs 2mon</td>\n",
" <td>0yrs 2mon</td>\n",
" <td>0</td>\n",
" <td>ff0fc8f9-c524-45cc-99b4-139dd726d7cd</td>\n",
" <td>2022-11-03T09:35:54Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>510980</td>\n",
" <td>52603</td>\n",
" <td>61300</td>\n",
" <td>86.95</td>\n",
" <td>67</td>\n",
" <td>22807</td>\n",
" <td>45</td>\n",
" <td>1492</td>\n",
" <td>01-06-1968</td>\n",
" <td>Salaried</td>\n",
" <td>6</td>\n",
" <td>1998</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>818</td>\n",
" <td>A-Very Low Risk</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2608</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1yrs 7mon</td>\n",
" <td>1yrs 7mon</td>\n",
" <td>0</td>\n",
" <td>8955bac7-5812-4e5f-b3ae-22738ee5e701</td>\n",
" <td>2023-02-19T06:55:03Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>513916</td>\n",
" <td>57713</td>\n",
" <td>65750</td>\n",
" <td>89.28</td>\n",
" <td>67</td>\n",
" <td>22807</td>\n",
" <td>45</td>\n",
" <td>1440</td>\n",
" <td>01-06-1976</td>\n",
" <td>Self employed</td>\n",
" <td>6</td>\n",
" <td>1998</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>300</td>\n",
" <td>M-Very High Risk</td>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>29069</td>\n",
" <td>1067200</td>\n",
" <td>1067200</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>47100</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2yrs 6mon</td>\n",
" <td>5yrs 6mon</td>\n",
" <td>0</td>\n",
" <td>a8154baa-1407-493a-bbc2-4bc1fd30d1f9</td>\n",
" <td>2022-08-14T11:20:39Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_ID disbursed_amount asset_cost ltv branch_id supplier_id \\\n",
"0 420825 50578 58400 89.55 67 22807 \n",
"1 518279 54513 61900 89.66 67 22807 \n",
"2 510278 43894 61900 71.89 67 22807 \n",
"3 510980 52603 61300 86.95 67 22807 \n",
"4 513916 57713 65750 89.28 67 22807 \n",
"\n",
" manufacturer_id current_pincode_id date_of_birth employment_type state_id \\\n",
"0 45 1441 01-01-1984 Salaried 6 \n",
"1 45 1501 08-09-1990 Self employed 6 \n",
"2 45 1501 04-10-1989 Salaried 6 \n",
"3 45 1492 01-06-1968 Salaried 6 \n",
"4 45 1440 01-06-1976 Self employed 6 \n",
"\n",
" employee_code_id mobileno_avl_flag aadhar_flag pan_flag voterid_flag \\\n",
"0 1998 1 1 0 0 \n",
"1 1998 1 1 0 0 \n",
"2 1998 1 1 0 0 \n",
"3 1998 1 0 0 1 \n",
"4 1998 1 1 0 0 \n",
"\n",
" driving_flag passport_flag perform_cns_score \\\n",
"0 0 0 0 \n",
"1 0 0 825 \n",
"2 0 0 17 \n",
"3 0 0 818 \n",
"4 0 0 300 \n",
"\n",
" perform_cns_score_description pri_no_of_accts \\\n",
"0 No Bureau History Available 0 \n",
"1 A-Very Low Risk 2 \n",
"2 Not Scored: Not Enough Info available on the customer 1 \n",
"3 A-Very Low Risk 1 \n",
"4 M-Very High Risk 6 \n",
"\n",
" pri_active_accts pri_overdue_accts pri_current_balance \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 1 0 72879 \n",
"3 0 0 0 \n",
"4 4 2 29069 \n",
"\n",
" pri_sanctioned_amount pri_disbursed_amount sec_no_of_accts sec_active_accts \\\n",
"0 0 0 0 0 \n",
"1 0 0 0 0 \n",
"2 74500 74500 0 0 \n",
"3 0 0 0 0 \n",
"4 1067200 1067200 0 0 \n",
"\n",
" sec_overdue_accts sec_current_balance sec_sanctioned_amount \\\n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" sec_disbursed_amount primary_instal_amt sec_instal_amt \\\n",
"0 0 0 0 \n",
"1 0 1347 0 \n",
"2 0 0 0 \n",
"3 0 2608 0 \n",
"4 0 47100 0 \n",
"\n",
" new_accts_in_last_six_months delinquent_accts_in_last_six_months \\\n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 0 0 \n",
"4 1 1 \n",
"\n",
" average_acct_age credit_history_length no_of_inquiries \\\n",
"0 0yrs 0mon 0yrs 0mon 0 \n",
"1 1yrs 9mon 2yrs 0mon 0 \n",
"2 0yrs 2mon 0yrs 2mon 0 \n",
"3 1yrs 7mon 1yrs 7mon 0 \n",
"4 2yrs 6mon 5yrs 6mon 0 \n",
"\n",
" ENTITY_ID EVENT_TIMESTAMP ENTITY_TYPE \n",
"0 03cf53e2-5c0b-4809-8333-04560101987b 2022-12-29T10:25:40Z user \n",
"1 03166b12-ee18-4144-aa73-10a3d2ac999a 2022-08-07T20:17:18Z user \n",
"2 ff0fc8f9-c524-45cc-99b4-139dd726d7cd 2022-11-03T09:35:54Z user \n",
"3 8955bac7-5812-4e5f-b3ae-22738ee5e701 2023-02-19T06:55:03Z user \n",
"4 a8154baa-1407-493a-bbc2-4bc1fd30d1f9 2022-08-14T11:20:39Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(46631, 42)\n",
"Test scores\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_LABEL</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_LABEL\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 36323\n",
"1 10308\n",
"Name: EVENT_LABEL, dtype: int64\n",
"0 0.783925\n",
"1 0.216075\n",
"Name: EVENT_LABEL, dtype: float64\n",
"========= \n",
"\n",
"Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n",
"malurl\n",
"Train set: \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>url</th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>EVENT_ID</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>LABEL_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" <th>dummy_cat</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>167113</th>\n",
" <td>apolloduck.co.za/</td>\n",
" <td>0</td>\n",
" <td>d16773dd-0077-4129-a39d-f935464bd07f</td>\n",
" <td>5e694594-fcfa-418e-8417-21c5e99b8d8a</td>\n",
" <td>2022-05-15T15:36:37Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" <td>87edb1a6-7936-4afa-b7be-4c35b7f1a5c6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>387680</th>\n",
" <td>acronyms.thefreedictionary.com/WDOM</td>\n",
" <td>0</td>\n",
" <td>b40b1f9e-9218-4a65-8b8e-870d45feb368</td>\n",
" <td>8d1aea20-97bb-46c4-bf56-3dc935f5c116</td>\n",
" <td>2022-06-28T06:32:21Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" <td>864a0704-ab05-49c3-8a0c-5b0b23b3eeef</td>\n",
" </tr>\n",
" <tr>\n",
" <th>528900</th>\n",
" <td>https://nepan.org.np/Alibaba/Alibaba.com/Login.htm</td>\n",
" <td>1</td>\n",
" <td>86c52fda-2f6f-41ee-aa15-a7b682138cc9</td>\n",
" <td>fce90a90-3ce2-475c-ac7d-a0d6c8fa784a</td>\n",
" <td>2022-06-11T21:40:20Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" <td>7ef071fc-a143-4d52-bd88-2a21f2b16c56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>251286</th>\n",
" <td>soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes</td>\n",
" <td>0</td>\n",
" <td>447529b9-923c-43e0-afed-c570e037f1aa</td>\n",
" <td>c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546</td>\n",
" <td>2022-08-15T12:11:14Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" <td>2709ea1a-f5a7-4ecc-8dbe-767910778226</td>\n",
" </tr>\n",
" <tr>\n",
" <th>433650</th>\n",
" <td>ottawakiosk.com/hill_cam.html</td>\n",
" <td>0</td>\n",
" <td>976080b6-500f-4de3-95c4-a4c2679e672b</td>\n",
" <td>21497a05-52ce-4a25-a4d4-361b8298dbc1</td>\n",
" <td>2022-08-19T15:47:51Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" <td>752bff63-ad3b-4845-b975-7f6f7302402c</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" url \\\n",
"167113 apolloduck.co.za/ \n",
"387680 acronyms.thefreedictionary.com/WDOM \n",
"528900 https://nepan.org.np/Alibaba/Alibaba.com/Login.htm \n",
"251286 soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes \n",
"433650 ottawakiosk.com/hill_cam.html \n",
"\n",
" EVENT_LABEL EVENT_ID \\\n",
"167113 0 d16773dd-0077-4129-a39d-f935464bd07f \n",
"387680 0 b40b1f9e-9218-4a65-8b8e-870d45feb368 \n",
"528900 1 86c52fda-2f6f-41ee-aa15-a7b682138cc9 \n",
"251286 0 447529b9-923c-43e0-afed-c570e037f1aa \n",
"433650 0 976080b6-500f-4de3-95c4-a4c2679e672b \n",
"\n",
" ENTITY_ID EVENT_TIMESTAMP \\\n",
"167113 5e694594-fcfa-418e-8417-21c5e99b8d8a 2022-05-15T15:36:37Z \n",
"387680 8d1aea20-97bb-46c4-bf56-3dc935f5c116 2022-06-28T06:32:21Z \n",
"528900 fce90a90-3ce2-475c-ac7d-a0d6c8fa784a 2022-06-11T21:40:20Z \n",
"251286 c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546 2022-08-15T12:11:14Z \n",
"433650 21497a05-52ce-4a25-a4d4-361b8298dbc1 2022-08-19T15:47:51Z \n",
"\n",
" LABEL_TIMESTAMP ENTITY_TYPE dummy_cat \n",
"167113 2023-05-05T08:46:09Z user 87edb1a6-7936-4afa-b7be-4c35b7f1a5c6 \n",
"387680 2023-05-05T08:46:09Z user 864a0704-ab05-49c3-8a0c-5b0b23b3eeef \n",
"528900 2023-05-05T08:46:09Z user 7ef071fc-a143-4d52-bd88-2a21f2b16c56 \n",
"251286 2023-05-05T08:46:09Z user 2709ea1a-f5a7-4ecc-8dbe-767910778226 \n",
"433650 2023-05-05T08:46:09Z user 752bff63-ad3b-4845-b975-7f6f7302402c "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"8\n",
"(586072, 8)\n",
"Test set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>url</th>\n",
" <th>EVENT_ID</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" <th>dummy_cat</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html</td>\n",
" <td>b4233390-3167-401d-a85f-27331078ff27</td>\n",
" <td>3fd82c9f-b26a-44dc-ac26-4a635690938c</td>\n",
" <td>2022-11-20T12:29:18Z</td>\n",
" <td>user</td>\n",
" <td>f45a2001-81b6-4b29-bba9-e376cc9a4ca9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>cyndislist.com/us/pa/counties</td>\n",
" <td>77d73435-251f-43fa-a82c-cc6ab4dbce6b</td>\n",
" <td>7ac20b7a-ee66-46ce-83da-703e095e9c87</td>\n",
" <td>2022-12-26T07:01:46Z</td>\n",
" <td>user</td>\n",
" <td>a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ</td>\n",
" <td>87a47093-0039-445f-8002-87b6af3e709d</td>\n",
" <td>eaea621e-895d-43cf-8bbb-93acac029c47</td>\n",
" <td>2022-06-25T00:29:41Z</td>\n",
" <td>user</td>\n",
" <td>20e00a79-d5fc-49d1-b563-173e69f09434</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery</td>\n",
" <td>3143022e-ce02-441b-8ad0-5ebbf3c1c829</td>\n",
" <td>ba97f126-6159-4655-9c11-807c99807059</td>\n",
" <td>2023-03-07T14:27:10Z</td>\n",
" <td>user</td>\n",
" <td>5398bd49-ce09-4438-bfc3-24fce419c612</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>kitsapsun.com/photos/2011/feb/25/177999/</td>\n",
" <td>8885745c-4494-4f04-92a0-bb57006fe7aa</td>\n",
" <td>b51cdf46-1467-45f0-9c9c-62233be01d0e</td>\n",
" <td>2022-12-07T01:31:11Z</td>\n",
" <td>user</td>\n",
" <td>0ac04255-86df-47bc-8990-557f4c65fe0d</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" url \\\n",
"0 http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html \n",
"1 cyndislist.com/us/pa/counties \n",
"2 https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ \n",
"3 articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery \n",
"4 kitsapsun.com/photos/2011/feb/25/177999/ \n",
"\n",
" EVENT_ID ENTITY_ID \\\n",
"0 b4233390-3167-401d-a85f-27331078ff27 3fd82c9f-b26a-44dc-ac26-4a635690938c \n",
"1 77d73435-251f-43fa-a82c-cc6ab4dbce6b 7ac20b7a-ee66-46ce-83da-703e095e9c87 \n",
"2 87a47093-0039-445f-8002-87b6af3e709d eaea621e-895d-43cf-8bbb-93acac029c47 \n",
"3 3143022e-ce02-441b-8ad0-5ebbf3c1c829 ba97f126-6159-4655-9c11-807c99807059 \n",
"4 8885745c-4494-4f04-92a0-bb57006fe7aa b51cdf46-1467-45f0-9c9c-62233be01d0e \n",
"\n",
" EVENT_TIMESTAMP ENTITY_TYPE dummy_cat \n",
"0 2022-11-20T12:29:18Z user f45a2001-81b6-4b29-bba9-e376cc9a4ca9 \n",
"1 2022-12-26T07:01:46Z user a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2 \n",
"2 2022-06-25T00:29:41Z user 20e00a79-d5fc-49d1-b563-173e69f09434 \n",
"3 2023-03-07T14:27:10Z user 5398bd49-ce09-4438-bfc3-24fce419c612 \n",
"4 2022-12-07T01:31:11Z user 0ac04255-86df-47bc-8990-557f4c65fe0d "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(65119, 6)\n",
"Test scores\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>EVENT_ID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>b4233390-3167-401d-a85f-27331078ff27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>77d73435-251f-43fa-a82c-cc6ab4dbce6b</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>87a47093-0039-445f-8002-87b6af3e709d</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>3143022e-ce02-441b-8ad0-5ebbf3c1c829</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>8885745c-4494-4f04-92a0-bb57006fe7aa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_LABEL EVENT_ID\n",
"0 0 b4233390-3167-401d-a85f-27331078ff27\n",
"1 0 77d73435-251f-43fa-a82c-cc6ab4dbce6b\n",
"2 1 87a47093-0039-445f-8002-87b6af3e709d\n",
"3 0 3143022e-ce02-441b-8ad0-5ebbf3c1c829\n",
"4 0 8885745c-4494-4f04-92a0-bb57006fe7aa"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 42695\n",
"1 22424\n",
"Name: EVENT_LABEL, dtype: int64\n",
"0 0.657612\n",
"1 0.342388\n",
"Name: EVENT_LABEL, dtype: float64\n",
"========= \n",
"\n",
"Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ieeecis\n",
"Train set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>transactionamt</th>\n",
" <th>productcd</th>\n",
" <th>card1</th>\n",
" <th>card2</th>\n",
" <th>card3</th>\n",
" <th>card5</th>\n",
" <th>card6</th>\n",
" <th>addr1</th>\n",
" <th>dist1</th>\n",
" <th>p_emaildomain</th>\n",
" <th>r_emaildomain</th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" <th>c4</th>\n",
" <th>c5</th>\n",
" <th>c6</th>\n",
" <th>c7</th>\n",
" <th>c8</th>\n",
" <th>c9</th>\n",
" <th>c10</th>\n",
" <th>c11</th>\n",
" <th>c12</th>\n",
" <th>c13</th>\n",
" <th>c14</th>\n",
" <th>v62</th>\n",
" <th>v70</th>\n",
" <th>v76</th>\n",
" <th>v78</th>\n",
" <th>v82</th>\n",
" <th>v91</th>\n",
" <th>v127</th>\n",
" <th>v130</th>\n",
" <th>v139</th>\n",
" <th>v160</th>\n",
" <th>v165</th>\n",
" <th>v187</th>\n",
" <th>v203</th>\n",
" <th>v207</th>\n",
" <th>v209</th>\n",
" <th>v210</th>\n",
" <th>v221</th>\n",
" <th>v234</th>\n",
" <th>v257</th>\n",
" <th>v258</th>\n",
" <th>v261</th>\n",
" <th>v264</th>\n",
" <th>v266</th>\n",
" <th>v267</th>\n",
" <th>v271</th>\n",
" <th>v274</th>\n",
" <th>v277</th>\n",
" <th>v283</th>\n",
" <th>v285</th>\n",
" <th>v289</th>\n",
" <th>v291</th>\n",
" <th>v294</th>\n",
" <th>id_01</th>\n",
" <th>id_02</th>\n",
" <th>id_05</th>\n",
" <th>id_06</th>\n",
" <th>id_09</th>\n",
" <th>id_13</th>\n",
" <th>id_17</th>\n",
" <th>id_19</th>\n",
" <th>id_20</th>\n",
" <th>devicetype</th>\n",
" <th>deviceinfo</th>\n",
" <th>EVENT_ID</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>LABEL_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" <tr>\n",
" <th>TransactionID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2987000.0</th>\n",
" <td>0</td>\n",
" <td>68.5</td>\n",
" <td>W</td>\n",
" <td>13926.0</td>\n",
" <td>NaN</td>\n",
" <td>150.0</td>\n",
" <td>142.0</td>\n",
" <td>credit</td>\n",
" <td>315.0</td>\n",
" <td>19.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>117.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7</td>\n",
" <td>13926.0_315.0_-13.0</td>\n",
" <td>2021-01-02T00:00:00Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2987001.0</th>\n",
" <td>0</td>\n",
" <td>29.0</td>\n",
" <td>W</td>\n",
" <td>2755.0</td>\n",
" <td>404.0</td>\n",
" <td>150.0</td>\n",
" <td>102.0</td>\n",
" <td>credit</td>\n",
" <td>325.0</td>\n",
" <td>NaN</td>\n",
" <td>gmail.com</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>9aa1d670-7446-4979-8c09-87f02311d2ca</td>\n",
" <td>2755.0_325.0_1.0</td>\n",
" <td>2021-01-02T00:00:01Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2987002.0</th>\n",
" <td>0</td>\n",
" <td>59.0</td>\n",
" <td>W</td>\n",
" <td>4663.0</td>\n",
" <td>490.0</td>\n",
" <td>150.0</td>\n",
" <td>166.0</td>\n",
" <td>debit</td>\n",
" <td>330.0</td>\n",
" <td>287.0</td>\n",
" <td>outlook.com</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3</td>\n",
" <td>4663.0_330.0_1.0</td>\n",
" <td>2021-01-02T00:01:09Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2987003.0</th>\n",
" <td>0</td>\n",
" <td>50.0</td>\n",
" <td>W</td>\n",
" <td>18132.0</td>\n",
" <td>567.0</td>\n",
" <td>150.0</td>\n",
" <td>117.0</td>\n",
" <td>debit</td>\n",
" <td>476.0</td>\n",
" <td>NaN</td>\n",
" <td>yahoo.com</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>25.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1758.0</td>\n",
" <td>354.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>10.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>38.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>d3e3803c-b1a3-4dfd-841d-30b8d2611364</td>\n",
" <td>18132.0_476.0_-111.0</td>\n",
" <td>2021-01-02T00:01:39Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2987004.0</th>\n",
" <td>0</td>\n",
" <td>50.0</td>\n",
" <td>H</td>\n",
" <td>4497.0</td>\n",
" <td>514.0</td>\n",
" <td>150.0</td>\n",
" <td>102.0</td>\n",
" <td>credit</td>\n",
" <td>420.0</td>\n",
" <td>NaN</td>\n",
" <td>gmail.com</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>169690.796875</td>\n",
" <td>5155.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>70787.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>166.0</td>\n",
" <td>542.0</td>\n",
" <td>144.0</td>\n",
" <td>mobile</td>\n",
" <td>SAMSUNG SM-G892A Build/NRD90M</td>\n",
" <td>2c013afb-7779-45db-a330-a5808d531372</td>\n",
" <td>4497.0_420.0_1.0</td>\n",
" <td>2021-01-02T00:01:46Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_LABEL transactionamt productcd card1 card2 card3 \\\n",
"TransactionID \n",
"2987000.0 0 68.5 W 13926.0 NaN 150.0 \n",
"2987001.0 0 29.0 W 2755.0 404.0 150.0 \n",
"2987002.0 0 59.0 W 4663.0 490.0 150.0 \n",
"2987003.0 0 50.0 W 18132.0 567.0 150.0 \n",
"2987004.0 0 50.0 H 4497.0 514.0 150.0 \n",
"\n",
" card5 card6 addr1 dist1 p_emaildomain r_emaildomain c1 \\\n",
"TransactionID \n",
"2987000.0 142.0 credit 315.0 19.0 NaN NaN 1.0 \n",
"2987001.0 102.0 credit 325.0 NaN gmail.com NaN 1.0 \n",
"2987002.0 166.0 debit 330.0 287.0 outlook.com NaN 1.0 \n",
"2987003.0 117.0 debit 476.0 NaN yahoo.com NaN 2.0 \n",
"2987004.0 102.0 credit 420.0 NaN gmail.com NaN 1.0 \n",
"\n",
" c2 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 \\\n",
"TransactionID \n",
"2987000.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 \n",
"2987001.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 \n",
"2987002.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 \n",
"2987003.0 5.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 \n",
"2987004.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 \n",
"\n",
" v62 v70 v76 v78 v82 v91 v127 v130 v139 \\\n",
"TransactionID \n",
"2987000.0 1.0 0.0 1.0 1.0 0.0 0.0 117.0 0.0 NaN \n",
"2987001.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 NaN \n",
"2987002.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 NaN \n",
"2987003.0 1.0 0.0 1.0 1.0 1.0 0.0 1758.0 354.0 NaN \n",
"2987004.0 NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 \n",
"\n",
" v160 v165 v187 v203 v207 v209 v210 v221 \\\n",
"TransactionID \n",
"2987000.0 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987001.0 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987002.0 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987003.0 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987004.0 169690.796875 5155.0 1.0 0.0 0.0 0.0 0.0 1.0 \n",
"\n",
" v234 v257 v258 v261 v264 v266 v267 v271 v274 v277 \\\n",
"TransactionID \n",
"2987000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987001.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987002.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987003.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2987004.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" v283 v285 v289 v291 v294 id_01 id_02 id_05 id_06 \\\n",
"TransactionID \n",
"2987000.0 1.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN \n",
"2987001.0 1.0 0.0 0.0 1.0 0.0 NaN NaN NaN NaN \n",
"2987002.0 1.0 0.0 0.0 1.0 0.0 NaN NaN NaN NaN \n",
"2987003.0 0.0 10.0 0.0 1.0 38.0 NaN NaN NaN NaN \n",
"2987004.0 1.0 0.0 0.0 1.0 0.0 0.0 70787.0 NaN NaN \n",
"\n",
" id_09 id_13 id_17 id_19 id_20 devicetype \\\n",
"TransactionID \n",
"2987000.0 NaN NaN NaN NaN NaN NaN \n",
"2987001.0 NaN NaN NaN NaN NaN NaN \n",
"2987002.0 NaN NaN NaN NaN NaN NaN \n",
"2987003.0 NaN NaN NaN NaN NaN NaN \n",
"2987004.0 NaN NaN 166.0 542.0 144.0 mobile \n",
"\n",
" deviceinfo \\\n",
"TransactionID \n",
"2987000.0 NaN \n",
"2987001.0 NaN \n",
"2987002.0 NaN \n",
"2987003.0 NaN \n",
"2987004.0 SAMSUNG SM-G892A Build/NRD90M \n",
"\n",
" EVENT_ID ENTITY_ID \\\n",
"TransactionID \n",
"2987000.0 c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7 13926.0_315.0_-13.0 \n",
"2987001.0 9aa1d670-7446-4979-8c09-87f02311d2ca 2755.0_325.0_1.0 \n",
"2987002.0 4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3 4663.0_330.0_1.0 \n",
"2987003.0 d3e3803c-b1a3-4dfd-841d-30b8d2611364 18132.0_476.0_-111.0 \n",
"2987004.0 2c013afb-7779-45db-a330-a5808d531372 4497.0_420.0_1.0 \n",
"\n",
" EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n",
"TransactionID \n",
"2987000.0 2021-01-02T00:00:00Z 2023-05-05T08:46:09Z user \n",
"2987001.0 2021-01-02T00:00:01Z 2023-05-05T08:46:09Z user \n",
"2987002.0 2021-01-02T00:01:09Z 2023-05-05T08:46:09Z user \n",
"2987003.0 2021-01-02T00:01:39Z 2023-05-05T08:46:09Z user \n",
"2987004.0 2021-01-02T00:01:46Z 2023-05-05T08:46:09Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"73\n",
"(561013, 73)\n",
"Test set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>transactionamt</th>\n",
" <th>productcd</th>\n",
" <th>card1</th>\n",
" <th>card2</th>\n",
" <th>card3</th>\n",
" <th>card5</th>\n",
" <th>card6</th>\n",
" <th>addr1</th>\n",
" <th>dist1</th>\n",
" <th>p_emaildomain</th>\n",
" <th>r_emaildomain</th>\n",
" <th>c1</th>\n",
" <th>c2</th>\n",
" <th>c4</th>\n",
" <th>c5</th>\n",
" <th>c6</th>\n",
" <th>c7</th>\n",
" <th>c8</th>\n",
" <th>c9</th>\n",
" <th>c10</th>\n",
" <th>c11</th>\n",
" <th>c12</th>\n",
" <th>c13</th>\n",
" <th>c14</th>\n",
" <th>v62</th>\n",
" <th>v70</th>\n",
" <th>v76</th>\n",
" <th>v78</th>\n",
" <th>v82</th>\n",
" <th>v91</th>\n",
" <th>v127</th>\n",
" <th>v130</th>\n",
" <th>v139</th>\n",
" <th>v160</th>\n",
" <th>v165</th>\n",
" <th>v187</th>\n",
" <th>v203</th>\n",
" <th>v207</th>\n",
" <th>v209</th>\n",
" <th>v210</th>\n",
" <th>v221</th>\n",
" <th>v234</th>\n",
" <th>v257</th>\n",
" <th>v258</th>\n",
" <th>v261</th>\n",
" <th>v264</th>\n",
" <th>v266</th>\n",
" <th>v267</th>\n",
" <th>v271</th>\n",
" <th>v274</th>\n",
" <th>v277</th>\n",
" <th>v283</th>\n",
" <th>v285</th>\n",
" <th>v289</th>\n",
" <th>v291</th>\n",
" <th>v294</th>\n",
" <th>id_01</th>\n",
" <th>id_02</th>\n",
" <th>id_05</th>\n",
" <th>id_06</th>\n",
" <th>id_09</th>\n",
" <th>id_13</th>\n",
" <th>id_17</th>\n",
" <th>id_19</th>\n",
" <th>id_20</th>\n",
" <th>devicetype</th>\n",
" <th>deviceinfo</th>\n",
" <th>EVENT_ID</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" <tr>\n",
" <th>TransactionID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3548013.0</th>\n",
" <td>125.000000</td>\n",
" <td>S</td>\n",
" <td>15775.0</td>\n",
" <td>481.0</td>\n",
" <td>150.0</td>\n",
" <td>102.0</td>\n",
" <td>credit</td>\n",
" <td>330.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>yahoo.com</td>\n",
" <td>5.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>61.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>109411.000000</td>\n",
" <td>2301.000000</td>\n",
" <td>0.0</td>\n",
" <td>2401.0</td>\n",
" <td>66104.0</td>\n",
" <td>1.0</td>\n",
" <td>103183.0</td>\n",
" <td>877.0</td>\n",
" <td>1961.0</td>\n",
" <td>465.0</td>\n",
" <td>0.0</td>\n",
" <td>73.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>26.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>926.0</td>\n",
" <td>-10.0</td>\n",
" <td>1411.0</td>\n",
" <td>6.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>52.0</td>\n",
" <td>166.0</td>\n",
" <td>633.0</td>\n",
" <td>533.0</td>\n",
" <td>desktop</td>\n",
" <td>Windows</td>\n",
" <td>569c4257-3d62-466d-a806-e3b456b2b372</td>\n",
" <td>15775.0_330.0_129.0</td>\n",
" <td>2021-06-21T23:11:15Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548014.0</th>\n",
" <td>125.000000</td>\n",
" <td>S</td>\n",
" <td>15775.0</td>\n",
" <td>481.0</td>\n",
" <td>150.0</td>\n",
" <td>102.0</td>\n",
" <td>credit</td>\n",
" <td>330.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>yahoo.com</td>\n",
" <td>5.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>61.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>109536.000000</td>\n",
" <td>2301.000000</td>\n",
" <td>0.0</td>\n",
" <td>2401.0</td>\n",
" <td>66229.0</td>\n",
" <td>1.0</td>\n",
" <td>103308.0</td>\n",
" <td>877.0</td>\n",
" <td>1961.0</td>\n",
" <td>465.0</td>\n",
" <td>0.0</td>\n",
" <td>73.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>26.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>927.0</td>\n",
" <td>-10.0</td>\n",
" <td>693.0</td>\n",
" <td>6.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>52.0</td>\n",
" <td>166.0</td>\n",
" <td>633.0</td>\n",
" <td>533.0</td>\n",
" <td>desktop</td>\n",
" <td>Windows</td>\n",
" <td>e951afe6-b895-42b8-adff-df0f812e9ee8</td>\n",
" <td>15775.0_330.0_129.0</td>\n",
" <td>2021-06-21T23:11:29Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548015.0</th>\n",
" <td>125.000000</td>\n",
" <td>S</td>\n",
" <td>15775.0</td>\n",
" <td>481.0</td>\n",
" <td>150.0</td>\n",
" <td>102.0</td>\n",
" <td>credit</td>\n",
" <td>330.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>yahoo.com</td>\n",
" <td>5.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>61.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>109661.000000</td>\n",
" <td>2301.000000</td>\n",
" <td>0.0</td>\n",
" <td>2401.0</td>\n",
" <td>66354.0</td>\n",
" <td>1.0</td>\n",
" <td>103433.0</td>\n",
" <td>877.0</td>\n",
" <td>1961.0</td>\n",
" <td>465.0</td>\n",
" <td>0.0</td>\n",
" <td>73.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>26.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>928.0</td>\n",
" <td>-10.0</td>\n",
" <td>1116.0</td>\n",
" <td>6.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>52.0</td>\n",
" <td>166.0</td>\n",
" <td>633.0</td>\n",
" <td>533.0</td>\n",
" <td>desktop</td>\n",
" <td>Windows</td>\n",
" <td>cd69e301-8c15-42b3-9839-cc4c8b9d89db</td>\n",
" <td>15775.0_330.0_129.0</td>\n",
" <td>2021-06-21T23:11:45Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548016.0</th>\n",
" <td>125.000000</td>\n",
" <td>S</td>\n",
" <td>15775.0</td>\n",
" <td>481.0</td>\n",
" <td>150.0</td>\n",
" <td>102.0</td>\n",
" <td>credit</td>\n",
" <td>330.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>yahoo.com</td>\n",
" <td>5.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>61.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>109786.000000</td>\n",
" <td>2301.000000</td>\n",
" <td>0.0</td>\n",
" <td>2401.0</td>\n",
" <td>66479.0</td>\n",
" <td>1.0</td>\n",
" <td>103558.0</td>\n",
" <td>877.0</td>\n",
" <td>1961.0</td>\n",
" <td>465.0</td>\n",
" <td>0.0</td>\n",
" <td>73.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>26.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>929.0</td>\n",
" <td>-10.0</td>\n",
" <td>1589.0</td>\n",
" <td>6.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>52.0</td>\n",
" <td>166.0</td>\n",
" <td>633.0</td>\n",
" <td>533.0</td>\n",
" <td>desktop</td>\n",
" <td>Windows</td>\n",
" <td>71431bc1-19ec-49b6-a00f-4e8c7d121b02</td>\n",
" <td>15775.0_330.0_129.0</td>\n",
" <td>2021-06-21T23:12:00Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548017.0</th>\n",
" <td>31.950001</td>\n",
" <td>W</td>\n",
" <td>9500.0</td>\n",
" <td>321.0</td>\n",
" <td>150.0</td>\n",
" <td>226.0</td>\n",
" <td>debit</td>\n",
" <td>204.0</td>\n",
" <td>74.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>27.950001</td>\n",
" <td>27.950001</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>de297b4c-d372-4fd3-8c66-ab6ff0c19e16</td>\n",
" <td>9500.0_204.0_150.0</td>\n",
" <td>2021-06-21T23:12:11Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" transactionamt productcd card1 card2 card3 card5 card6 \\\n",
"TransactionID \n",
"3548013.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n",
"3548014.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n",
"3548015.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n",
"3548016.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n",
"3548017.0 31.950001 W 9500.0 321.0 150.0 226.0 debit \n",
"\n",
" addr1 dist1 p_emaildomain r_emaildomain c1 c2 c4 c5 \\\n",
"TransactionID \n",
"3548013.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n",
"3548014.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n",
"3548015.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n",
"3548016.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n",
"3548017.0 204.0 74.0 NaN NaN 3.0 3.0 0.0 1.0 \n",
"\n",
" c6 c7 c8 c9 c10 c11 c12 c13 c14 v62 v70 v76 \\\n",
"TransactionID \n",
"3548013.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n",
"3548014.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n",
"3548015.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n",
"3548016.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n",
"3548017.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 6.0 3.0 1.0 1.0 1.0 \n",
"\n",
" v78 v82 v91 v127 v130 v139 v160 \\\n",
"TransactionID \n",
"3548013.0 NaN NaN NaN 109411.000000 2301.000000 0.0 2401.0 \n",
"3548014.0 NaN NaN NaN 109536.000000 2301.000000 0.0 2401.0 \n",
"3548015.0 NaN NaN NaN 109661.000000 2301.000000 0.0 2401.0 \n",
"3548016.0 NaN NaN NaN 109786.000000 2301.000000 0.0 2401.0 \n",
"3548017.0 2.0 1.0 1.0 27.950001 27.950001 NaN NaN \n",
"\n",
" v165 v187 v203 v207 v209 v210 v221 v234 \\\n",
"TransactionID \n",
"3548013.0 66104.0 1.0 103183.0 877.0 1961.0 465.0 0.0 73.0 \n",
"3548014.0 66229.0 1.0 103308.0 877.0 1961.0 465.0 0.0 73.0 \n",
"3548015.0 66354.0 1.0 103433.0 877.0 1961.0 465.0 0.0 73.0 \n",
"3548016.0 66479.0 1.0 103558.0 877.0 1961.0 465.0 0.0 73.0 \n",
"3548017.0 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" v257 v258 v261 v264 v266 v267 v271 v274 v277 v283 \\\n",
"TransactionID \n",
"3548013.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n",
"3548014.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n",
"3548015.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n",
"3548016.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n",
"3548017.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 \n",
"\n",
" v285 v289 v291 v294 id_01 id_02 id_05 id_06 id_09 \\\n",
"TransactionID \n",
"3548013.0 26.0 1.0 2.0 926.0 -10.0 1411.0 6.0 0.0 0.0 \n",
"3548014.0 26.0 1.0 2.0 927.0 -10.0 693.0 6.0 0.0 0.0 \n",
"3548015.0 26.0 1.0 2.0 928.0 -10.0 1116.0 6.0 0.0 0.0 \n",
"3548016.0 26.0 1.0 2.0 929.0 -10.0 1589.0 6.0 0.0 0.0 \n",
"3548017.0 1.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN \n",
"\n",
" id_13 id_17 id_19 id_20 devicetype deviceinfo \\\n",
"TransactionID \n",
"3548013.0 52.0 166.0 633.0 533.0 desktop Windows \n",
"3548014.0 52.0 166.0 633.0 533.0 desktop Windows \n",
"3548015.0 52.0 166.0 633.0 533.0 desktop Windows \n",
"3548016.0 52.0 166.0 633.0 533.0 desktop Windows \n",
"3548017.0 NaN NaN NaN NaN NaN NaN \n",
"\n",
" EVENT_ID ENTITY_ID \\\n",
"TransactionID \n",
"3548013.0 569c4257-3d62-466d-a806-e3b456b2b372 15775.0_330.0_129.0 \n",
"3548014.0 e951afe6-b895-42b8-adff-df0f812e9ee8 15775.0_330.0_129.0 \n",
"3548015.0 cd69e301-8c15-42b3-9839-cc4c8b9d89db 15775.0_330.0_129.0 \n",
"3548016.0 71431bc1-19ec-49b6-a00f-4e8c7d121b02 15775.0_330.0_129.0 \n",
"3548017.0 de297b4c-d372-4fd3-8c66-ab6ff0c19e16 9500.0_204.0_150.0 \n",
"\n",
" EVENT_TIMESTAMP ENTITY_TYPE \n",
"TransactionID \n",
"3548013.0 2021-06-21T23:11:15Z user \n",
"3548014.0 2021-06-21T23:11:29Z user \n",
"3548015.0 2021-06-21T23:11:45Z user \n",
"3548016.0 2021-06-21T23:12:00Z user \n",
"3548017.0 2021-06-21T23:12:11Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(29527, 71)\n",
"Test scores\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>EVENT_ID</th>\n",
" </tr>\n",
" <tr>\n",
" <th>TransactionID</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3548013.0</th>\n",
" <td>0</td>\n",
" <td>569c4257-3d62-466d-a806-e3b456b2b372</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548014.0</th>\n",
" <td>0</td>\n",
" <td>e951afe6-b895-42b8-adff-df0f812e9ee8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548015.0</th>\n",
" <td>0</td>\n",
" <td>cd69e301-8c15-42b3-9839-cc4c8b9d89db</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548016.0</th>\n",
" <td>0</td>\n",
" <td>71431bc1-19ec-49b6-a00f-4e8c7d121b02</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3548017.0</th>\n",
" <td>0</td>\n",
" <td>de297b4c-d372-4fd3-8c66-ab6ff0c19e16</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" EVENT_LABEL EVENT_ID\n",
"TransactionID \n",
"3548013.0 0 569c4257-3d62-466d-a806-e3b456b2b372\n",
"3548014.0 0 e951afe6-b895-42b8-adff-df0f812e9ee8\n",
"3548015.0 0 cd69e301-8c15-42b3-9839-cc4c8b9d89db\n",
"3548016.0 0 71431bc1-19ec-49b6-a00f-4e8c7d121b02\n",
"3548017.0 0 de297b4c-d372-4fd3-8c66-ab6ff0c19e16"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 28358\n",
"1 1169\n",
"Name: EVENT_LABEL, dtype: int64\n",
"0 0.965252\n",
"1 0.034748\n",
"Name: EVENT_LABEL, dtype: float64\n",
"========= \n",
"\n",
"Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ccfraud\n",
"Train set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>v1</th>\n",
" <th>v2</th>\n",
" <th>v3</th>\n",
" <th>v4</th>\n",
" <th>v5</th>\n",
" <th>v6</th>\n",
" <th>v7</th>\n",
" <th>v8</th>\n",
" <th>v9</th>\n",
" <th>v10</th>\n",
" <th>v11</th>\n",
" <th>v12</th>\n",
" <th>v13</th>\n",
" <th>v14</th>\n",
" <th>v15</th>\n",
" <th>v16</th>\n",
" <th>v17</th>\n",
" <th>v18</th>\n",
" <th>v19</th>\n",
" <th>v20</th>\n",
" <th>v21</th>\n",
" <th>v22</th>\n",
" <th>v23</th>\n",
" <th>v24</th>\n",
" <th>v25</th>\n",
" <th>v26</th>\n",
" <th>v27</th>\n",
" <th>v28</th>\n",
" <th>amount</th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>EVENT_ID</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>LABEL_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-1.3598071336738</td>\n",
" <td>-0.0727811733098497</td>\n",
" <td>2.53634673796914</td>\n",
" <td>1.37815522427443</td>\n",
" <td>-0.338320769942518</td>\n",
" <td>0.462387777762292</td>\n",
" <td>0.239598554061257</td>\n",
" <td>0.0986979012610507</td>\n",
" <td>0.363786969611213</td>\n",
" <td>0.0907941719789316</td>\n",
" <td>-0.551599533260813</td>\n",
" <td>-0.617800855762348</td>\n",
" <td>-0.991389847235408</td>\n",
" <td>-0.311169353699879</td>\n",
" <td>1.46817697209427</td>\n",
" <td>-0.470400525259478</td>\n",
" <td>0.207971241929242</td>\n",
" <td>0.0257905801985591</td>\n",
" <td>0.403992960255733</td>\n",
" <td>0.251412098239705</td>\n",
" <td>-0.018306777944153</td>\n",
" <td>0.277837575558899</td>\n",
" <td>-0.110473910188767</td>\n",
" <td>0.0669280749146731</td>\n",
" <td>0.128539358273528</td>\n",
" <td>-0.189114843888824</td>\n",
" <td>0.133558376740387</td>\n",
" <td>-0.0210530534538215</td>\n",
" <td>149.62</td>\n",
" <td>0</td>\n",
" <td>f8e77dc0-44ef-490c-b0de-8b4054b5a031</td>\n",
" <td>266103ff-71f2-4057-981d-a54821367237</td>\n",
" <td>2021-09-01T00:00:00Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.19185711131486</td>\n",
" <td>0.26615071205963</td>\n",
" <td>0.16648011335321</td>\n",
" <td>0.448154078460911</td>\n",
" <td>0.0600176492822243</td>\n",
" <td>-0.0823608088155687</td>\n",
" <td>-0.0788029833323113</td>\n",
" <td>0.0851016549148104</td>\n",
" <td>-0.255425128109186</td>\n",
" <td>-0.166974414004614</td>\n",
" <td>1.61272666105479</td>\n",
" <td>1.06523531137287</td>\n",
" <td>0.48909501589608</td>\n",
" <td>-0.143772296441519</td>\n",
" <td>0.635558093258208</td>\n",
" <td>0.463917041022171</td>\n",
" <td>-0.114804663102346</td>\n",
" <td>-0.183361270123994</td>\n",
" <td>-0.145783041325259</td>\n",
" <td>-0.0690831352230203</td>\n",
" <td>-0.225775248033138</td>\n",
" <td>-0.638671952771851</td>\n",
" <td>0.101288021253234</td>\n",
" <td>-0.339846475529127</td>\n",
" <td>0.167170404418143</td>\n",
" <td>0.125894532368176</td>\n",
" <td>-0.00898309914322813</td>\n",
" <td>0.0147241691924927</td>\n",
" <td>2.69</td>\n",
" <td>0</td>\n",
" <td>b557449e-6b35-4be0-991e-337f764f5e21</td>\n",
" <td>f85083b2-d31f-4b9e-9d49-eb85c0476f6e</td>\n",
" <td>2021-09-01T00:00:00Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-1.35835406159823</td>\n",
" <td>-1.34016307473609</td>\n",
" <td>1.77320934263119</td>\n",
" <td>0.379779593034328</td>\n",
" <td>-0.503198133318193</td>\n",
" <td>1.80049938079263</td>\n",
" <td>0.791460956450422</td>\n",
" <td>0.247675786588991</td>\n",
" <td>-1.51465432260583</td>\n",
" <td>0.207642865216696</td>\n",
" <td>0.624501459424895</td>\n",
" <td>0.066083685268831</td>\n",
" <td>0.717292731410831</td>\n",
" <td>-0.165945922763554</td>\n",
" <td>2.34586494901581</td>\n",
" <td>-2.89008319444231</td>\n",
" <td>1.10996937869599</td>\n",
" <td>-0.121359313195888</td>\n",
" <td>-2.26185709530414</td>\n",
" <td>0.524979725224404</td>\n",
" <td>0.247998153469754</td>\n",
" <td>0.771679401917229</td>\n",
" <td>0.909412262347719</td>\n",
" <td>-0.689280956490685</td>\n",
" <td>-0.327641833735251</td>\n",
" <td>-0.139096571514147</td>\n",
" <td>-0.0553527940384261</td>\n",
" <td>-0.0597518405929204</td>\n",
" <td>378.66</td>\n",
" <td>0</td>\n",
" <td>d78d879c-eb7c-455d-8fde-6b1205080a4a</td>\n",
" <td>237ca488-c695-402c-b30f-0544554ea96c</td>\n",
" <td>2021-09-01T00:01:00Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.966271711572087</td>\n",
" <td>-0.185226008082898</td>\n",
" <td>1.79299333957872</td>\n",
" <td>-0.863291275036453</td>\n",
" <td>-0.0103088796030823</td>\n",
" <td>1.24720316752486</td>\n",
" <td>0.23760893977178</td>\n",
" <td>0.377435874652262</td>\n",
" <td>-1.38702406270197</td>\n",
" <td>-0.0549519224713749</td>\n",
" <td>-0.226487263835401</td>\n",
" <td>0.178228225877303</td>\n",
" <td>0.507756869957169</td>\n",
" <td>-0.28792374549456</td>\n",
" <td>-0.631418117709045</td>\n",
" <td>-1.0596472454325</td>\n",
" <td>-0.684092786345479</td>\n",
" <td>1.96577500349538</td>\n",
" <td>-1.2326219700892</td>\n",
" <td>-0.208037781160366</td>\n",
" <td>-0.108300452035545</td>\n",
" <td>0.00527359678253453</td>\n",
" <td>-0.190320518742841</td>\n",
" <td>-1.17557533186321</td>\n",
" <td>0.647376034602038</td>\n",
" <td>-0.221928844458407</td>\n",
" <td>0.0627228487293033</td>\n",
" <td>0.0614576285006353</td>\n",
" <td>123.5</td>\n",
" <td>0</td>\n",
" <td>ef448a36-2763-449c-a54a-a9e05af20967</td>\n",
" <td>9964b305-b591-4ed0-bff1-8adca81d0194</td>\n",
" <td>2021-09-01T00:01:00Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-1.15823309349523</td>\n",
" <td>0.877736754848451</td>\n",
" <td>1.548717846511</td>\n",
" <td>0.403033933955121</td>\n",
" <td>-0.407193377311653</td>\n",
" <td>0.0959214624684256</td>\n",
" <td>0.592940745385545</td>\n",
" <td>-0.270532677192282</td>\n",
" <td>0.817739308235294</td>\n",
" <td>0.753074431976354</td>\n",
" <td>-0.822842877946363</td>\n",
" <td>0.53819555014995</td>\n",
" <td>1.3458515932154</td>\n",
" <td>-1.11966983471731</td>\n",
" <td>0.175121130008994</td>\n",
" <td>-0.451449182813529</td>\n",
" <td>-0.237033239362776</td>\n",
" <td>-0.0381947870352842</td>\n",
" <td>0.803486924960175</td>\n",
" <td>0.408542360392758</td>\n",
" <td>-0.00943069713232919</td>\n",
" <td>0.79827849458971</td>\n",
" <td>-0.137458079619063</td>\n",
" <td>0.141266983824769</td>\n",
" <td>-0.206009587619756</td>\n",
" <td>0.502292224181569</td>\n",
" <td>0.219422229513348</td>\n",
" <td>0.215153147499206</td>\n",
" <td>69.99</td>\n",
" <td>0</td>\n",
" <td>e333b3c0-83ae-42dc-a865-178496653029</td>\n",
" <td>87b2fbf2-5b7d-479c-85f5-d989bd701f36</td>\n",
" <td>2021-09-01T00:02:00Z</td>\n",
" <td>2023-05-05T08:46:09Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" v1 v2 v3 \\\n",
"0 -1.3598071336738 -0.0727811733098497 2.53634673796914 \n",
"1 1.19185711131486 0.26615071205963 0.16648011335321 \n",
"2 -1.35835406159823 -1.34016307473609 1.77320934263119 \n",
"3 -0.966271711572087 -0.185226008082898 1.79299333957872 \n",
"4 -1.15823309349523 0.877736754848451 1.548717846511 \n",
"\n",
" v4 v5 v6 \\\n",
"0 1.37815522427443 -0.338320769942518 0.462387777762292 \n",
"1 0.448154078460911 0.0600176492822243 -0.0823608088155687 \n",
"2 0.379779593034328 -0.503198133318193 1.80049938079263 \n",
"3 -0.863291275036453 -0.0103088796030823 1.24720316752486 \n",
"4 0.403033933955121 -0.407193377311653 0.0959214624684256 \n",
"\n",
" v7 v8 v9 \\\n",
"0 0.239598554061257 0.0986979012610507 0.363786969611213 \n",
"1 -0.0788029833323113 0.0851016549148104 -0.255425128109186 \n",
"2 0.791460956450422 0.247675786588991 -1.51465432260583 \n",
"3 0.23760893977178 0.377435874652262 -1.38702406270197 \n",
"4 0.592940745385545 -0.270532677192282 0.817739308235294 \n",
"\n",
" v10 v11 v12 \\\n",
"0 0.0907941719789316 -0.551599533260813 -0.617800855762348 \n",
"1 -0.166974414004614 1.61272666105479 1.06523531137287 \n",
"2 0.207642865216696 0.624501459424895 0.066083685268831 \n",
"3 -0.0549519224713749 -0.226487263835401 0.178228225877303 \n",
"4 0.753074431976354 -0.822842877946363 0.53819555014995 \n",
"\n",
" v13 v14 v15 \\\n",
"0 -0.991389847235408 -0.311169353699879 1.46817697209427 \n",
"1 0.48909501589608 -0.143772296441519 0.635558093258208 \n",
"2 0.717292731410831 -0.165945922763554 2.34586494901581 \n",
"3 0.507756869957169 -0.28792374549456 -0.631418117709045 \n",
"4 1.3458515932154 -1.11966983471731 0.175121130008994 \n",
"\n",
" v16 v17 v18 \\\n",
"0 -0.470400525259478 0.207971241929242 0.0257905801985591 \n",
"1 0.463917041022171 -0.114804663102346 -0.183361270123994 \n",
"2 -2.89008319444231 1.10996937869599 -0.121359313195888 \n",
"3 -1.0596472454325 -0.684092786345479 1.96577500349538 \n",
"4 -0.451449182813529 -0.237033239362776 -0.0381947870352842 \n",
"\n",
" v19 v20 v21 \\\n",
"0 0.403992960255733 0.251412098239705 -0.018306777944153 \n",
"1 -0.145783041325259 -0.0690831352230203 -0.225775248033138 \n",
"2 -2.26185709530414 0.524979725224404 0.247998153469754 \n",
"3 -1.2326219700892 -0.208037781160366 -0.108300452035545 \n",
"4 0.803486924960175 0.408542360392758 -0.00943069713232919 \n",
"\n",
" v22 v23 v24 \\\n",
"0 0.277837575558899 -0.110473910188767 0.0669280749146731 \n",
"1 -0.638671952771851 0.101288021253234 -0.339846475529127 \n",
"2 0.771679401917229 0.909412262347719 -0.689280956490685 \n",
"3 0.00527359678253453 -0.190320518742841 -1.17557533186321 \n",
"4 0.79827849458971 -0.137458079619063 0.141266983824769 \n",
"\n",
" v25 v26 v27 \\\n",
"0 0.128539358273528 -0.189114843888824 0.133558376740387 \n",
"1 0.167170404418143 0.125894532368176 -0.00898309914322813 \n",
"2 -0.327641833735251 -0.139096571514147 -0.0553527940384261 \n",
"3 0.647376034602038 -0.221928844458407 0.0627228487293033 \n",
"4 -0.206009587619756 0.502292224181569 0.219422229513348 \n",
"\n",
" v28 amount EVENT_LABEL \\\n",
"0 -0.0210530534538215 149.62 0 \n",
"1 0.0147241691924927 2.69 0 \n",
"2 -0.0597518405929204 378.66 0 \n",
"3 0.0614576285006353 123.5 0 \n",
"4 0.215153147499206 69.99 0 \n",
"\n",
" EVENT_ID ENTITY_ID \\\n",
"0 f8e77dc0-44ef-490c-b0de-8b4054b5a031 266103ff-71f2-4057-981d-a54821367237 \n",
"1 b557449e-6b35-4be0-991e-337f764f5e21 f85083b2-d31f-4b9e-9d49-eb85c0476f6e \n",
"2 d78d879c-eb7c-455d-8fde-6b1205080a4a 237ca488-c695-402c-b30f-0544554ea96c \n",
"3 ef448a36-2763-449c-a54a-a9e05af20967 9964b305-b591-4ed0-bff1-8adca81d0194 \n",
"4 e333b3c0-83ae-42dc-a865-178496653029 87b2fbf2-5b7d-479c-85f5-d989bd701f36 \n",
"\n",
" EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n",
"0 2021-09-01T00:00:00Z 2023-05-05T08:46:09Z user \n",
"1 2021-09-01T00:00:00Z 2023-05-05T08:46:09Z user \n",
"2 2021-09-01T00:01:00Z 2023-05-05T08:46:09Z user \n",
"3 2021-09-01T00:01:00Z 2023-05-05T08:46:09Z user \n",
"4 2021-09-01T00:02:00Z 2023-05-05T08:46:09Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"35\n",
"(227845, 35)\n",
"Test set: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>v1</th>\n",
" <th>v2</th>\n",
" <th>v3</th>\n",
" <th>v4</th>\n",
" <th>v5</th>\n",
" <th>v6</th>\n",
" <th>v7</th>\n",
" <th>v8</th>\n",
" <th>v9</th>\n",
" <th>v10</th>\n",
" <th>v11</th>\n",
" <th>v12</th>\n",
" <th>v13</th>\n",
" <th>v14</th>\n",
" <th>v15</th>\n",
" <th>v16</th>\n",
" <th>v17</th>\n",
" <th>v18</th>\n",
" <th>v19</th>\n",
" <th>v20</th>\n",
" <th>v21</th>\n",
" <th>v22</th>\n",
" <th>v23</th>\n",
" <th>v24</th>\n",
" <th>v25</th>\n",
" <th>v26</th>\n",
" <th>v27</th>\n",
" <th>v28</th>\n",
" <th>amount</th>\n",
" <th>EVENT_ID</th>\n",
" <th>ENTITY_ID</th>\n",
" <th>EVENT_TIMESTAMP</th>\n",
" <th>ENTITY_TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>227845</th>\n",
" <td>1.91402682161454</td>\n",
" <td>-0.490067987909997</td>\n",
" <td>-0.326111312515118</td>\n",
" <td>0.604710739174721</td>\n",
" <td>-0.8501359998436</td>\n",
" <td>-0.736318677031096</td>\n",
" <td>-0.524057962475328</td>\n",
" <td>-0.0886141066361987</td>\n",
" <td>1.09112510472248</td>\n",
" <td>0.093484357816225</td>\n",
" <td>-0.892304625856107</td>\n",
" <td>0.0272205159068718</td>\n",
" <td>-0.243790209618721</td>\n",
" <td>0.0317740067189187</td>\n",
" <td>0.900623897113791</td>\n",
" <td>0.536032161644219</td>\n",
" <td>-0.648408094097169</td>\n",
" <td>0.183072340001028</td>\n",
" <td>-0.48632249422331</td>\n",
" <td>-0.13957876335222</td>\n",
" <td>0.210958428878652</td>\n",
" <td>0.639337879054097</td>\n",
" <td>0.147522551988298</td>\n",
" <td>0.0736542664022496</td>\n",
" <td>-0.318378246601246</td>\n",
" <td>0.350612262707235</td>\n",
" <td>-0.0238434747433154</td>\n",
" <td>-0.0371393315055126</td>\n",
" <td>50</td>\n",
" <td>bd64c6f1-1c1d-49ea-8561-6cc56bd2a173</td>\n",
" <td>ee6232a9-6ba4-4654-b406-72e582f01031</td>\n",
" <td>2021-12-10T20:48:00Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227846</th>\n",
" <td>2.15269624649984</td>\n",
" <td>-0.036160786158066</td>\n",
" <td>-2.23181098049803</td>\n",
" <td>0.0917658435583919</td>\n",
" <td>0.537612206488446</td>\n",
" <td>-1.36810250972644</td>\n",
" <td>0.613326738349479</td>\n",
" <td>-0.455251954849699</td>\n",
" <td>0.29181359004335</td>\n",
" <td>0.253161344559488</td>\n",
" <td>-1.50188197076942</td>\n",
" <td>-0.870607641524177</td>\n",
" <td>-1.44173756499372</td>\n",
" <td>0.988756626201074</td>\n",
" <td>0.496349234837293</td>\n",
" <td>-0.0686989613348823</td>\n",
" <td>-0.454073497932566</td>\n",
" <td>-0.299095262736551</td>\n",
" <td>0.267443131415241</td>\n",
" <td>-0.275777914750361</td>\n",
" <td>0.0171533555339963</td>\n",
" <td>0.0632416225359206</td>\n",
" <td>-0.0345611249491173</td>\n",
" <td>-0.626866212626912</td>\n",
" <td>0.249213129413917</td>\n",
" <td>0.773930519516097</td>\n",
" <td>-0.137114784582898</td>\n",
" <td>-0.0906106088420727</td>\n",
" <td>14.95</td>\n",
" <td>6728a9b7-ab9c-404e-93a8-fcf76baf7e8e</td>\n",
" <td>3dc93b80-f110-4355-b516-5174a0cd214d</td>\n",
" <td>2021-12-10T20:49:00Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227847</th>\n",
" <td>-4.03479516717275</td>\n",
" <td>2.30507905571504</td>\n",
" <td>-1.46169292457709</td>\n",
" <td>-0.729887055238227</td>\n",
" <td>-1.5287503399573</td>\n",
" <td>-1.22567909778369</td>\n",
" <td>-0.893353679497868</td>\n",
" <td>1.62252199369554</td>\n",
" <td>1.29199841774415</td>\n",
" <td>-0.0409558359937061</td>\n",
" <td>-0.971425287697512</td>\n",
" <td>0.574743695630458</td>\n",
" <td>0.155656078919204</td>\n",
" <td>-0.729054997889385</td>\n",
" <td>0.477438947999659</td>\n",
" <td>1.06171851569252</td>\n",
" <td>0.93469475367536</td>\n",
" <td>0.403768792198479</td>\n",
" <td>-0.494929851777981</td>\n",
" <td>-0.0810925858921718</td>\n",
" <td>-0.392556502541116</td>\n",
" <td>-0.78759906251576</td>\n",
" <td>0.343467795972994</td>\n",
" <td>-0.0903313999840935</td>\n",
" <td>0.248286972151669</td>\n",
" <td>-0.238523845342424</td>\n",
" <td>0.26648354183946</td>\n",
" <td>-0.0622361634691654</td>\n",
" <td>7.7</td>\n",
" <td>1f4a3cae-3a95-48b7-8cc9-dd2258689f37</td>\n",
" <td>58879cd9-4053-4e16-9144-3b04c276f74e</td>\n",
" <td>2021-12-10T20:49:00Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227848</th>\n",
" <td>-1.66874106862583</td>\n",
" <td>1.16805471760364</td>\n",
" <td>0.249642461553748</td>\n",
" <td>-1.26849748925032</td>\n",
" <td>0.785922573014156</td>\n",
" <td>-0.663958562166729</td>\n",
" <td>0.859432973616895</td>\n",
" <td>0.0681106263347446</td>\n",
" <td>-0.144183044927318</td>\n",
" <td>0.0432880841287975</td>\n",
" <td>0.542013736060061</td>\n",
" <td>1.00202450469061</td>\n",
" <td>0.400759595743433</td>\n",
" <td>0.136412487776037</td>\n",
" <td>-1.28964902448879</td>\n",
" <td>0.276827961550432</td>\n",
" <td>-0.868491702025561</td>\n",
" <td>-0.366839507131127</td>\n",
" <td>-0.187391599008302</td>\n",
" <td>-0.0335233340620367</td>\n",
" <td>-0.247543775399679</td>\n",
" <td>-0.592536769878023</td>\n",
" <td>-0.286693549546811</td>\n",
" <td>-0.378855664973759</td>\n",
" <td>-0.0774289041638705</td>\n",
" <td>0.0676084004301294</td>\n",
" <td>-0.27896200360197</td>\n",
" <td>-0.0641926690992577</td>\n",
" <td>6.99</td>\n",
" <td>930cd5cb-b226-4af5-8dda-574340d05a12</td>\n",
" <td>bb616582-e509-4c77-9154-755ca81039c4</td>\n",
" <td>2021-12-10T20:49:00Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227849</th>\n",
" <td>-0.550678353341949</td>\n",
" <td>-0.429004102182237</td>\n",
" <td>-1.29189255347072</td>\n",
" <td>-0.414409226593379</td>\n",
" <td>-0.292228538671312</td>\n",
" <td>0.071842939235058</td>\n",
" <td>2.42606795091335</td>\n",
" <td>-0.212729758223082</td>\n",
" <td>0.412374372851086</td>\n",
" <td>-1.93996940549555</td>\n",
" <td>-1.81011838293809</td>\n",
" <td>-1.22351031687552</td>\n",
" <td>-1.32491464932768</td>\n",
" <td>-1.46239178995552</td>\n",
" <td>-0.31164055759838</td>\n",
" <td>0.506707760378257</td>\n",
" <td>0.739932584638577</td>\n",
" <td>0.892422017204659</td>\n",
" <td>0.195042529037103</td>\n",
" <td>0.791126747715284</td>\n",
" <td>0.00303193944814891</td>\n",
" <td>-0.645782978858753</td>\n",
" <td>0.877016475964068</td>\n",
" <td>-1.22852893747944</td>\n",
" <td>-0.0362812174160739</td>\n",
" <td>-0.110609895882901</td>\n",
" <td>-0.0983803135271981</td>\n",
" <td>0.0959849443846813</td>\n",
" <td>460.71</td>\n",
" <td>2e909126-def3-4d82-9485-03798817c942</td>\n",
" <td>88ea4bc9-29fd-4302-913d-e6788cb7e6ab</td>\n",
" <td>2021-12-10T20:50:00Z</td>\n",
" <td>user</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" v1 v2 v3 \\\n",
"227845 1.91402682161454 -0.490067987909997 -0.326111312515118 \n",
"227846 2.15269624649984 -0.036160786158066 -2.23181098049803 \n",
"227847 -4.03479516717275 2.30507905571504 -1.46169292457709 \n",
"227848 -1.66874106862583 1.16805471760364 0.249642461553748 \n",
"227849 -0.550678353341949 -0.429004102182237 -1.29189255347072 \n",
"\n",
" v4 v5 v6 \\\n",
"227845 0.604710739174721 -0.8501359998436 -0.736318677031096 \n",
"227846 0.0917658435583919 0.537612206488446 -1.36810250972644 \n",
"227847 -0.729887055238227 -1.5287503399573 -1.22567909778369 \n",
"227848 -1.26849748925032 0.785922573014156 -0.663958562166729 \n",
"227849 -0.414409226593379 -0.292228538671312 0.071842939235058 \n",
"\n",
" v7 v8 v9 \\\n",
"227845 -0.524057962475328 -0.0886141066361987 1.09112510472248 \n",
"227846 0.613326738349479 -0.455251954849699 0.29181359004335 \n",
"227847 -0.893353679497868 1.62252199369554 1.29199841774415 \n",
"227848 0.859432973616895 0.0681106263347446 -0.144183044927318 \n",
"227849 2.42606795091335 -0.212729758223082 0.412374372851086 \n",
"\n",
" v10 v11 v12 \\\n",
"227845 0.093484357816225 -0.892304625856107 0.0272205159068718 \n",
"227846 0.253161344559488 -1.50188197076942 -0.870607641524177 \n",
"227847 -0.0409558359937061 -0.971425287697512 0.574743695630458 \n",
"227848 0.0432880841287975 0.542013736060061 1.00202450469061 \n",
"227849 -1.93996940549555 -1.81011838293809 -1.22351031687552 \n",
"\n",
" v13 v14 v15 \\\n",
"227845 -0.243790209618721 0.0317740067189187 0.900623897113791 \n",
"227846 -1.44173756499372 0.988756626201074 0.496349234837293 \n",
"227847 0.155656078919204 -0.729054997889385 0.477438947999659 \n",
"227848 0.400759595743433 0.136412487776037 -1.28964902448879 \n",
"227849 -1.32491464932768 -1.46239178995552 -0.31164055759838 \n",
"\n",
" v16 v17 v18 \\\n",
"227845 0.536032161644219 -0.648408094097169 0.183072340001028 \n",
"227846 -0.0686989613348823 -0.454073497932566 -0.299095262736551 \n",
"227847 1.06171851569252 0.93469475367536 0.403768792198479 \n",
"227848 0.276827961550432 -0.868491702025561 -0.366839507131127 \n",
"227849 0.506707760378257 0.739932584638577 0.892422017204659 \n",
"\n",
" v19 v20 v21 \\\n",
"227845 -0.48632249422331 -0.13957876335222 0.210958428878652 \n",
"227846 0.267443131415241 -0.275777914750361 0.0171533555339963 \n",
"227847 -0.494929851777981 -0.0810925858921718 -0.392556502541116 \n",
"227848 -0.187391599008302 -0.0335233340620367 -0.247543775399679 \n",
"227849 0.195042529037103 0.791126747715284 0.00303193944814891 \n",
"\n",
" v22 v23 v24 \\\n",
"227845 0.639337879054097 0.147522551988298 0.0736542664022496 \n",
"227846 0.0632416225359206 -0.0345611249491173 -0.626866212626912 \n",
"227847 -0.78759906251576 0.343467795972994 -0.0903313999840935 \n",
"227848 -0.592536769878023 -0.286693549546811 -0.378855664973759 \n",
"227849 -0.645782978858753 0.877016475964068 -1.22852893747944 \n",
"\n",
" v25 v26 v27 \\\n",
"227845 -0.318378246601246 0.350612262707235 -0.0238434747433154 \n",
"227846 0.249213129413917 0.773930519516097 -0.137114784582898 \n",
"227847 0.248286972151669 -0.238523845342424 0.26648354183946 \n",
"227848 -0.0774289041638705 0.0676084004301294 -0.27896200360197 \n",
"227849 -0.0362812174160739 -0.110609895882901 -0.0983803135271981 \n",
"\n",
" v28 amount EVENT_ID \\\n",
"227845 -0.0371393315055126 50 bd64c6f1-1c1d-49ea-8561-6cc56bd2a173 \n",
"227846 -0.0906106088420727 14.95 6728a9b7-ab9c-404e-93a8-fcf76baf7e8e \n",
"227847 -0.0622361634691654 7.7 1f4a3cae-3a95-48b7-8cc9-dd2258689f37 \n",
"227848 -0.0641926690992577 6.99 930cd5cb-b226-4af5-8dda-574340d05a12 \n",
"227849 0.0959849443846813 460.71 2e909126-def3-4d82-9485-03798817c942 \n",
"\n",
" ENTITY_ID EVENT_TIMESTAMP ENTITY_TYPE \n",
"227845 ee6232a9-6ba4-4654-b406-72e582f01031 2021-12-10T20:48:00Z user \n",
"227846 3dc93b80-f110-4355-b516-5174a0cd214d 2021-12-10T20:49:00Z user \n",
"227847 58879cd9-4053-4e16-9144-3b04c276f74e 2021-12-10T20:49:00Z user \n",
"227848 bb616582-e509-4c77-9154-755ca81039c4 2021-12-10T20:49:00Z user \n",
"227849 88ea4bc9-29fd-4302-913d-e6788cb7e6ab 2021-12-10T20:50:00Z user "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(56962, 33)\n",
"Test scores\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>EVENT_LABEL</th>\n",
" <th>EVENT_ID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>227845</th>\n",
" <td>0</td>\n",
" <td>bd64c6f1-1c1d-49ea-8561-6cc56bd2a173</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227846</th>\n",
" <td>0</td>\n",
" <td>6728a9b7-ab9c-404e-93a8-fcf76baf7e8e</td>\n",
" </tr>\n",
" <tr>\n",
" <th>227847</th>\n",
" <td>0</td>\n",
" <td>1f4a3cae-3a95-48b7-8cc9-dd2258689f37</td>\n",
" </tr>\n
SYMBOL INDEX (125 symbols across 11 files)
FILE: scripts/reproducibility/afd/create_afd_resources.py
function afd_train_model_demo (line 32) | def afd_train_model_demo(config):
FILE: scripts/reproducibility/afd/score_afd_model.py
function create_outcomes (line 31) | def create_outcomes(outcomes):
function create_rules (line 40) | def create_rules(score_cuts, outcomes):
function ast_with_nan (line 88) | def ast_with_nan(x):
function afd_train_model_demo (line 95) | def afd_train_model_demo():
FILE: scripts/reproducibility/autogluon/benchmark_ag.py
function run_ag (line 29) | def run_ag(dataset, base_path, time_limit=3600, presets=None, hyperparam...
FILE: scripts/reproducibility/autosklearn/benchmark_autosklearn.py
function load_data (line 77) | def load_data(dataset_path):
function get_recall (line 101) | def get_recall(fpr, tpr, fpr_target=0.01):
function run_autosklearn (line 105) | def run_autosklearn(dataset_path):
FILE: scripts/reproducibility/benchmark_utils.py
function load_data (line 22) | def load_data(dataset, base_path):
function get_recall (line 45) | def get_recall(fpr, tpr, fpr_target=0.01):
FILE: scripts/reproducibility/h2o/benchmark_h2o.py
function run_h2o (line 29) | def run_h2o(dataset, base_path, connect_url=None, time_limit=None, inclu...
FILE: scripts/reproducibility/label-noise/load_fdb_datasets.py
function noise_amount (line 20) | def noise_amount(df):
function noise_rate (line 23) | def noise_rate(df):
function type_1_noise_amount (line 29) | def type_1_noise_amount(df):
function type_2_noise_amount (line 34) | def type_2_noise_amount(df):
function actual_legit_amount (line 39) | def actual_legit_amount(df):
function observed_legit_amount (line 42) | def observed_legit_amount(df):
function actual_fraud_amount (line 45) | def actual_fraud_amount(df):
function observed_fraud_amount (line 48) | def observed_fraud_amount(df):
function actual_fraud_rate (line 51) | def actual_fraud_rate(df):
function observed_fraud_rate (line 57) | def observed_fraud_rate(df):
function type_1_noise_rate (line 63) | def type_1_noise_rate(df):
function type_2_noise_rate (line 69) | def type_2_noise_rate(df):
function prepare_data_fdb (line 75) | def prepare_data_fdb(key, drop_text_enr_features=True):
function add_noise (line 212) | def add_noise(df, noise_type, noise_amount, *, time_index=None, features...
function train_valid_split (line 273) | def train_valid_split(df, split=0.7, shuffle=True, sort_key='creation_da...
function prepare_noisy_dataset (line 285) | def prepare_noisy_dataset(key, noise_type, noise_amount, split=0.7, shuf...
function dataset_stats (line 345) | def dataset_stats(dataset):
FILE: scripts/reproducibility/label-noise/micro_models.py
class MicroModelError (line 6) | class MicroModelError(Exception):
method __init__ (line 10) | def __init__(self, error_message):
class MicroModel (line 14) | class MicroModel:
method __init__ (line 20) | def __init__(self, ModelClass, *args, **kwargs):
method set_thresh (line 28) | def set_thresh(self, thresh):
method fit (line 32) | def fit(self, x, y, *args, **kwargs):
method predict_proba (line 36) | def predict_proba(self, x, *args, **kwargs):
method predict (line 43) | def predict(self, x):
class MicroModelEnsemble (line 54) | class MicroModelEnsemble:
method __init__ (line 59) | def __init__(self, ModelClass, num_clfs=16, score_type='preds_avg', *a...
method fit (line 85) | def fit(self, x, y, *args, **kwargs):
method predict_proba (line 103) | def predict_proba(self, x, *args, **kwargs):
method predict (line 117) | def predict(self, x, threshold=0.5, *args, **kwargs):
method filter_noise (line 123) | def filter_noise(self, x, y, pulearning=True, threshold=0.5):
method clean_noise (line 136) | def clean_noise(self, x, y, pulearning=True, threshold=0.5):
class MicroModelCleaner (line 155) | class MicroModelCleaner:
method __init__ (line 161) | def __init__(self, ModelClass, strategy='filter', pulearning=True, num...
method fit (line 181) | def fit(self, x, y, *args, **kwargs):
method predict (line 192) | def predict(self, x, *args, **kwargs):
method predict_proba (line 195) | def predict_proba(self, x, *args, **kwargs):
FILE: src/fdb/datasets.py
class FraudDatasetBenchmark (line 6) | class FraudDatasetBenchmark(ABC):
method __init__ (line 7) | def __init__(
method train (line 23) | def train(self):
method test (line 27) | def test(self):
method test_labels (line 31) | def test_labels(self):
method eval (line 34) | def eval(self, y_pred):
FILE: src/fdb/preprocessing.py
class BasePreProcessor (line 51) | class BasePreProcessor(ABC):
method __init__ (line 52) | def __init__(
method _download_kaggle_data_from_competetions (line 93) | def _download_kaggle_data_from_competetions(self):
method _download_kaggle_data_from_datasets_with_given_filename (line 101) | def _download_kaggle_data_from_datasets_with_given_filename(self):
method _download_kaggle_data_from_datasets_containing_single_file (line 114) | def _download_kaggle_data_from_datasets_containing_single_file(self):
method download_kaggle_data (line 122) | def download_kaggle_data(self):
method load_data (line 150) | def load_data(self):
method timestamp_col (line 156) | def timestamp_col(self):
method label_col (line 160) | def label_col(self):
method event_id_col (line 167) | def event_id_col(self):
method entity_id_col (line 171) | def entity_id_col(self):
method standardize_timestamp_col (line 174) | def standardize_timestamp_col(self):
method standardize_label_col (line 191) | def standardize_label_col(self):
method standardize_event_id_col (line 195) | def standardize_event_id_col(self):
method standardize_entity_id_col (line 204) | def standardize_entity_id_col(self):
method rename_features (line 211) | def rename_features(self):
method subset_features (line 215) | def subset_features(self):
method drop_features (line 219) | def drop_features(self):
method add_meta_data (line 222) | def add_meta_data(self):
method sort_by_timestamp (line 226) | def sort_by_timestamp(self):
method lower_case_col_names (line 229) | def lower_case_col_names(self):
method preprocess (line 232) | def preprocess(self):
method train_test_split (line 245) | def train_test_split(self):
class FakejobPreProcessor (line 264) | class FakejobPreProcessor(BasePreProcessor):
method __init__ (line 265) | def __init__(self, **kw):
class VehicleloanPreProcessor (line 269) | class VehicleloanPreProcessor(BasePreProcessor):
method __init__ (line 270) | def __init__(self, **kw):
class MalurlPreProcessor (line 274) | class MalurlPreProcessor(BasePreProcessor):
method __init__ (line 280) | def __init__(self, **kw):
method standardize_label_col (line 283) | def standardize_label_col(self):
method add_dummy_col (line 294) | def add_dummy_col(self):
method preprocess (line 297) | def preprocess(self):
class IEEEPreProcessor (line 301) | class IEEEPreProcessor(BasePreProcessor):
method __init__ (line 312) | def __init__(self, **kw):
method _dtypes_cols (line 316) | def _dtypes_cols():
method load_data (line 372) | def load_data(self):
method normalization (line 396) | def normalization(self):
method standardize_entity_id_col (line 402) | def standardize_entity_id_col(self):
method _add_seconds (line 412) | def _add_seconds(x):
method standardize_timestamp_col (line 419) | def standardize_timestamp_col(self):
method subset_features (line 425) | def subset_features(self):
method preprocess (line 436) | def preprocess(self):
class CCFraudPreProcessor (line 450) | class CCFraudPreProcessor(BasePreProcessor):
method __init__ (line 451) | def __init__(self, **kw):
method _add_minutes (line 455) | def _add_minutes(x):
method standardize_timestamp_col (line 461) | def standardize_timestamp_col(self):
class FraudecomPreProcessor (line 467) | class FraudecomPreProcessor(BasePreProcessor):
method __init__ (line 468) | def __init__(self, ip_address_col, signup_time_col, **kw):
method _add_years (line 474) | def _add_years(init_time):
method standardize_timestamp_col (line 481) | def standardize_timestamp_col(self):
method process_ip (line 490) | def process_ip(self):
method create_time_since_signup (line 497) | def create_time_since_signup(self):
method preprocess (line 502) | def preprocess(self):
class SparknovPreProcessor (line 517) | class SparknovPreProcessor(BasePreProcessor):
method __init__ (line 518) | def __init__(self, **kw):
method load_data (line 521) | def load_data(self):
method _add_months (line 538) | def _add_months(x):
method standardize_timestamp_col (line 545) | def standardize_timestamp_col(self):
method standardize_entity_id_col (line 551) | def standardize_entity_id_col(self):
method train_test_split (line 558) | def train_test_split(self):
class TwitterbotPreProcessor (line 574) | class TwitterbotPreProcessor(BasePreProcessor):
method __init__ (line 575) | def __init__(self, **kw):
method standardize_label_col (line 578) | def standardize_label_col(self):
class IPBlocklistPreProcessor (line 588) | class IPBlocklistPreProcessor(BasePreProcessor):
method __init__ (line 598) | def __init__(self, version, **kw):
method load_data (line 602) | def load_data(self):
method add_dummy_col (line 628) | def add_dummy_col(self):
method train_test_split (line 631) | def train_test_split(self):
method preprocess (line 635) | def preprocess(self):
FILE: src/fdb/preprocessing_objects.py
function load_data (line 4) | def load_data(key, load_pre_downloaded, delete_downloaded, add_random_va...
Condensed preview: 39 files, each showing path, character count, and a content snippet.
[
{
"path": "CODE_OF_CONDUCT.md",
"chars": 309,
"preview": "## Code of Conduct\nThis project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-condu"
},
{
"path": "CONTRIBUTING.md",
"chars": 3160,
"preview": "# Contributing Guidelines\n\nThank you for your interest in contributing to our project. Whether it's a bug report, new fe"
},
{
"path": "LICENSE",
"chars": 1288,
"preview": "MIT License\n\nCopyright (c) 2021-2022 Prince Grover\nCopyright (c) 2021-2022 Zheng Li\nCopyright (c) 2022 Jianbo Liu\nCopyri"
},
{
"path": "README.md",
"chars": 20016,
"preview": "# FDB: Fraud Dataset Benchmark\n\n*By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin T"
},
{
"path": "scripts/examples/Test_FDB_Loader.ipynb",
"chars": 287022,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": [\n "
},
{
"path": "scripts/reproducibility/afd/README.md",
"chars": 2082,
"preview": "## Steps to reproduce AFD models\nAmazon Fraud Detector (AFD) models can be either run via AWS Console or using API calls"
},
{
"path": "scripts/reproducibility/afd/configs/CreditCardFraudDetection.json",
"chars": 3967,
"preview": "{\n \"dataset\": \"Credit Card Fraud Detection\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"v1\",\n"
},
{
"path": "scripts/reproducibility/afd/configs/FakeJobPostingPrediction.json",
"chars": 2517,
"preview": "{\n \"dataset\": \"Fake Job Posting Prediction\", \n \"variable_mappings\": [\n {\n \"variable_name\": \"titl"
},
{
"path": "scripts/reproducibility/afd/configs/Fraudecommerce.json",
"chars": 1023,
"preview": "{\n \"dataset\": \"Fraud ecommerce\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"purchase_value\",\n"
},
{
"path": "scripts/reproducibility/afd/configs/IEEECISFraudDetection.json",
"chars": 9041,
"preview": "{\n \"dataset\": \"IEEE-CIS Fraud Detection\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"transact"
},
{
"path": "scripts/reproducibility/afd/configs/IPBlocklist.json",
"chars": 462,
"preview": "{\n \"dataset\": \"IP-BlockList\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"ip\",\n \"va"
},
{
"path": "scripts/reproducibility/afd/configs/MaliciousURL.json",
"chars": 490,
"preview": "{\n \"dataset\": \"Malicious URLs Dataset\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"url\",\n "
},
{
"path": "scripts/reproducibility/afd/configs/SimulatedCreditCardTransactionsSparkov.json",
"chars": 2551,
"preview": "{\n \"dataset\": \"Simulated Credit Card Transactions generated using Sparkov\",\n \"variable_mappings\": [\n {\n "
},
{
"path": "scripts/reproducibility/afd/configs/TwitterBotAccounts.json",
"chars": 2527,
"preview": "{\n \"dataset\": \"Twitter Bots Accounts\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"default_pro"
},
{
"path": "scripts/reproducibility/afd/configs/VehicleLoanDefaultPrediction.json",
"chars": 5738,
"preview": "{\n \"dataset\": \"Vehicle Loan Default Prediction\",\n \"variable_mappings\": [\n {\n \"variable_name\": \"d"
},
{
"path": "scripts/reproducibility/afd/create_afd_resources.py",
"chars": 7544,
"preview": "# TO BE UPDATED BY USER\nIAM_ROLE = \"<IAM ROLE with acceess to S3 bucket containing the data and access to Amazon Fraud D"
},
{
"path": "scripts/reproducibility/afd/score_afd_model.py",
"chars": 9111,
"preview": "# TO BE UPDATED BY USER\nIAM_ROLE = \"<IAM ROLE with acceess to S3 bucket containing the data and access to Amazon Fraud D"
},
{
"path": "scripts/reproducibility/autogluon/README.md",
"chars": 304,
"preview": " - benchmark_ag.py: a script for autogluon benchmarking\n - example-ag-ieeecis.ipynb: an example notebook using benchmark"
},
{
"path": "scripts/reproducibility/autogluon/benchmark_ag.py",
"chars": 2971,
"preview": "import pandas as pd\nimport os\nimport gc\nimport joblib\nimport datetime\n\nimport matplotlib as mpl\nfrom sklearn.metrics imp"
},
{
"path": "scripts/reproducibility/autogluon/example-ag-ieeecis.ipynb",
"chars": 97746,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"id\": \"7d350d0d\",\n \"metadata\": {},\n \"outputs\":"
},
{
"path": "scripts/reproducibility/autosklearn/README.md",
"chars": 879,
"preview": "## Steps to reproduce Auto-sklearn models\n\n\n1. Load and save the datasets locally using [FDB Loader](../../examples/Test"
},
{
"path": "scripts/reproducibility/autosklearn/benchmark_autosklearn.py",
"chars": 5336,
"preview": "\nimport json\nimport joblib\nimport datetime\nimport numpy as np\nimport pandas as pd\nimport os, sys, shutil\n\nfrom autosklea"
},
{
"path": "scripts/reproducibility/benchmark_utils.py",
"chars": 1448,
"preview": "import numpy as np\nimport pandas as pd\nimport os\n\nimport matplotlib as mpl\n\nmpl.rcParams['figure.dpi'] = 150\npd.set_opti"
},
{
"path": "scripts/reproducibility/h2o/README.md",
"chars": 292,
"preview": "- benchmark_h2o.py: a script for h2o benchmarking\n- example-h2o-ieeecis.ipynb: an example notebook using benchmark_h2o.p"
},
{
"path": "scripts/reproducibility/h2o/benchmark_h2o.py",
"chars": 3597,
"preview": "import pandas as pd\nimport os\nimport gc\nimport joblib\n\nimport matplotlib as mpl\nfrom sklearn.metrics import roc_auc_scor"
},
{
"path": "scripts/reproducibility/h2o/example-h2o-ieeecis.ipynb",
"chars": 85367,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"id\": \"afc2eecf\",\n \"metadata\": {},\n \"outputs\":"
},
{
"path": "scripts/reproducibility/label-noise/benchmark_experiments.ipynb",
"chars": 20437,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"id\": \"c77e5eb5\",\n \"metadata\": {},\n \"output"
},
{
"path": "scripts/reproducibility/label-noise/feature_dict.py",
"chars": 5505,
"preview": "feature_dict = {\n 'ieeecis': {\n 'transactionamt': 'numeric',\n 'productcd': 'categorical',\n 'card1': 'numeric',"
},
{
"path": "scripts/reproducibility/label-noise/load_fdb_datasets.py",
"chars": 14703,
"preview": "import os\nimport re\nimport json\nimport pandas as pd\nimport numpy as np\nimport warnings\nfrom datetime import datetime\n\nfr"
},
{
"path": "scripts/reproducibility/label-noise/micro_models.py",
"chars": 8233,
"preview": "import logging\nimport pandas as pd\nimport numpy as np\n\n\nclass MicroModelError(Exception):\n \"\"\"\n basic exception ty"
},
{
"path": "setup.py",
"chars": 655,
"preview": "import os\nfrom glob import glob\n\nfrom setuptools import find_packages, setup\n\n\nsetup(\n name='fraud_dataset_benchmark'"
},
{
"path": "src/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "src/fdb/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "src/fdb/datasets.py",
"chars": 1300,
"preview": "from abc import abstractmethod, ABC\nfrom fdb.preprocessing import *\nfrom fdb.preprocessing_objects import load_data\nfrom"
},
{
"path": "src/fdb/kaggle_configs.py",
"chars": 1861,
"preview": "KAGGLE_CONFIGS = {\n\n \"fakejob\":\n {\n \"owner\": \"shivamb\",\n \"dataset\": \"real-or-fake-fake-jobposting-pr"
},
{
"path": "src/fdb/preprocessing.py",
"chars": 26223,
"preview": "\n\nimport os\nimport re\nimport shutil\nimport kaggle\nimport pkgutil\nimport requests\nimport zipfile\nimport numpy as np\nfrom "
},
{
"path": "src/fdb/preprocessing_objects.py",
"chars": 2949,
"preview": "from fdb.preprocessing import *\n\n\ndef load_data(key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_n"
},
{
"path": "src/fdb/versioned_datasets/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "src/fdb/versioned_datasets/ipblock/__init__.py",
"chars": 0,
"preview": ""
}
]
About this extraction
This page contains the full source code of the amazon-science/fraud-dataset-benchmark GitHub repository, extracted and formatted as plain text. The extraction includes 39 files (623.7 KB), approximately 219.8k tokens, and a symbol index with 125 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract.