Repository: amazon-science/fraud-dataset-benchmark Branch: main Commit: f100cb829599 Files: 39 Total size: 623.7 KB Directory structure: gitextract_sn16q5ml/ ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── scripts/ │ ├── examples/ │ │ └── Test_FDB_Loader.ipynb │ └── reproducibility/ │ ├── afd/ │ │ ├── README.md │ │ ├── configs/ │ │ │ ├── CreditCardFraudDetection.json │ │ │ ├── FakeJobPostingPrediction.json │ │ │ ├── Fraudecommerce.json │ │ │ ├── IEEECISFraudDetection.json │ │ │ ├── IPBlocklist.json │ │ │ ├── MaliciousURL.json │ │ │ ├── SimulatedCreditCardTransactionsSparkov.json │ │ │ ├── TwitterBotAccounts.json │ │ │ └── VehicleLoanDefaultPrediction.json │ │ ├── create_afd_resources.py │ │ └── score_afd_model.py │ ├── autogluon/ │ │ ├── README.md │ │ ├── benchmark_ag.py │ │ └── example-ag-ieeecis.ipynb │ ├── autosklearn/ │ │ ├── README.md │ │ └── benchmark_autosklearn.py │ ├── benchmark_utils.py │ ├── h2o/ │ │ ├── README.md │ │ ├── benchmark_h2o.py │ │ └── example-h2o-ieeecis.ipynb │ └── label-noise/ │ ├── benchmark_experiments.ipynb │ ├── feature_dict.py │ ├── load_fdb_datasets.py │ └── micro_models.py ├── setup.py └── src/ ├── __init__.py └── fdb/ ├── __init__.py ├── datasets.py ├── kaggle_configs.py ├── preprocessing.py ├── preprocessing_objects.py └── versioned_datasets/ ├── __init__.py └── ipblock/ └── __init__.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: CODE_OF_CONDUCT.md ================================================ ## Code of Conduct This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact opensource-codeofconduct@amazon.com with any additional questions or comments. 
================================================ FILE: CONTRIBUTING.md ================================================ # Contributing Guidelines Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional documentation, we greatly value feedback and contributions from our community. Please read through this document before submitting any issues or pull requests to ensure we have all the necessary information to effectively respond to your bug report or contribution. ## Reporting Bugs/Feature Requests We welcome you to use the GitHub issue tracker to report bugs or suggest features. When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: * A reproducible test case or series of steps * The version of our code being used * Any modifications you've made relevant to the bug * Anything unusual about your environment or deployment ## Contributing via Pull Requests Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 1. You are working against the latest source on the *main* branch. 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. To send us a pull request, please: 1. Fork the repository. 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 3. Ensure local tests pass. 4. Commit to your fork using clear commit messages. 5. Send us a pull request, answering any default questions in the pull request interface. 6. 
Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). ## Finding contributions to work on Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. ## Code of Conduct This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact opensource-codeofconduct@amazon.com with any additional questions or comments. ## Security issue notifications If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue. ## Licensing See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 
================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2021-2022 Prince Grover Copyright (c) 2021-2022 Zheng Li Copyright (c) 2022 Jianbo Liu Copyright (c) 2022 Jakub Zablocki Copyright (c) 2022 Jianbo Liu Copyright (c) 2022 Hao Zhou Copyright (c) 2022 Julia Xu Copyright (c) 2022 Anqi Cheng Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
================================================
FILE: README.md
================================================
# FDB: Fraud Dataset Benchmark

*By [Prince Grover](groverpr), [Zheng Li](zhengli0817), [Julia Xu](SheliaXin), [Justin Tittelfitz](jtittelfitz), Anqi Cheng, [Jakub Zablocki](qbaza), Jianbo Liu, and [Hao Zhou](haozhouamzn)*

[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

The **Fraud Dataset Benchmark (FDB)** is a compilation of publicly available datasets relevant to **fraud detection** ([arXiv Link](https://arxiv.org/abs/2208.14417)). The FDB aims to cover a wide variety of fraud detection tasks, including card-not-present transaction fraud, bot attacks, malicious traffic, loan risk and content moderation. The Python-based data loaders from FDB provide dataset loading, standardized train-test splits and performance evaluation metrics. The goal of our work is to provide researchers working in the field of fraud and abuse detection with a standardized set of benchmarking datasets and evaluation tools for their experiments. Using FDB tools, we demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment and semi-supervised learning.

## Datasets used in FDB
Brief summary of the datasets used in FDB. Each dataset is described in detail in the [data sources section](#data-sources).

| **#** | **Dataset name** | **Dataset key** | **Fraud category** | **#Train** | **#Test** | **Class ratio (train)** | **#Feats** | **#Cat** | **#Num** | **#Text** | **#Enrichable** |
|-------|------------------------------------------------------------|-----------------|-------------------------------------|------------|-----------|-------------------------|------------|----------|----------|-----------|-----------------|
| 1 | IEEE-CIS Fraud Detection | ieeecis | Card Not Present Transactions Fraud | 561,013 | 28,527 | 3.50% | 67 | 6 | 61 | 0 | 0 |
| 2 | Credit Card Fraud Detection | ccfraud | Card Not Present Transactions Fraud | 227,845 | 56,962 | 0.18% | 28 | 0 | 28 | 0 | 0 |
| 3 | Fraud ecommerce | fraudecom | Card Not Present Transactions Fraud | 120,889 | 30,223 | 10.60% | 6 | 2 | 3 | 0 | 1 |
| 4 | Simulated Credit Card Transactions generated using Sparkov | sparknov | Card Not Present Transactions Fraud | 1,296,675 | 20,000 | 5.70% | 17 | 10 | 6 | 1 | 0 |
| 5 | Twitter Bots Accounts | twitterbot | Bot Attacks | 29,950 | 7,488 | 33.10% | 16 | 6 | 6 | 4 | 0 |
| 6 | Malicious URLs dataset | malurl | Malicious Traffic | 586,072 | 65,119 | 34.20% | 2 | 0 | 1 | 1 | 0 |
| 7 | Fake Job Posting Prediction | fakejob | Content Moderation | 14,304 | 3,576 | 4.70% | 16 | 10 | 1 | 5 | 0 |
| 8 | Vehicle Loan Default Prediction | vehicleloan | Credit Risk | 186,523 | 46,631 | 21.60% | 38 | 13 | 22 | 3 | 0 |
| 9 | IP Blocklist | ipblock | Malicious Traffic | 172,000 | 43,000 | 7% | 1 | 0 | 0 | 0 | 1 |

## Installation

### Requirements
- Kaggle account
    - **Important**: The `ieeecis` dataset requires you to [**join the IEEE-CIS competition**](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you can call the fdb API. Otherwise you will get `ApiException: (403)`.
- AWS account
- Python 3.7+
- Python requirements
    ```
    autogluon==0.4.2
    h2o==3.36.1.2
    boto3==1.20.21
    click==8.0.3
    click-plugins==1.1.1
    Faker==4.14.2
    joblib==1.0.0
    kaggle==1.5.12
    numpy==1.19.5
    pandas==1.1.2
    regex==2020.7.14
    scikit-learn==0.22.1
    scipy==1.5.4
    auto-sklearn==0.14.7
    dask==2022.8.1
    ```

### Step 1: Setup Kaggle CLI
The `FraudDatasetBenchmark` object loads datasets from the source (which in most cases is Kaggle), modifies/standardizes them on the fly, and provides train-test splits. So, the first step is to set up the Kaggle CLI on the machine being used to run Python. Use the instructions from the [How to Use Kaggle](https://www.kaggle.com/docs/api) guide. In particular, download the authentication token from "My Account" on Kaggle and save it at `~/.kaggle/kaggle.json` on Linux and OSX, or at `C:\Users\<Windows-username>\.kaggle\kaggle.json` on Windows. If the token is not there, an error will be raised. Hence, once you've downloaded the token, you should move it from your Downloads folder to this folder.

#### Step 1.2.
[Join the IEEE-CIS competition](https://www.kaggle.com/competitions/ieee-fraud-detection/overview) from your Kaggle account before you call `fdb.datasets` with `ieeecis`. Otherwise you will get `ApiException: (403)`.

### Step 2: Clone Repo
Once the Kaggle CLI is set up and installed, clone the GitHub repo using `git clone https://github.com/amazon-research/fraud-dataset-benchmark.git` if using HTTPS, or `git clone git@github.com:amazon-research/fraud-dataset-benchmark.git` if using SSH.

### Step 3: Install
Once the repo is cloned, `cd` to the repo from your terminal and type `pip install .`, which will install the required classes and methods.

## FraudDatasetBenchmark Usage
Usage is straightforward: you create a `dataset` object of the `FraudDatasetBenchmark` class and extract useful pieces such as train/test splits and eval_metrics.

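Because the loader pulls from Kaggle on first use, it may help to confirm that the token from Step 1 is in place before creating the object. The following is a minimal sketch, not part of FDB: the helper names `kaggle_token_path` and `has_kaggle_token` are hypothetical, and the check assumes the default token location under the home directory. It does not account for other ways of supplying Kaggle credentials, such as environment variables.

```python
from pathlib import Path


def kaggle_token_path(home=None):
    """Return the expected kaggle.json location under a home directory.

    Hypothetical helper; assumes the default ~/.kaggle/kaggle.json layout.
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_token(home=None):
    """Return True if a token file exists at the expected location."""
    return kaggle_token_path(home).is_file()


if __name__ == "__main__":
    if has_kaggle_token():
        print("Kaggle token found at", kaggle_token_path())
    else:
        print("Kaggle token missing; expected at", kaggle_token_path())
```

If the check reports a missing token, moving the downloaded `kaggle.json` into the printed location should resolve it for the default setup.
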
**Important note**: If you are running multiple experiments that require re-loading dataframes multiple times, the default setting of downloading from Kaggle before loading into a dataframe can exceed the account-level API limits. So, use the setting to persist the downloaded dataset and then load from the persisted data. During the first call of `FraudDatasetBenchmark()`, use `load_pre_downloaded=False, delete_downloaded=False`, and for subsequent calls, use `load_pre_downloaded=True, delete_downloaded=False`. The default setting is `load_pre_downloaded=False, delete_downloaded=True`.

```
from IPython.display import display  # display() is only built in inside notebooks
from fdb.datasets import FraudDatasetBenchmark

# all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud', 'fraudecom', 'twitterbot', 'ipblock']
key = 'ipblock'

obj = FraudDatasetBenchmark(
    key=key,
    load_pre_downloaded=False,  # default
    delete_downloaded=True,  # default
    add_random_values_if_real_na = {
        "EVENT_TIMESTAMP": True,
        "LABEL_TIMESTAMP": True,
        "ENTITY_ID": True,
        "ENTITY_TYPE": True,
        "EVENT_ID": True
    }  # default
)
print(obj.key)

print('Train set: ')
display(obj.train.head())
print(len(obj.train.columns))
print(obj.train.shape)

print('Test set: ')
display(obj.test.head())
print(obj.test.shape)

print('Test scores')
display(obj.test_labels.head())
print(obj.test_labels['EVENT_LABEL'].value_counts())
print(obj.train['EVENT_LABEL'].value_counts(normalize=True))
print('=========')
```

A notebook template for loading datasets with the FDB data loader is available at [scripts/examples/Test_FDB_Loader.ipynb](scripts/examples/Test_FDB_Loader.ipynb).

## Reproducibility
Reproducibility scripts are available at [scripts/reproducibility/](scripts/reproducibility/) in respective folders for [afd](scripts/reproducibility/afd), [autogluon](scripts/reproducibility/autogluon) and [h2o](scripts/reproducibility/h2o). Each folder also has a README with steps to reproduce.

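The persisted-download setting from the important note in the usage section can be wrapped in a small helper when the same dataset is loaded several times. This is a sketch under stated assumptions rather than part of FDB: `load_repeatedly` is a hypothetical name, and the loader class is passed in as an argument so the pattern can be tried without Kaggle access. Only the `key`, `load_pre_downloaded` and `delete_downloaded` arguments shown in the usage example are relied on.

```python
def load_repeatedly(key, n_runs, loader_cls):
    """Instantiate loader_cls n_runs times, downloading only on the first run.

    Hypothetical helper: loader_cls is expected to accept the same keyword
    arguments as FraudDatasetBenchmark in the usage example.
    """
    loaded = []
    for run in range(n_runs):
        obj = loader_cls(
            key=key,
            # First run downloads from the source; later runs reuse the files.
            load_pre_downloaded=(run > 0),
            # Never delete, so the persisted copy stays available.
            delete_downloaded=False,
        )
        loaded.append(obj)
    return loaded


# Intended use (needs fdb installed and Kaggle credentials, so left commented out):
# from fdb.datasets import FraudDatasetBenchmark
# runs = load_repeatedly("ipblock", 3, FraudDatasetBenchmark)
```

Passing the class in keeps the helper independent of FDB itself, which also makes it easy to substitute a stub when testing experiment code offline.
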
## Benchmark Results

| **Dataset key** | **AUC-ROC** | | | | |
|:---------------:|:-----------:|:-----------:|:-------------:|:----------------:|:----------------:|
| | **AFD OFI** | **AFD TFI** | **AutoGluon** | **H2O** | **Auto-sklearn** |
| ccfraud | 0.985 | 0.99 | 0.99 | **0.992** | 0.988 |
| fakejob | 0.987 | - | **0.998** | 0.99 | 0.983 |
| fraudecom | 0.519 | **0.636** | 0.522 | 0.518 | 0.515 |
| ieeecis | 0.938 | **0.94** | 0.855 | 0.89 | 0.932 |
| malurl | 0.985 | - | **0.998** | Training failure | 0.5 |
| sparknov | **0.998** | - | 0.997 | 0.997 | 0.995 |
| twitterbot | 0.934 | - | **0.943** | 0.938 | 0.936 |
| vehicleloan | **0.673** | - | 0.669 | 0.67 | 0.664 |
| ipblock | **0.937** | - | 0.804 | Training failure | 0.5 |

### ROC Curves
The numbers in the legend represent AUC-ROC from different models from our baseline evaluations on AutoML.

![roc curves](images/all_fdb.png)

## Data Sources
1. **IEEE-CIS Fraud Detection**
    - Source URL: https://www.kaggle.com/c/ieee-fraud-detection/overview
    - Source license: https://www.kaggle.com/competitions/ieee-fraud-detection/rules
    - Variables: Anonymized product, card, address, email domain, device, transaction date information. Numeric columns have name prefixes V, C, D and M, and their meaning is hidden from the public.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Vesta Corporation](https://www.vesta.io/)
    - Release date: 2019-10-03
    - Description: Prepared by the IEEE Computational Intelligence Society, this card-not-present transaction fraud dataset was launched during the IEEE-CIS Fraud Detection Kaggle competition, and was provided by Vesta Corporation. The original dataset contains 393 features which are reduced to 67 features in the benchmark. Feature selection was performed based on highly voted Kaggle kernels. The fraud rate in the training segment of the source dataset is 3.5%.
      We only used training files (train transaction and train identity) containing 590,540 transactions in the benchmark, and split them into train (95%) and test (5%) segments based on time. Based on the insights from a Kaggle kernel written by the competition winner, we added a UUID (called ENTITY_ID) that represents a fingerprint and was created using card, address, time and D1 features.

2. **Credit Card Fraud Detection**
    - Source URL: https://www.kaggle.com/mlg-ulb/creditcardfraud/
    - Source license: https://opendatacommons.org/licenses/dbcl/1-0/
    - Variables: PCA transformed features, time, amount (highly imbalanced)
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Machine Learning Group - ULB](https://mlg.ulb.ac.be/)
    - Release date: 2018-03-23
    - Description: This dataset contains anonymized credit card transactions by European cardholders in September 2013. The dataset contains 492 frauds out of 284,807 transactions over 2 days. The data only contains numerical features that are the result of a PCA transformation, plus non-transformed time and amount.

3. **Fraud ecommerce**
    - Source URL: https://www.kaggle.com/vbinh002/fraud-ecommerce
    - Source license: None
    - Variables: The features include sign up time, purchase time, purchase value, device id, user id, browser, and IP address. We added a new feature that measures the time difference between sign up and purchase, as the age of an account is often an important variable in fraud detection.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Binh Vu](https://www.kaggle.com/vbinh002)
    - Release date: 2018-12-09
    - Description: This dataset contains ~150k e-commerce transactions.

4. **Simulated Credit Card Transactions generated using Sparkov**
    - Source URL: https://www.kaggle.com/kartik2112/fraud-detection
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Transaction date, credit card number, merchant, category, amount, name, street, gender.
      All variables are synthetically generated using the Sparkov tool.
    - Fraud category: Card Not Present Transaction Fraud
    - Provider: [Kartik Shenoy](https://www.kaggle.com/kartik2112)
    - Release date: 2020-08-05
    - Description: This is a simulated credit card transaction dataset. The dataset was generated using the Sparkov Data Generation tool, and we modified a version of the dataset created for Kaggle. It covers transactions of 1000 customers with a pool of 800 merchants over 6 months. We used both train and test segments directly from the source and randomly downsampled the test segment.

5. **Twitter Bots Accounts**
    - Source URL: https://www.kaggle.com/code/davidmartngutirrez/bots-accounts-eda/data?select=twitter_human_bots_dataset.csv
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Features like account creation date, follower and following counts, profile description, account age, metadata about profile picture and account activity, and a label indicating whether the account is human or bot.
    - Fraud category: Bot Attacks
    - Provider: [David Martín Gutiérrez](https://www.kaggle.com/davidmartngutirrez)
    - Release date: 2020-08-20
    - Description: The dataset consists of 37,438 rows corresponding to different user accounts from Twitter.

6. **Malicious URLs dataset**
    - Source URL: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: url and type. Even though the original dataset has a multiclass label (type), we converted it into a binary label.
    - Fraud category: Malicious Traffic
    - Provider: [Manu Siddhartha](https://www.kaggle.com/sid321axn)
    - Release date: 2021-07-23
    - Description: The Kaggle dataset is curated using five different sources, and contains url and type.
      There is no timestamp information from the source. Therefore, we generate a dummy timestamp column for consistency.

7. **Real / Fake Job Posting Prediction**
    - Source URL: https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction
    - Source license: https://creativecommons.org/publicdomain/zero/1.0/
    - Variables: Title, location, department, company, salary range, requirements, description, benefits, telecommuting. Most of the variables are categorical or free-form text in nature.
    - Fraud category: Content Moderation
    - Provider: [Shivam Bansal](https://www.kaggle.com/shivamb)
    - Release date: 2020-02-29
    - Description: This Kaggle dataset contains 18K job descriptions, out of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The task is to train a classification model to detect which job posts are fraudulent.

8. **Vehicle Loan Default Prediction**
    - Source URL: https://www.kaggle.com/avikpaul4u/vehicle-loan-default-prediction
    - Source license: Unknown
    - Variables: Loanee information, loan information, credit bureau data, and history.
    - Fraud category: Credit Risk
    - Provider: [Avik Paul](https://www.kaggle.com/avikpaul4u)
    - Release date: 2019-11-12
    - Description: The task in this dataset is to determine the probability of vehicle loan default, particularly the risk of default on the first monthly installments. It contains data for 233k loans with a 21.7% default rate.

9. **IP Blocklist**
    - Source URL: http://cinsscore.com/list/ci-badguys.txt
    - Source license: Unknown
    - Variables: The dataset contains an IP address and a label indicating whether the address is malicious or fake. A dummy categorical variable that has no relation to the label is added.
    - Fraud category: Malicious Traffic
    - Provider: [CINSscore.com](http://cinsscore.com)
    - Release date: 2017-09-25
    - Description: This dataset is made up of malicious IP addresses from cinsscore.com.
      To the list of malicious IP addresses, we added randomly generated IP addresses using Faker, labeled as benign.

## Citation
```
@misc{grover2023fraud,
      title={Fraud Dataset Benchmark and Applications},
      author={Prince Grover and Julia Xu and Justin Tittelfitz and Anqi Cheng and Zheng Li and Jakub Zablocki and Jianbo Liu and Hao Zhou},
      year={2023},
      eprint={2208.14417},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

## License
This project is licensed under the MIT-0 License.

## Acknowledgement
We thank the creators of all datasets used in the benchmark and the organizations that have helped in hosting the datasets and making them widely available for research purposes.

================================================
FILE: scripts/examples/Test_FDB_Loader.ipynb
================================================
{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append('../../src/')\n", "from fdb.datasets import FraudDatasetBenchmark\n", "from fdb.kaggle_configs import KAGGLE_CONFIGS" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Notebook setups\n", "\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "from io import StringIO\n", "\n", "from IPython.core.display import display, HTML\n", "from IPython.display import clear_output\n", "display(HTML(\"\"))\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_colwidth', 200)\n", "pd.set_option('display.max_rows', 500)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "if os.path.exists('tmp'):\n", " shutil.rmtree('tmp')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# UNCOMMENT IF YOU NEED TO UPLOAD DATA TO AN S3 BUCKET IN YOUR ACCOUNT\n", "\n", "# 
import boto3\n", "# BUCKET=''\n", "\n", "# def _s3_upload(df):\n", "# csv_memory=StringIO()\n", "# df.to_csv(csv_memory, index=False)\n", "# content = csv_memory.getvalue()\n", "# s3_client.put_object(\n", "# Body=content,\n", "# Bucket=BUCKET,\n", "# Key=KEY,\n", "# ACL='bucket-owner-full-control')\n", "# s3_client = boto3.client('s3')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# All options for keys\n", "all_keys = ['fakejob', 'vehicleloan', 'malurl', 'ieeecis', 'ccfraud','fraudecom', 'twitterbot', 'ipblock']\n", "# all_keys = ['ipblock']\n", "# all_keys = ['twitterbot']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Default setting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default setting pulls data from the source to your system, modifies the data and adds random values for columns that are missing, if the add_random_values_if_real_na flags are True.\n", "\n", "Default parameters: \n", "- load_pre_downloaded: False\n", "- delete_downloaded: True\n", "- add_random_values_if_real_na = ```\n", "{\n", "\"EVENT_TIMESTAMP\": True,\n", "\"LABEL_TIMESTAMP\": True,\n", "\"ENTITY_ID\": True,\n", "\"ENTITY_TYPE\": True,\n", "\"EVENT_ID\": True\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "fakejob\n", "Train set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_ID \\\n", "5736 5737 \n", "7106 7107 \n", "11978 11979 \n", "9374 9375 \n", "1300 1301 \n", "\n", " title \\\n", "5736 Jr. Business Analyst & Quality Analyst (entry level) \n", "7106 English Teacher Abroad \n", "11978 SQL Server Database Developer Job opportunity at Barrington, IL \n", "9374 Legal Analyst - 12 Month FTC \n", "1300 Part-Time Finance Assistant \n", "\n", " location department salary_range \\\n", "5736 US, NJ, PISCATAWAY NaN NaN \n", "7106 US, PA, Scranton NaN NaN \n", "11978 US, IL, Barrington NaN 90000-100000 \n", "9374 GB, LND, London Legal NaN \n", "1300 GB, LND, NaN NaN \n", "\n", " company_profile \\\n", "5736 NaN \n", "7106 We help teachers get safe & secure jobs abroad :) \n", "11978 We are an innovative personnel-sourcing firm with solid team strength in recruiting candidates for various domains in the IT and Non-IT sectors. We offer a whole gamut of HR services such as sourc... \n", "9374 MarketInvoice is one of the most high-profile London based fin-tech companies. The Company is Europe’s leading P2P invoice finance platform that allows SMEs to quickly and flexibly sell their invo... \n", "1300 NaN \n", "\n", " description \\\n", "5736 Duration: Full time / W2Location: Piscataway,NJJob description: BA/QA We are looking to hire resources for our Financial & Health care clients.Candidate should have knowledge or experience in ... \n", "7106 Play with kids, get paid for it Love travel? Jobs in Asia$1,500+ USD monthly ($200 Cost of living)Housing provided (Private/Furnished)Airfare ReimbursedExcellent for student loans/credit cardsGabr... \n", "11978 Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil... \n", "9374 DescriptionOur mission at MarketInvoice is to modernise the way by which businesses finance their working capital and fund their growth. 
We are seeking to bring much-needed innovation to the banki... \n", "1300 Salary:£9 - £10 per hour We are currently going through an exciting period of change and a new client base, resulting in this part-time finance position being created. You will offer a flexible, a... \n", "\n", " requirements \\\n", "5736 What we require:-- Masters degree in Computers Science/ Information Technology/MBA.-- Candidates willing to relocates New Jersey. -- Excellent Communication skills. -- Quick learner, Ability to ad... \n", "7106 University degree required. TEFL / TESOL / CELTA or teaching experience preferred but not necessaryCanada/US passport holders only \n", "11978 Position : SQL Server Database DeveloperJob Location : Location: Barrington, ILUs work status required : H1B / EAD / Green Card / US Citizens Position Summary:The SQL Server Database Developer wil... \n", "9374 Duties and ResponsibilitiesReviewing contractual terms and advising on legal risksDrafting deeds, contracts and other legal documentationResearching and advising on ad hoc legal issuesManaging col... \n", "1300 Your role will be a varied, interesting and interactive role, and will likely to be approximately 15-20 hours per week (sometimes more) and will include: - Book-keeping via Sage Line 50 - Bank rec... \n", "\n", " benefits \\\n", "5736 NaN \n", "7106 See job description \n", "11978 Benefits - FullBonus Eligible - Yes \n", "9374 Competitive salaryPrivate HealthcareHalf price gym membership25 days holidayThe opportunity to progress your career at one of London's hottest FinTech startupsStart Date - as soon as possible.  
\n", "1300 Salary:£9 - £10 per hour  \n", "\n", " telecommuting has_company_logo has_questions employment_type \\\n", "5736 0 0 0 Full-time \n", "7106 0 1 1 Contract \n", "11978 0 0 0 Full-time \n", "9374 0 1 0 Full-time \n", "1300 0 0 0 Part-time \n", "\n", " required_experience required_education \\\n", "5736 Entry level Master's Degree \n", "7106 NaN Bachelor's Degree \n", "11978 Mid-Senior level Bachelor's Degree \n", "9374 Associate Professional \n", "1300 NaN NaN \n", "\n", " industry function \\\n", "5736 Financial Services Finance \n", "7106 Education Management NaN \n", "11978 Information Technology and Services Information Technology \n", "9374 Financial Services Legal \n", "1300 Accounting NaN \n", "\n", " EVENT_LABEL ENTITY_ID \\\n", "5736 0 382e41c8-f35c-4b5b-aa4d-fa0959ee7d4b \n", "7106 0 deadb697-08d2-4dca-83ec-a15d5e501a5b \n", "11978 0 f5fcea87-6798-4529-a6c7-205d893b9b24 \n", "9374 0 114fbd01-0573-42cf-9365-78729264e1aa \n", "1300 0 05a5dbdb-9778-4e4a-b967-7850dd483a54 \n", "\n", " EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n", "5736 2022-12-13T13:05:21Z 2023-05-05T08:46:09Z user \n", "7106 2022-07-26T01:40:53Z 2023-05-05T08:46:09Z user \n", "11978 2023-03-09T13:06:59Z 2023-05-05T08:46:09Z user \n", "9374 2022-12-09T08:17:07Z 2023-05-05T08:46:09Z user \n", "1300 2022-08-28T17:32:28Z 2023-05-05T08:46:09Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "22\n", "(14304, 22)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_ID title location \\\n", "0 10 Customer Service Associate - Part Time US, AZ, Phoenix \n", "1 15 Account Executive - Sydney AU, NSW, Sydney \n", "2 16 VP of Sales - Vault Dragon SG, 01, Singapore \n", "3 19 Visual Designer US, NY, New York \n", "4 21 Marketing Assistant US, TX, Austin \n", "\n", " department salary_range \\\n", "0 NaN NaN \n", "1 Sales NaN \n", "2 Sales 120000-150000 \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " company_profile \\\n", "0 Novitex Enterprise Solutions, formerly Pitney Bowes Management Services, delivers innovative document and communications management solutions that help companies around the world drive business pr... \n", "1 Adthena is the UK’s leading competitive intelligence service for Google search advertisers. Adthena is loved by major brands and digital agencies alike and provides a great opportunity to work in ... \n", "2 Jungle Ventures is the leading Singapore based, entrepreneur backed, venture capital firm, that funds and actively supports start-ups in scaling across Asia Pacific. We pride ourselves on leading ... \n", "3 Kettle is an independent digital agency based in New York City and the Bay Area. We’re committed to making digital do more — for both people and brands — because we believe the digital world offer... \n", "4 IntelliBright was created to leverage enterprise level online business practices to generate exclusive leads on behalf of our medium and small business clients across a wide variety of verticals. ... \n", "\n", " description \\\n", "0 The Customer Service Associate will be based in Phoenix, AZ. The right candidate will be an integral part of our talented team, supporting our continued growth.Responsibilities:Perform various Mai... \n", "1 Are you interested in a satisfying and financially rewarding role in a high growth technology company? You’ll work in a casual yet high energy environment alongside passionate people delivering th... 
\n", "2 About Vault Dragon Vault Dragon is Dropbox for your physical stuff - a startup that is changing the aesthetic face of Singapore by creating more space in households and offices. We also save count... \n", "3 Kettle is hiring a Visual Designer!Job Location: New York, NYKettle is a growing digital agency focused on delivering outstanding products, and we’ve been working hard to find equally outstanding ... \n", "4 IntelliBright is growing fast and is looking for a Marketing Assistant to join our team. Your invaluable input will help our small to midsize business clientele to achieve their greatest potential... \n", "\n", " requirements \\\n", "0 Minimum Requirements:Minimum of 6 months customer service related experience requiredHigh school diploma or equivalent (GED) requiredValid Driver's License and good driving record requiredPreferre... \n", "1 You’ll need to be smart and passionate and have 2 years experience selling software/Saas ideally including familiarity with PPC and marketing technologies. Excellent presentation and communication... \n", "2 Key Superpowers3-5 years of high-pressure sales experience, but if you absorb knowledge like a sponge and keep getting promoted we are flexiblePreferably mastery of both phone and field sales for ... \n", "3 NaN \n", "4 Job RequirementsAssist in creating client online marketing campaignsConduct research on various industry niches to determine potential partnership opportunities and make decisions on which website... \n", "\n", " benefits \\\n", "0 NaN \n", "1 In return we'll pay you well, give you some ownership in the company (stock options) and importantly provide you with excellent opportunities for advancement and professional development. Oh, and ... \n", "2 Basic: SGD 120,000Equity negotiable for a rock starGround floor opportunity to make a difference and do things as Dean said \"my way\"Hire and train your own superhero sales team, the way you wantMa... 
\n", "3 NaN \n", "4 NaN \n", "\n", " telecommuting has_company_logo has_questions employment_type \\\n", "0 0 1 0 Part-time \n", "1 0 1 0 Full-time \n", "2 0 1 1 Full-time \n", "3 0 1 0 NaN \n", "4 0 1 0 NaN \n", "\n", " required_experience required_education industry \\\n", "0 Entry level High School or equivalent Financial Services \n", "1 Associate Bachelor's Degree Internet \n", "2 Executive Bachelor's Degree Facilities Services \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "\n", " function ENTITY_ID \\\n", "0 Customer Service 1743dd4b-f989-4227-8480-cbafa760b4de \n", "1 Sales d5a82588-fcff-495b-aeda-20a8de0737d0 \n", "2 Sales 298d3508-76bb-4362-9ad4-f843fa3f99fa \n", "3 NaN cad2f705-4b22-4110-bb06-b34a47c62a6d \n", "4 Marketing 24c31ad9-95a9-479c-87c5-de6af06ddef6 \n", "\n", " EVENT_TIMESTAMP ENTITY_TYPE \n", "0 2022-12-31T18:14:06Z user \n", "1 2022-06-20T15:25:47Z user \n", "2 2022-10-30T20:49:56Z user \n", "3 2022-05-30T19:30:26Z user \n", "4 2022-12-05T07:48:39Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(3576, 20)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_LABEL\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 3389\n", "1 187\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.952531\n", "1 0.047469\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "vehicleloan\n", "Train set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_ID disbursed_amount asset_cost ltv branch_id supplier_id \\\n", "8976 462711 33484 62644 55.23 67 22727 \n", "76007 558674 66882 81187 84.37 2 23508 \n", "77677 528251 59113 71757 84.87 48 21478 \n", "209438 633950 56059 71307 81.34 146 18317 \n", "143261 476747 56759 67100 85.69 136 17783 \n", "\n", " manufacturer_id current_pincode_id date_of_birth employment_type \\\n", "8976 45 1511 16-06-1991 Salaried \n", "76007 86 1708 15-09-1994 Salaried \n", "77677 86 6322 01-01-1995 Self employed \n", "209438 86 2989 01-01-1971 Salaried \n", "143261 86 3793 03-12-1975 Self employed \n", "\n", " state_id employee_code_id mobileno_avl_flag aadhar_flag pan_flag \\\n", "8976 6 1201 1 1 0 \n", "76007 4 1060 1 1 0 \n", "77677 5 1189 1 1 0 \n", "209438 14 2964 1 1 0 \n", "143261 8 1295 1 1 0 \n", "\n", " voterid_flag driving_flag passport_flag perform_cns_score \\\n", "8976 0 0 0 743 \n", "76007 0 0 0 0 \n", "77677 0 0 0 738 \n", "209438 0 0 0 0 \n", "143261 0 0 0 0 \n", "\n", " perform_cns_score_description pri_no_of_accts pri_active_accts \\\n", "8976 C-Very Low Risk 9 5 \n", "76007 No Bureau History Available 0 0 \n", "77677 C-Very Low Risk 3 3 \n", "209438 No Bureau History Available 0 0 \n", "143261 No Bureau History Available 0 0 \n", "\n", " pri_overdue_accts pri_current_balance pri_sanctioned_amount \\\n", "8976 0 160423 230489 \n", "76007 0 0 0 \n", "77677 0 45828 58582 \n", "209438 0 0 0 \n", "143261 0 0 0 \n", "\n", " pri_disbursed_amount sec_no_of_accts sec_active_accts \\\n", "8976 194538 0 0 \n", "76007 0 0 0 \n", "77677 58582 0 0 \n", "209438 0 0 0 \n", "143261 0 0 0 \n", "\n", " sec_overdue_accts sec_current_balance sec_sanctioned_amount \\\n", "8976 0 0 0 \n", "76007 0 0 0 \n", "77677 0 0 0 \n", "209438 0 0 0 \n", "143261 0 0 0 \n", "\n", " sec_disbursed_amount primary_instal_amt sec_instal_amt \\\n", "8976 0 9149 0 \n", "76007 0 0 0 \n", "77677 0 4240 0 \n", "209438 0 0 0 \n", "143261 0 0 0 \n", "\n", " new_accts_in_last_six_months 
delinquent_accts_in_last_six_months \\\n", "8976 4 0 \n", "76007 0 0 \n", "77677 3 0 \n", "209438 0 0 \n", "143261 0 0 \n", "\n", " average_acct_age credit_history_length no_of_inquiries EVENT_LABEL \\\n", "8976 0yrs 7mon 1yrs 4mon 1 0 \n", "76007 0yrs 0mon 0yrs 0mon 0 0 \n", "77677 0yrs 2mon 0yrs 4mon 0 1 \n", "209438 0yrs 0mon 0yrs 0mon 0 1 \n", "143261 0yrs 0mon 0yrs 0mon 0 0 \n", "\n", " ENTITY_ID EVENT_TIMESTAMP \\\n", "8976 27b9d5e1-69de-47f2-a559-cfba34dffb5f 2022-09-20T06:58:09Z \n", "76007 1c58aced-df31-4170-8f85-e0dd95d1ff21 2022-08-25T18:27:59Z \n", "77677 fa383d19-de52-4a71-8222-77e328fcf387 2022-10-13T07:51:51Z \n", "209438 6aa0b3ef-8fff-4094-bc16-2a7ec4c00e37 2022-08-09T09:25:01Z \n", "143261 e00bb721-ce37-4d32-99e8-84f8a46cf82f 2022-06-27T20:32:23Z \n", "\n", " LABEL_TIMESTAMP ENTITY_TYPE \n", "8976 2023-05-05T08:46:09Z user \n", "76007 2023-05-05T08:46:09Z user \n", "77677 2023-05-05T08:46:09Z user \n", "209438 2023-05-05T08:46:09Z user \n", "143261 2023-05-05T08:46:09Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "44\n", "(186523, 44)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_ID disbursed_amount asset_cost ltv branch_id supplier_id \\\n", "0 420825 50578 58400 89.55 67 22807 \n", "1 518279 54513 61900 89.66 67 22807 \n", "2 510278 43894 61900 71.89 67 22807 \n", "3 510980 52603 61300 86.95 67 22807 \n", "4 513916 57713 65750 89.28 67 22807 \n", "\n", " manufacturer_id current_pincode_id date_of_birth employment_type state_id \\\n", "0 45 1441 01-01-1984 Salaried 6 \n", "1 45 1501 08-09-1990 Self employed 6 \n", "2 45 1501 04-10-1989 Salaried 6 \n", "3 45 1492 01-06-1968 Salaried 6 \n", "4 45 1440 01-06-1976 Self employed 6 \n", "\n", " employee_code_id mobileno_avl_flag aadhar_flag pan_flag voterid_flag \\\n", "0 1998 1 1 0 0 \n", "1 1998 1 1 0 0 \n", "2 1998 1 1 0 0 \n", "3 1998 1 0 0 1 \n", "4 1998 1 1 0 0 \n", "\n", " driving_flag passport_flag perform_cns_score \\\n", "0 0 0 0 \n", "1 0 0 825 \n", "2 0 0 17 \n", "3 0 0 818 \n", "4 0 0 300 \n", "\n", " perform_cns_score_description pri_no_of_accts \\\n", "0 No Bureau History Available 0 \n", "1 A-Very Low Risk 2 \n", "2 Not Scored: Not Enough Info available on the customer 1 \n", "3 A-Very Low Risk 1 \n", "4 M-Very High Risk 6 \n", "\n", " pri_active_accts pri_overdue_accts pri_current_balance \\\n", "0 0 0 0 \n", "1 0 0 0 \n", "2 1 0 72879 \n", "3 0 0 0 \n", "4 4 2 29069 \n", "\n", " pri_sanctioned_amount pri_disbursed_amount sec_no_of_accts sec_active_accts \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 74500 74500 0 0 \n", "3 0 0 0 0 \n", "4 1067200 1067200 0 0 \n", "\n", " sec_overdue_accts sec_current_balance sec_sanctioned_amount \\\n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", " sec_disbursed_amount primary_instal_amt sec_instal_amt \\\n", "0 0 0 0 \n", "1 0 1347 0 \n", "2 0 0 0 \n", "3 0 2608 0 \n", "4 0 47100 0 \n", "\n", " new_accts_in_last_six_months delinquent_accts_in_last_six_months \\\n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 1 1 \n", "\n", " average_acct_age credit_history_length no_of_inquiries 
\\\n", "0 0yrs 0mon 0yrs 0mon 0 \n", "1 1yrs 9mon 2yrs 0mon 0 \n", "2 0yrs 2mon 0yrs 2mon 0 \n", "3 1yrs 7mon 1yrs 7mon 0 \n", "4 2yrs 6mon 5yrs 6mon 0 \n", "\n", " ENTITY_ID EVENT_TIMESTAMP ENTITY_TYPE \n", "0 03cf53e2-5c0b-4809-8333-04560101987b 2022-12-29T10:25:40Z user \n", "1 03166b12-ee18-4144-aa73-10a3d2ac999a 2022-08-07T20:17:18Z user \n", "2 ff0fc8f9-c524-45cc-99b4-139dd726d7cd 2022-11-03T09:35:54Z user \n", "3 8955bac7-5812-4e5f-b3ae-22738ee5e701 2023-02-19T06:55:03Z user \n", "4 a8154baa-1407-493a-bbc2-4bc1fd30d1f9 2022-08-14T11:20:39Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(46631, 42)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_LABEL\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 36323\n", "1 10308\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.783925\n", "1 0.216075\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "malurl\n", "Train set: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " url \\\n", "167113 apolloduck.co.za/ \n", "387680 acronyms.thefreedictionary.com/WDOM \n", "528900 https://nepan.org.np/Alibaba/Alibaba.com/Login.htm \n", "251286 soundonsound.com/sos/aug06/articles/rogernichols_0806.htm?print=yes \n", "433650 ottawakiosk.com/hill_cam.html \n", "\n", " EVENT_LABEL EVENT_ID \\\n", "167113 0 d16773dd-0077-4129-a39d-f935464bd07f \n", "387680 0 b40b1f9e-9218-4a65-8b8e-870d45feb368 \n", "528900 1 86c52fda-2f6f-41ee-aa15-a7b682138cc9 \n", "251286 0 447529b9-923c-43e0-afed-c570e037f1aa \n", "433650 0 976080b6-500f-4de3-95c4-a4c2679e672b \n", "\n", " ENTITY_ID EVENT_TIMESTAMP \\\n", "167113 5e694594-fcfa-418e-8417-21c5e99b8d8a 2022-05-15T15:36:37Z \n", "387680 8d1aea20-97bb-46c4-bf56-3dc935f5c116 2022-06-28T06:32:21Z \n", "528900 fce90a90-3ce2-475c-ac7d-a0d6c8fa784a 2022-06-11T21:40:20Z \n", "251286 c4a96aba-24b1-4cc4-a7b8-f9c0a9a34546 2022-08-15T12:11:14Z \n", "433650 21497a05-52ce-4a25-a4d4-361b8298dbc1 2022-08-19T15:47:51Z \n", "\n", " LABEL_TIMESTAMP ENTITY_TYPE dummy_cat \n", "167113 2023-05-05T08:46:09Z user 87edb1a6-7936-4afa-b7be-4c35b7f1a5c6 \n", "387680 2023-05-05T08:46:09Z user 864a0704-ab05-49c3-8a0c-5b0b23b3eeef \n", "528900 2023-05-05T08:46:09Z user 7ef071fc-a143-4d52-bd88-2a21f2b16c56 \n", "251286 2023-05-05T08:46:09Z user 2709ea1a-f5a7-4ecc-8dbe-767910778226 \n", "433650 2023-05-05T08:46:09Z user 752bff63-ad3b-4845-b975-7f6f7302402c " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "8\n", "(586072, 8)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " url \\\n", "0 http://buzzfil.net/m/show-art/ils-etaient-loin-de-s-imaginer-que-le-hibou-allait-faire-ceci-quand-ils-filmaient-2.html \n", "1 cyndislist.com/us/pa/counties \n", "2 https://docs.google.com/spreadsheet/viewform?formkey=dGg2Z1lCUHlSdjllTVNRUW50TFIzSkE6MQ \n", "3 articles.baltimoresun.com/1991-06-11/sports/1991162162_1_james-koehler-texas-rangers-terrell-lowery \n", "4 kitsapsun.com/photos/2011/feb/25/177999/ \n", "\n", " EVENT_ID ENTITY_ID \\\n", "0 b4233390-3167-401d-a85f-27331078ff27 3fd82c9f-b26a-44dc-ac26-4a635690938c \n", "1 77d73435-251f-43fa-a82c-cc6ab4dbce6b 7ac20b7a-ee66-46ce-83da-703e095e9c87 \n", "2 87a47093-0039-445f-8002-87b6af3e709d eaea621e-895d-43cf-8bbb-93acac029c47 \n", "3 3143022e-ce02-441b-8ad0-5ebbf3c1c829 ba97f126-6159-4655-9c11-807c99807059 \n", "4 8885745c-4494-4f04-92a0-bb57006fe7aa b51cdf46-1467-45f0-9c9c-62233be01d0e \n", "\n", " EVENT_TIMESTAMP ENTITY_TYPE dummy_cat \n", "0 2022-11-20T12:29:18Z user f45a2001-81b6-4b29-bba9-e376cc9a4ca9 \n", "1 2022-12-26T07:01:46Z user a54af7c2-9dba-4aa2-9efd-1c4ef4e2eeb2 \n", "2 2022-06-25T00:29:41Z user 20e00a79-d5fc-49d1-b563-173e69f09434 \n", "3 2023-03-07T14:27:10Z user 5398bd49-ce09-4438-bfc3-24fce419c612 \n", "4 2022-12-07T01:31:11Z user 0ac04255-86df-47bc-8990-557f4c65fe0d " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(65119, 6)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_LABEL EVENT_ID\n", "0 0 b4233390-3167-401d-a85f-27331078ff27\n", "1 0 77d73435-251f-43fa-a82c-cc6ab4dbce6b\n", "2 1 87a47093-0039-445f-8002-87b6af3e709d\n", "3 0 3143022e-ce02-441b-8ad0-5ebbf3c1c829\n", "4 0 8885745c-4494-4f04-92a0-bb57006fe7aa" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 42695\n", "1 22424\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.657612\n", "1 0.342388\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ieeecis\n", "Train set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_LABEL transactionamt productcd card1 card2 card3 \\\n", "TransactionID \n", "2987000.0 0 68.5 W 13926.0 NaN 150.0 \n", "2987001.0 0 29.0 W 2755.0 404.0 150.0 \n", "2987002.0 0 59.0 W 4663.0 490.0 150.0 \n", "2987003.0 0 50.0 W 18132.0 567.0 150.0 \n", "2987004.0 0 50.0 H 4497.0 514.0 150.0 \n", "\n", " card5 card6 addr1 dist1 p_emaildomain r_emaildomain c1 \\\n", "TransactionID \n", "2987000.0 142.0 credit 315.0 19.0 NaN NaN 1.0 \n", "2987001.0 102.0 credit 325.0 NaN gmail.com NaN 1.0 \n", "2987002.0 166.0 debit 330.0 287.0 outlook.com NaN 1.0 \n", "2987003.0 117.0 debit 476.0 NaN yahoo.com NaN 2.0 \n", "2987004.0 102.0 credit 420.0 NaN gmail.com NaN 1.0 \n", "\n", " c2 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 \\\n", "TransactionID \n", "2987000.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 \n", "2987001.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 \n", "2987002.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 \n", "2987003.0 5.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 \n", "2987004.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 \n", "\n", " v62 v70 v76 v78 v82 v91 v127 v130 v139 \\\n", "TransactionID \n", "2987000.0 1.0 0.0 1.0 1.0 0.0 0.0 117.0 0.0 NaN \n", "2987001.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 NaN \n", "2987002.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 NaN \n", "2987003.0 1.0 0.0 1.0 1.0 1.0 0.0 1758.0 354.0 NaN \n", "2987004.0 NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 \n", "\n", " v160 v165 v187 v203 v207 v209 v210 v221 \\\n", "TransactionID \n", "2987000.0 NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987001.0 NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987002.0 NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987003.0 NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987004.0 169690.796875 5155.0 1.0 0.0 0.0 0.0 0.0 1.0 \n", "\n", " v234 v257 v258 v261 v264 v266 v267 v271 v274 v277 \\\n", "TransactionID \n", "2987000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987001.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987002.0 NaN NaN NaN NaN NaN NaN NaN 
NaN NaN NaN \n", "2987003.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2987004.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " v283 v285 v289 v291 v294 id_01 id_02 id_05 id_06 \\\n", "TransactionID \n", "2987000.0 1.0 0.0 0.0 1.0 1.0 NaN NaN NaN NaN \n", "2987001.0 1.0 0.0 0.0 1.0 0.0 NaN NaN NaN NaN \n", "2987002.0 1.0 0.0 0.0 1.0 0.0 NaN NaN NaN NaN \n", "2987003.0 0.0 10.0 0.0 1.0 38.0 NaN NaN NaN NaN \n", "2987004.0 1.0 0.0 0.0 1.0 0.0 0.0 70787.0 NaN NaN \n", "\n", " id_09 id_13 id_17 id_19 id_20 devicetype \\\n", "TransactionID \n", "2987000.0 NaN NaN NaN NaN NaN NaN \n", "2987001.0 NaN NaN NaN NaN NaN NaN \n", "2987002.0 NaN NaN NaN NaN NaN NaN \n", "2987003.0 NaN NaN NaN NaN NaN NaN \n", "2987004.0 NaN NaN 166.0 542.0 144.0 mobile \n", "\n", " deviceinfo \\\n", "TransactionID \n", "2987000.0 NaN \n", "2987001.0 NaN \n", "2987002.0 NaN \n", "2987003.0 NaN \n", "2987004.0 SAMSUNG SM-G892A Build/NRD90M \n", "\n", " EVENT_ID ENTITY_ID \\\n", "TransactionID \n", "2987000.0 c5ca20e9-c4e6-47da-bd6b-2e5ff6ea97f7 13926.0_315.0_-13.0 \n", "2987001.0 9aa1d670-7446-4979-8c09-87f02311d2ca 2755.0_325.0_1.0 \n", "2987002.0 4cdb1e2e-3c63-4e96-80a6-382d0ec97fe3 4663.0_330.0_1.0 \n", "2987003.0 d3e3803c-b1a3-4dfd-841d-30b8d2611364 18132.0_476.0_-111.0 \n", "2987004.0 2c013afb-7779-45db-a330-a5808d531372 4497.0_420.0_1.0 \n", "\n", " EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n", "TransactionID \n", "2987000.0 2021-01-02T00:00:00Z 2023-05-05T08:46:09Z user \n", "2987001.0 2021-01-02T00:00:01Z 2023-05-05T08:46:09Z user \n", "2987002.0 2021-01-02T00:01:09Z 2023-05-05T08:46:09Z user \n", "2987003.0 2021-01-02T00:01:39Z 2023-05-05T08:46:09Z user \n", "2987004.0 2021-01-02T00:01:46Z 2023-05-05T08:46:09Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "73\n", "(561013, 73)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " transactionamt productcd card1 card2 card3 card5 card6 \\\n", "TransactionID \n", "3548013.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n", "3548014.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n", "3548015.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n", "3548016.0 125.000000 S 15775.0 481.0 150.0 102.0 credit \n", "3548017.0 31.950001 W 9500.0 321.0 150.0 226.0 debit \n", "\n", " addr1 dist1 p_emaildomain r_emaildomain c1 c2 c4 c5 \\\n", "TransactionID \n", "3548013.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n", "3548014.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n", "3548015.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n", "3548016.0 330.0 NaN NaN yahoo.com 5.0 3.0 3.0 0.0 \n", "3548017.0 204.0 74.0 NaN NaN 3.0 3.0 0.0 1.0 \n", "\n", " c6 c7 c8 c9 c10 c11 c12 c13 c14 v62 v70 v76 \\\n", "TransactionID \n", "3548013.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n", "3548014.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n", "3548015.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n", "3548016.0 0.0 0.0 8.0 0.0 3.0 5.0 0.0 61.0 5.0 0.0 0.0 NaN \n", "3548017.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 6.0 3.0 1.0 1.0 1.0 \n", "\n", " v78 v82 v91 v127 v130 v139 v160 \\\n", "TransactionID \n", "3548013.0 NaN NaN NaN 109411.000000 2301.000000 0.0 2401.0 \n", "3548014.0 NaN NaN NaN 109536.000000 2301.000000 0.0 2401.0 \n", "3548015.0 NaN NaN NaN 109661.000000 2301.000000 0.0 2401.0 \n", "3548016.0 NaN NaN NaN 109786.000000 2301.000000 0.0 2401.0 \n", "3548017.0 2.0 1.0 1.0 27.950001 27.950001 NaN NaN \n", "\n", " v165 v187 v203 v207 v209 v210 v221 v234 \\\n", "TransactionID \n", "3548013.0 66104.0 1.0 103183.0 877.0 1961.0 465.0 0.0 73.0 \n", "3548014.0 66229.0 1.0 103308.0 877.0 1961.0 465.0 0.0 73.0 \n", "3548015.0 66354.0 1.0 103433.0 877.0 1961.0 465.0 0.0 73.0 \n", "3548016.0 66479.0 1.0 103558.0 877.0 1961.0 465.0 0.0 73.0 \n", "3548017.0 NaN NaN NaN NaN NaN NaN NaN NaN \n", "\n", " v257 v258 v261 v264 v266 v267 v271 v274 v277 
v283 \\\n", "TransactionID \n", "3548013.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n", "3548014.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n", "3548015.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n", "3548016.0 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 1.0 \n", "3548017.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 \n", "\n", " v285 v289 v291 v294 id_01 id_02 id_05 id_06 id_09 \\\n", "TransactionID \n", "3548013.0 26.0 1.0 2.0 926.0 -10.0 1411.0 6.0 0.0 0.0 \n", "3548014.0 26.0 1.0 2.0 927.0 -10.0 693.0 6.0 0.0 0.0 \n", "3548015.0 26.0 1.0 2.0 928.0 -10.0 1116.0 6.0 0.0 0.0 \n", "3548016.0 26.0 1.0 2.0 929.0 -10.0 1589.0 6.0 0.0 0.0 \n", "3548017.0 1.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN \n", "\n", " id_13 id_17 id_19 id_20 devicetype deviceinfo \\\n", "TransactionID \n", "3548013.0 52.0 166.0 633.0 533.0 desktop Windows \n", "3548014.0 52.0 166.0 633.0 533.0 desktop Windows \n", "3548015.0 52.0 166.0 633.0 533.0 desktop Windows \n", "3548016.0 52.0 166.0 633.0 533.0 desktop Windows \n", "3548017.0 NaN NaN NaN NaN NaN NaN \n", "\n", " EVENT_ID ENTITY_ID \\\n", "TransactionID \n", "3548013.0 569c4257-3d62-466d-a806-e3b456b2b372 15775.0_330.0_129.0 \n", "3548014.0 e951afe6-b895-42b8-adff-df0f812e9ee8 15775.0_330.0_129.0 \n", "3548015.0 cd69e301-8c15-42b3-9839-cc4c8b9d89db 15775.0_330.0_129.0 \n", "3548016.0 71431bc1-19ec-49b6-a00f-4e8c7d121b02 15775.0_330.0_129.0 \n", "3548017.0 de297b4c-d372-4fd3-8c66-ab6ff0c19e16 9500.0_204.0_150.0 \n", "\n", " EVENT_TIMESTAMP ENTITY_TYPE \n", "TransactionID \n", "3548013.0 2021-06-21T23:11:15Z user \n", "3548014.0 2021-06-21T23:11:29Z user \n", "3548015.0 2021-06-21T23:11:45Z user \n", "3548016.0 2021-06-21T23:12:00Z user \n", "3548017.0 2021-06-21T23:12:11Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(29527, 71)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_LABEL EVENT_ID\n", "TransactionID \n", "3548013.0 0 569c4257-3d62-466d-a806-e3b456b2b372\n", "3548014.0 0 e951afe6-b895-42b8-adff-df0f812e9ee8\n", "3548015.0 0 cd69e301-8c15-42b3-9839-cc4c8b9d89db\n", "3548016.0 0 71431bc1-19ec-49b6-a00f-4e8c7d121b02\n", "3548017.0 0 de297b4c-d372-4fd3-8c66-ab6ff0c19e16" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 28358\n", "1 1169\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.965252\n", "1 0.034748\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/pringrov/opt/anaconda3/lib/python3.9/site-packages/fdb/preprocessing.py:260: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ccfraud\n", "Train set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " v1 v2 v3 \\\n", "0 -1.3598071336738 -0.0727811733098497 2.53634673796914 \n", "1 1.19185711131486 0.26615071205963 0.16648011335321 \n", "2 -1.35835406159823 -1.34016307473609 1.77320934263119 \n", "3 -0.966271711572087 -0.185226008082898 1.79299333957872 \n", "4 -1.15823309349523 0.877736754848451 1.548717846511 \n", "\n", " v4 v5 v6 \\\n", "0 1.37815522427443 -0.338320769942518 0.462387777762292 \n", "1 0.448154078460911 0.0600176492822243 -0.0823608088155687 \n", "2 0.379779593034328 -0.503198133318193 1.80049938079263 \n", "3 -0.863291275036453 -0.0103088796030823 1.24720316752486 \n", "4 0.403033933955121 -0.407193377311653 0.0959214624684256 \n", "\n", " v7 v8 v9 \\\n", "0 0.239598554061257 0.0986979012610507 0.363786969611213 \n", "1 -0.0788029833323113 0.0851016549148104 -0.255425128109186 \n", "2 0.791460956450422 0.247675786588991 -1.51465432260583 \n", "3 0.23760893977178 0.377435874652262 -1.38702406270197 \n", "4 0.592940745385545 -0.270532677192282 0.817739308235294 \n", "\n", " v10 v11 v12 \\\n", "0 0.0907941719789316 -0.551599533260813 -0.617800855762348 \n", "1 -0.166974414004614 1.61272666105479 1.06523531137287 \n", "2 0.207642865216696 0.624501459424895 0.066083685268831 \n", "3 -0.0549519224713749 -0.226487263835401 0.178228225877303 \n", "4 0.753074431976354 -0.822842877946363 0.53819555014995 \n", "\n", " v13 v14 v15 \\\n", "0 -0.991389847235408 -0.311169353699879 1.46817697209427 \n", "1 0.48909501589608 -0.143772296441519 0.635558093258208 \n", "2 0.717292731410831 -0.165945922763554 2.34586494901581 \n", "3 0.507756869957169 -0.28792374549456 -0.631418117709045 \n", "4 1.3458515932154 -1.11966983471731 0.175121130008994 \n", "\n", " v16 v17 v18 \\\n", "0 -0.470400525259478 0.207971241929242 0.0257905801985591 \n", "1 0.463917041022171 -0.114804663102346 -0.183361270123994 \n", "2 -2.89008319444231 1.10996937869599 -0.121359313195888 \n", "3 -1.0596472454325 -0.684092786345479 1.96577500349538 \n", "4 
-0.451449182813529 -0.237033239362776 -0.0381947870352842 \n", "\n", " v19 v20 v21 \\\n", "0 0.403992960255733 0.251412098239705 -0.018306777944153 \n", "1 -0.145783041325259 -0.0690831352230203 -0.225775248033138 \n", "2 -2.26185709530414 0.524979725224404 0.247998153469754 \n", "3 -1.2326219700892 -0.208037781160366 -0.108300452035545 \n", "4 0.803486924960175 0.408542360392758 -0.00943069713232919 \n", "\n", " v22 v23 v24 \\\n", "0 0.277837575558899 -0.110473910188767 0.0669280749146731 \n", "1 -0.638671952771851 0.101288021253234 -0.339846475529127 \n", "2 0.771679401917229 0.909412262347719 -0.689280956490685 \n", "3 0.00527359678253453 -0.190320518742841 -1.17557533186321 \n", "4 0.79827849458971 -0.137458079619063 0.141266983824769 \n", "\n", " v25 v26 v27 \\\n", "0 0.128539358273528 -0.189114843888824 0.133558376740387 \n", "1 0.167170404418143 0.125894532368176 -0.00898309914322813 \n", "2 -0.327641833735251 -0.139096571514147 -0.0553527940384261 \n", "3 0.647376034602038 -0.221928844458407 0.0627228487293033 \n", "4 -0.206009587619756 0.502292224181569 0.219422229513348 \n", "\n", " v28 amount EVENT_LABEL \\\n", "0 -0.0210530534538215 149.62 0 \n", "1 0.0147241691924927 2.69 0 \n", "2 -0.0597518405929204 378.66 0 \n", "3 0.0614576285006353 123.5 0 \n", "4 0.215153147499206 69.99 0 \n", "\n", " EVENT_ID ENTITY_ID \\\n", "0 f8e77dc0-44ef-490c-b0de-8b4054b5a031 266103ff-71f2-4057-981d-a54821367237 \n", "1 b557449e-6b35-4be0-991e-337f764f5e21 f85083b2-d31f-4b9e-9d49-eb85c0476f6e \n", "2 d78d879c-eb7c-455d-8fde-6b1205080a4a 237ca488-c695-402c-b30f-0544554ea96c \n", "3 ef448a36-2763-449c-a54a-a9e05af20967 9964b305-b591-4ed0-bff1-8adca81d0194 \n", "4 e333b3c0-83ae-42dc-a865-178496653029 87b2fbf2-5b7d-479c-85f5-d989bd701f36 \n", "\n", " EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n", "0 2021-09-01T00:00:00Z 2023-05-05T08:46:09Z user \n", "1 2021-09-01T00:00:00Z 2023-05-05T08:46:09Z user \n", "2 2021-09-01T00:01:00Z 2023-05-05T08:46:09Z user \n", "3 
2021-09-01T00:01:00Z 2023-05-05T08:46:09Z user \n", "4 2021-09-01T00:02:00Z 2023-05-05T08:46:09Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "35\n", "(227845, 35)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
\n", "
|  | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 | v14 | v15 | v16 | v17 | v18 | v19 | v20 | v21 | v22 | v23 | v24 | v25 | v26 | v27 | v28 | amount | EVENT_ID | ENTITY_ID | EVENT_TIMESTAMP | ENTITY_TYPE |
| 227845 | 1.91402682161454 | -0.490067987909997 | -0.326111312515118 | 0.604710739174721 | -0.8501359998436 | -0.736318677031096 | -0.524057962475328 | -0.0886141066361987 | 1.09112510472248 | 0.093484357816225 | -0.892304625856107 | 0.0272205159068718 | -0.243790209618721 | 0.0317740067189187 | 0.900623897113791 | 0.536032161644219 | -0.648408094097169 | 0.183072340001028 | -0.48632249422331 | -0.13957876335222 | 0.210958428878652 | 0.639337879054097 | 0.147522551988298 | 0.0736542664022496 | -0.318378246601246 | 0.350612262707235 | -0.0238434747433154 | -0.0371393315055126 | 50 | bd64c6f1-1c1d-49ea-8561-6cc56bd2a173 | ee6232a9-6ba4-4654-b406-72e582f01031 | 2021-12-10T20:48:00Z | user |
| 227846 | 2.15269624649984 | -0.036160786158066 | -2.23181098049803 | 0.0917658435583919 | 0.537612206488446 | -1.36810250972644 | 0.613326738349479 | -0.455251954849699 | 0.29181359004335 | 0.253161344559488 | -1.50188197076942 | -0.870607641524177 | -1.44173756499372 | 0.988756626201074 | 0.496349234837293 | -0.0686989613348823 | -0.454073497932566 | -0.299095262736551 | 0.267443131415241 | -0.275777914750361 | 0.0171533555339963 | 0.0632416225359206 | -0.0345611249491173 | -0.626866212626912 | 0.249213129413917 | 0.773930519516097 | -0.137114784582898 | -0.0906106088420727 | 14.95 | 6728a9b7-ab9c-404e-93a8-fcf76baf7e8e | 3dc93b80-f110-4355-b516-5174a0cd214d | 2021-12-10T20:49:00Z | user |
| 227847 | -4.03479516717275 | 2.30507905571504 | -1.46169292457709 | -0.729887055238227 | -1.5287503399573 | -1.22567909778369 | -0.893353679497868 | 1.62252199369554 | 1.29199841774415 | -0.0409558359937061 | -0.971425287697512 | 0.574743695630458 | 0.155656078919204 | -0.729054997889385 | 0.477438947999659 | 1.06171851569252 | 0.93469475367536 | 0.403768792198479 | -0.494929851777981 | -0.0810925858921718 | -0.392556502541116 | -0.78759906251576 | 0.343467795972994 | -0.0903313999840935 | 0.248286972151669 | -0.238523845342424 | 0.26648354183946 | -0.0622361634691654 | 7.7 | 1f4a3cae-3a95-48b7-8cc9-dd2258689f37 | 58879cd9-4053-4e16-9144-3b04c276f74e | 2021-12-10T20:49:00Z | user |
| 227848 | -1.66874106862583 | 1.16805471760364 | 0.249642461553748 | -1.26849748925032 | 0.785922573014156 | -0.663958562166729 | 0.859432973616895 | 0.0681106263347446 | -0.144183044927318 | 0.0432880841287975 | 0.542013736060061 | 1.00202450469061 | 0.400759595743433 | 0.136412487776037 | -1.28964902448879 | 0.276827961550432 | -0.868491702025561 | -0.366839507131127 | -0.187391599008302 | -0.0335233340620367 | -0.247543775399679 | -0.592536769878023 | -0.286693549546811 | -0.378855664973759 | -0.0774289041638705 | 0.0676084004301294 | -0.27896200360197 | -0.0641926690992577 | 6.99 | 930cd5cb-b226-4af5-8dda-574340d05a12 | bb616582-e509-4c77-9154-755ca81039c4 | 2021-12-10T20:49:00Z | user |
| 227849 | -0.550678353341949 | -0.429004102182237 | -1.29189255347072 | -0.414409226593379 | -0.292228538671312 | 0.071842939235058 | 2.42606795091335 | -0.212729758223082 | 0.412374372851086 | -1.93996940549555 | -1.81011838293809 | -1.22351031687552 | -1.32491464932768 | -1.46239178995552 | -0.31164055759838 | 0.506707760378257 | 0.739932584638577 | 0.892422017204659 | 0.195042529037103 | 0.791126747715284 | 0.00303193944814891 | -0.645782978858753 | 0.877016475964068 | -1.22852893747944 | -0.0362812174160739 | -0.110609895882901 | -0.0983803135271981 | 0.0959849443846813 | 460.71 | 2e909126-def3-4d82-9485-03798817c942 | 88ea4bc9-29fd-4302-913d-e6788cb7e6ab | 2021-12-10T20:50:00Z | user |
\n", "
" ], "text/plain": [ " v1 v2 v3 \\\n", "227845 1.91402682161454 -0.490067987909997 -0.326111312515118 \n", "227846 2.15269624649984 -0.036160786158066 -2.23181098049803 \n", "227847 -4.03479516717275 2.30507905571504 -1.46169292457709 \n", "227848 -1.66874106862583 1.16805471760364 0.249642461553748 \n", "227849 -0.550678353341949 -0.429004102182237 -1.29189255347072 \n", "\n", " v4 v5 v6 \\\n", "227845 0.604710739174721 -0.8501359998436 -0.736318677031096 \n", "227846 0.0917658435583919 0.537612206488446 -1.36810250972644 \n", "227847 -0.729887055238227 -1.5287503399573 -1.22567909778369 \n", "227848 -1.26849748925032 0.785922573014156 -0.663958562166729 \n", "227849 -0.414409226593379 -0.292228538671312 0.071842939235058 \n", "\n", " v7 v8 v9 \\\n", "227845 -0.524057962475328 -0.0886141066361987 1.09112510472248 \n", "227846 0.613326738349479 -0.455251954849699 0.29181359004335 \n", "227847 -0.893353679497868 1.62252199369554 1.29199841774415 \n", "227848 0.859432973616895 0.0681106263347446 -0.144183044927318 \n", "227849 2.42606795091335 -0.212729758223082 0.412374372851086 \n", "\n", " v10 v11 v12 \\\n", "227845 0.093484357816225 -0.892304625856107 0.0272205159068718 \n", "227846 0.253161344559488 -1.50188197076942 -0.870607641524177 \n", "227847 -0.0409558359937061 -0.971425287697512 0.574743695630458 \n", "227848 0.0432880841287975 0.542013736060061 1.00202450469061 \n", "227849 -1.93996940549555 -1.81011838293809 -1.22351031687552 \n", "\n", " v13 v14 v15 \\\n", "227845 -0.243790209618721 0.0317740067189187 0.900623897113791 \n", "227846 -1.44173756499372 0.988756626201074 0.496349234837293 \n", "227847 0.155656078919204 -0.729054997889385 0.477438947999659 \n", "227848 0.400759595743433 0.136412487776037 -1.28964902448879 \n", "227849 -1.32491464932768 -1.46239178995552 -0.31164055759838 \n", "\n", " v16 v17 v18 \\\n", "227845 0.536032161644219 -0.648408094097169 0.183072340001028 \n", "227846 -0.0686989613348823 -0.454073497932566 -0.299095262736551 \n", 
"227847 1.06171851569252 0.93469475367536 0.403768792198479 \n", "227848 0.276827961550432 -0.868491702025561 -0.366839507131127 \n", "227849 0.506707760378257 0.739932584638577 0.892422017204659 \n", "\n", " v19 v20 v21 \\\n", "227845 -0.48632249422331 -0.13957876335222 0.210958428878652 \n", "227846 0.267443131415241 -0.275777914750361 0.0171533555339963 \n", "227847 -0.494929851777981 -0.0810925858921718 -0.392556502541116 \n", "227848 -0.187391599008302 -0.0335233340620367 -0.247543775399679 \n", "227849 0.195042529037103 0.791126747715284 0.00303193944814891 \n", "\n", " v22 v23 v24 \\\n", "227845 0.639337879054097 0.147522551988298 0.0736542664022496 \n", "227846 0.0632416225359206 -0.0345611249491173 -0.626866212626912 \n", "227847 -0.78759906251576 0.343467795972994 -0.0903313999840935 \n", "227848 -0.592536769878023 -0.286693549546811 -0.378855664973759 \n", "227849 -0.645782978858753 0.877016475964068 -1.22852893747944 \n", "\n", " v25 v26 v27 \\\n", "227845 -0.318378246601246 0.350612262707235 -0.0238434747433154 \n", "227846 0.249213129413917 0.773930519516097 -0.137114784582898 \n", "227847 0.248286972151669 -0.238523845342424 0.26648354183946 \n", "227848 -0.0774289041638705 0.0676084004301294 -0.27896200360197 \n", "227849 -0.0362812174160739 -0.110609895882901 -0.0983803135271981 \n", "\n", " v28 amount EVENT_ID \\\n", "227845 -0.0371393315055126 50 bd64c6f1-1c1d-49ea-8561-6cc56bd2a173 \n", "227846 -0.0906106088420727 14.95 6728a9b7-ab9c-404e-93a8-fcf76baf7e8e \n", "227847 -0.0622361634691654 7.7 1f4a3cae-3a95-48b7-8cc9-dd2258689f37 \n", "227848 -0.0641926690992577 6.99 930cd5cb-b226-4af5-8dda-574340d05a12 \n", "227849 0.0959849443846813 460.71 2e909126-def3-4d82-9485-03798817c942 \n", "\n", " ENTITY_ID EVENT_TIMESTAMP ENTITY_TYPE \n", "227845 ee6232a9-6ba4-4654-b406-72e582f01031 2021-12-10T20:48:00Z user \n", "227846 3dc93b80-f110-4355-b516-5174a0cd214d 2021-12-10T20:49:00Z user \n", "227847 58879cd9-4053-4e16-9144-3b04c276f74e 2021-12-10T20:49:00Z 
user \n", "227848 bb616582-e509-4c77-9154-755ca81039c4 2021-12-10T20:49:00Z user \n", "227849 88ea4bc9-29fd-4302-913d-e6788cb7e6ab 2021-12-10T20:50:00Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(56962, 33)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
\n", "
|  | EVENT_LABEL | EVENT_ID |
| 227845 | 0 | bd64c6f1-1c1d-49ea-8561-6cc56bd2a173 |
| 227846 | 0 | 6728a9b7-ab9c-404e-93a8-fcf76baf7e8e |
| 227847 | 0 | 1f4a3cae-3a95-48b7-8cc9-dd2258689f37 |
| 227848 | 0 | 930cd5cb-b226-4af5-8dda-574340d05a12 |
| 227849 | 0 | 2e909126-def3-4d82-9485-03798817c942 |
\n", "
" ], "text/plain": [ " EVENT_LABEL EVENT_ID\n", "227845 0 bd64c6f1-1c1d-49ea-8561-6cc56bd2a173\n", "227846 0 6728a9b7-ab9c-404e-93a8-fcf76baf7e8e\n", "227847 0 1f4a3cae-3a95-48b7-8cc9-dd2258689f37\n", "227848 0 930cd5cb-b226-4af5-8dda-574340d05a12\n", "227849 0 2e909126-def3-4d82-9485-03798817c942" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 56887\n", "1 75\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.99817\n", "1 0.00183\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "fraudecom\n", "Train set: \n" ] }, { "data": { "text/html": [ "
\n", "
|  | EVENT_ID | purchase_value | ENTITY_ID | source | browser | age | ip_address | EVENT_LABEL | time_since_signup | EVENT_TIMESTAMP | LABEL_TIMESTAMP | ENTITY_TYPE |
| 115086 | 309557 | 14 | BBPACGBUVJUXF | Ads | Chrome | 38 | 119.75.87.223 | 1 | 1 | 2021-01-01T00:00:44Z | 2023-05-05T08:46:09Z | user |
| 41990 | 124539 | 14 | BBPACGBUVJUXF | Ads | Chrome | 38 | 119.75.87.223 | 1 | 1 | 2021-01-01T00:00:45Z | 2023-05-05T08:46:09Z | user |
| 134836 | 161246 | 14 | BBPACGBUVJUXF | Ads | Chrome | 38 | 119.75.87.223 | 1 | 1 | 2021-01-01T00:00:46Z | 2023-05-05T08:46:09Z | user |
| 24572 | 356414 | 14 | BBPACGBUVJUXF | Ads | Chrome | 38 | 119.75.87.223 | 1 | 1 | 2021-01-01T00:00:47Z | 2023-05-05T08:46:09Z | user |
| 106160 | 338656 | 14 | BBPACGBUVJUXF | Ads | Chrome | 38 | 119.75.87.223 | 1 | 1 | 2021-01-01T00:00:48Z | 2023-05-05T08:46:09Z | user |
\n", "
" ], "text/plain": [ " EVENT_ID purchase_value ENTITY_ID source browser age \\\n", "115086 309557 14 BBPACGBUVJUXF Ads Chrome 38 \n", "41990 124539 14 BBPACGBUVJUXF Ads Chrome 38 \n", "134836 161246 14 BBPACGBUVJUXF Ads Chrome 38 \n", "24572 356414 14 BBPACGBUVJUXF Ads Chrome 38 \n", "106160 338656 14 BBPACGBUVJUXF Ads Chrome 38 \n", "\n", " ip_address EVENT_LABEL time_since_signup EVENT_TIMESTAMP \\\n", "115086 119.75.87.223 1 1 2021-01-01T00:00:44Z \n", "41990 119.75.87.223 1 1 2021-01-01T00:00:45Z \n", "134836 119.75.87.223 1 1 2021-01-01T00:00:46Z \n", "24572 119.75.87.223 1 1 2021-01-01T00:00:47Z \n", "106160 119.75.87.223 1 1 2021-01-01T00:00:48Z \n", "\n", " LABEL_TIMESTAMP ENTITY_TYPE \n", "115086 2023-05-05T08:46:09Z user \n", "41990 2023-05-05T08:46:09Z user \n", "134836 2023-05-05T08:46:09Z user \n", "24572 2023-05-05T08:46:09Z user \n", "106160 2023-05-05T08:46:09Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "12\n", "(120889, 12)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
\n", "
|  | EVENT_ID | purchase_value | ENTITY_ID | source | browser | age | ip_address | time_since_signup | EVENT_TIMESTAMP | ENTITY_TYPE |
| 69628 | 304435 | 50 | EFASVBVKDGQKI | Ads | Chrome | 40 | 202.165.191.211 | 35310 | 2021-08-30T15:18:56Z | user |
| 120573 | 222177 | 30 | LUAQDRQGTDVHQ | SEO | Chrome | 39 | 2.82.213.23 | 52655 | 2021-08-30T15:20:03Z | user |
| 105050 | 308836 | 35 | ODWUMTCAPBLXP | Ads | FireFox | 20 | 73.185.82.155 | 35083 | 2021-08-30T15:20:35Z | user |
| 118037 | 202515 | 20 | LTOEZIQLNHGAC | Ads | IE | 37 | 108.236.13.248 | 4032 | 2021-08-30T15:27:14Z | user |
| 6094 | 260389 | 46 | GMTRBZCZVBKQC | Ads | Chrome | 34 | 129.163.194.162 | 19237 | 2021-08-30T15:28:27Z | user |
\n", "
" ], "text/plain": [ " EVENT_ID purchase_value ENTITY_ID source browser age \\\n", "69628 304435 50 EFASVBVKDGQKI Ads Chrome 40 \n", "120573 222177 30 LUAQDRQGTDVHQ SEO Chrome 39 \n", "105050 308836 35 ODWUMTCAPBLXP Ads FireFox 20 \n", "118037 202515 20 LTOEZIQLNHGAC Ads IE 37 \n", "6094 260389 46 GMTRBZCZVBKQC Ads Chrome 34 \n", "\n", " ip_address time_since_signup EVENT_TIMESTAMP ENTITY_TYPE \n", "69628 202.165.191.211 35310 2021-08-30T15:18:56Z user \n", "120573 2.82.213.23 52655 2021-08-30T15:20:03Z user \n", "105050 73.185.82.155 35083 2021-08-30T15:20:35Z user \n", "118037 108.236.13.248 4032 2021-08-30T15:27:14Z user \n", "6094 129.163.194.162 19237 2021-08-30T15:28:27Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(30223, 10)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
\n", "
|  | EVENT_LABEL |
| 69628 | 0 |
| 120573 | 0 |
| 105050 | 0 |
| 118037 | 0 |
| 6094 | 0 |
\n", "
" ], "text/plain": [ " EVENT_LABEL\n", "69628 0\n", "120573 0\n", "105050 0\n", "118037 0\n", "6094 0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 28834\n", "1 1389\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.894432\n", "1 0.105568\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "twitterbot\n", "Train set: \n" ] }, { "data": { "text/html": [ "
\n", "
|  | unnamed: 0 | created_at | default_profile | default_profile_image | description | favourites_count | followers_count | friends_count | geo_enabled | EVENT_ID | lang | location | profile_background_image_url | profile_image_url | screen_name | statuses_count | verified | average_tweets_per_day | account_age_days | EVENT_LABEL | ENTITY_ID | EVENT_TIMESTAMP | LABEL_TIMESTAMP | ENTITY_TYPE |
| 20963 | 20963 | 2013-05-27 21:22:15 | True | False | WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery. | 32374 | 2395 | 2823 | True | 1463172686 | en | Mount Morris, MI | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg | blerdwords | 11448 | False | 4.336 | 2640 | 0 | d300a2e5-86e1-488a-8ca6-6b49cc517164 | 2022-05-07T14:44:33Z | 2023-05-05T08:46:09Z | user |
| 6331 | 6331 | 2009-09-14 18:58:36 | False | False | Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo | 68664 | 350789 | 1528 | False | 74231747 | en | Los Angeles - Always a Texan | http://abs.twimg.com/images/themes/theme9/bg.gif | http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg | JennyJohnsonHi5 | 18732 | True | 4.694 | 3991 | 0 | c253258d-91c5-483c-ba5e-c357551adf16 | 2023-03-05T02:17:19Z | 2023-05-05T08:46:09Z | user |
| 17209 | 17209 | 2010-06-06 16:27:08 | True | False | NaN | 74 | 54 | 657 | False | 152688783 | NaN | Abu Dhabi | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png | AbnerJosh | 161 | False | 0.043 | 3726 | 1 | 7af565e8-c19c-4132-b5f7-b017efad7951 | 2022-11-03T20:32:13Z | 2023-05-05T08:46:09Z | user |
| 23964 | 23964 | 2010-06-22 21:56:09 | False | False | Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom. | 1517 | 55881 | 991 | True | 158502985 | en | Regina, SK Canada | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg | GlobalRegina | 103379 | True | 27.865 | 3710 | 0 | 28e07864-c08a-43b8-9cc9-423c25254b0b | 2022-06-30T20:31:18Z | 2023-05-05T08:46:09Z | user |
| 30569 | 30569 | 2009-03-10 02:26:45 | False | False | Detritus | 2616 | 1118405 | 657 | False | 23544268 | no | unknown | http://abs.twimg.com/images/themes/theme11/bg.gif | http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg | OfficialKat | 4980 | True | 1.191 | 4180 | 0 | 96724009-efa7-4558-8bad-3aeaa7bfdea5 | 2022-05-11T12:37:16Z | 2023-05-05T08:46:09Z | user |
\n", "
" ], "text/plain": [ " unnamed: 0 created_at default_profile default_profile_image \\\n", "20963 20963 2013-05-27 21:22:15 True False \n", "6331 6331 2009-09-14 18:58:36 False False \n", "17209 17209 2010-06-06 16:27:08 True False \n", "23964 23964 2010-06-22 21:56:09 False False \n", "30569 30569 2009-03-10 02:26:45 False False \n", "\n", " description \\\n", "20963 WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery. \n", "6331 Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo \n", "17209 NaN \n", "23964 Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom. \n", "30569 Detritus \n", "\n", " favourites_count followers_count friends_count geo_enabled EVENT_ID \\\n", "20963 32374 2395 2823 True 1463172686 \n", "6331 68664 350789 1528 False 74231747 \n", "17209 74 54 657 False 152688783 \n", "23964 1517 55881 991 True 158502985 \n", "30569 2616 1118405 657 False 23544268 \n", "\n", " lang location \\\n", "20963 en Mount Morris, MI \n", "6331 en Los Angeles - Always a Texan \n", "17209 NaN Abu Dhabi \n", "23964 en Regina, SK Canada \n", "30569 no unknown \n", "\n", " profile_background_image_url \\\n", "20963 http://abs.twimg.com/images/themes/theme1/bg.png \n", "6331 http://abs.twimg.com/images/themes/theme9/bg.gif \n", "17209 http://abs.twimg.com/images/themes/theme1/bg.png \n", "23964 http://abs.twimg.com/images/themes/theme1/bg.png \n", "30569 http://abs.twimg.com/images/themes/theme11/bg.gif \n", "\n", " profile_image_url \\\n", "20963 http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg \n", "6331 http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg \n", "17209 
http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png \n", "23964 http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg \n", "30569 http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg \n", "\n", " screen_name statuses_count verified average_tweets_per_day \\\n", "20963 blerdwords 11448 False 4.336 \n", "6331 JennyJohnsonHi5 18732 True 4.694 \n", "17209 AbnerJosh 161 False 0.043 \n", "23964 GlobalRegina 103379 True 27.865 \n", "30569 OfficialKat 4980 True 1.191 \n", "\n", " account_age_days EVENT_LABEL ENTITY_ID \\\n", "20963 2640 0 d300a2e5-86e1-488a-8ca6-6b49cc517164 \n", "6331 3991 0 c253258d-91c5-483c-ba5e-c357551adf16 \n", "17209 3726 1 7af565e8-c19c-4132-b5f7-b017efad7951 \n", "23964 3710 0 28e07864-c08a-43b8-9cc9-423c25254b0b \n", "30569 4180 0 96724009-efa7-4558-8bad-3aeaa7bfdea5 \n", "\n", " EVENT_TIMESTAMP LABEL_TIMESTAMP ENTITY_TYPE \n", "20963 2022-05-07T14:44:33Z 2023-05-05T08:46:09Z user \n", "6331 2023-03-05T02:17:19Z 2023-05-05T08:46:09Z user \n", "17209 2022-11-03T20:32:13Z 2023-05-05T08:46:09Z user \n", "23964 2022-06-30T20:31:18Z 2023-05-05T08:46:09Z user \n", "30569 2022-05-11T12:37:16Z 2023-05-05T08:46:09Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "24\n", "(29950, 24)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
\n", "
|  | unnamed: 0 | created_at | default_profile | default_profile_image | description | favourites_count | followers_count | friends_count | geo_enabled | EVENT_ID | lang | location | profile_background_image_url | profile_image_url | screen_name | statuses_count | verified | average_tweets_per_day | account_age_days | ENTITY_ID | EVENT_TIMESTAMP | ENTITY_TYPE |
| 0 | 1 | 2016-11-09 05:01:30 | False | False | Photographing the American West since 1980. I specialize in location portraits & events, both indoors & outside, using natural light & portable studio lighting. | 536 | 860 | 880 | False | 796216118331310080 | en | Estados Unidos | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/802329632838037504/CQN6gP7k_normal.jpg | CJRubinPhoto | 252 | False | 0.183 | 1379 | 0a8d3859-dec4-4ba6-abae-74b5523b042f | 2022-08-04T08:04:08Z | user |
| 1 | 9 | 2012-02-14 15:33:48 | False | False | Man Utd fan. mostly here for football. Takes photos. Ex care worker, does stuff with computers often sarcastic. 🐝🇬🇧🇪🇺 | 36384 | 2130 | 3363 | True | 492306486 | en | United Kingdom | http://abs.twimg.com/images/themes/theme14/bg.gif | http://pbs.twimg.com/profile_images/1211318786512609281/e6UqYEa4_normal.jpg | GhamGraham | 63376 | False | 20.391 | 3108 | 38dd52b1-b065-4328-a620-7b0f549f501c | 2023-04-06T07:47:08Z | user |
| 2 | 10 | 2011-12-09 14:11:56 | False | False | Stay hungry, Stay foolish. | 127 | 32 | 0 | False | 432537664 | en | in the clouds. | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/809682832645824512/WOJIsCmg_normal.jpg | jainabdulaziz | 921 | False | 0.29 | 3175 | 233d17cc-42f6-42f6-910f-0d0117ae8456 | 2022-07-18T17:09:36Z | user |
| 3 | 14 | 2010-11-03 15:40:20 | False | False | Femminista, animalista, antiproibizionista... E altre -ista che ora non ricordo. Ora dirigo il sito de #LeIene 👊🏿 | 4071 | 252142 | 562 | True | 211550281 | it | Italy | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/1046488383684521988/KrnhpZyJ_normal.jpg | giuliainnocenzi | 6029 | True | 1.686 | 3576 | 7925f5e0-c8d8-41f0-ad77-eeb5511a30d7 | 2022-05-23T16:54:39Z | user |
| 4 | 15 | 2012-08-20 11:58:04 | False | False | Mi viene da vomitare. \\n\\n \\nDove sono le mie Jordan? | 81967 | 5281 | 581 | True | 769392715 | it | Oblio | http://abs.twimg.com/images/themes/theme1/bg.png | http://pbs.twimg.com/profile_images/1271034506041085953/xWUZXxMm_normal.jpg | RichiMasu | 106263 | False | 36.391 | 2920 | 9457b540-f911-4869-84a2-a6b758a7739e | 2022-08-26T22:34:20Z | user |
\n", "
" ], "text/plain": [ " unnamed: 0 created_at default_profile default_profile_image \\\n", "0 1 2016-11-09 05:01:30 False False \n", "1 9 2012-02-14 15:33:48 False False \n", "2 10 2011-12-09 14:11:56 False False \n", "3 14 2010-11-03 15:40:20 False False \n", "4 15 2012-08-20 11:58:04 False False \n", "\n", " description \\\n", "0 Photographing the American West since 1980. I specialize in location portraits & events, both indoors & outside, using natural light & portable studio lighting. \n", "1 Man Utd fan. mostly here for football. Takes photos. Ex care worker, does stuff with computers often sarcastic. 🐝🇬🇧🇪🇺 \n", "2 Stay hungry, Stay foolish. \n", "3 Femminista, animalista, antiproibizionista... E altre -ista che ora non ricordo. Ora dirigo il sito de #LeIene 👊🏿 \n", "4 Mi viene da vomitare. \\n\\n \\nDove sono le mie Jordan? \n", "\n", " favourites_count followers_count friends_count geo_enabled \\\n", "0 536 860 880 False \n", "1 36384 2130 3363 True \n", "2 127 32 0 False \n", "3 4071 252142 562 True \n", "4 81967 5281 581 True \n", "\n", " EVENT_ID lang location \\\n", "0 796216118331310080 en Estados Unidos \n", "1 492306486 en United Kingdom \n", "2 432537664 en in the clouds. 
\n", "3 211550281 it Italy \n", "4 769392715 it Oblio \n", "\n", " profile_background_image_url \\\n", "0 http://abs.twimg.com/images/themes/theme1/bg.png \n", "1 http://abs.twimg.com/images/themes/theme14/bg.gif \n", "2 http://abs.twimg.com/images/themes/theme1/bg.png \n", "3 http://abs.twimg.com/images/themes/theme1/bg.png \n", "4 http://abs.twimg.com/images/themes/theme1/bg.png \n", "\n", " profile_image_url \\\n", "0 http://pbs.twimg.com/profile_images/802329632838037504/CQN6gP7k_normal.jpg \n", "1 http://pbs.twimg.com/profile_images/1211318786512609281/e6UqYEa4_normal.jpg \n", "2 http://pbs.twimg.com/profile_images/809682832645824512/WOJIsCmg_normal.jpg \n", "3 http://pbs.twimg.com/profile_images/1046488383684521988/KrnhpZyJ_normal.jpg \n", "4 http://pbs.twimg.com/profile_images/1271034506041085953/xWUZXxMm_normal.jpg \n", "\n", " screen_name statuses_count verified average_tweets_per_day \\\n", "0 CJRubinPhoto 252 False 0.183 \n", "1 GhamGraham 63376 False 20.391 \n", "2 jainabdulaziz 921 False 0.29 \n", "3 giuliainnocenzi 6029 True 1.686 \n", "4 RichiMasu 106263 False 36.391 \n", "\n", " account_age_days ENTITY_ID \\\n", "0 1379 0a8d3859-dec4-4ba6-abae-74b5523b042f \n", "1 3108 38dd52b1-b065-4328-a620-7b0f549f501c \n", "2 3175 233d17cc-42f6-42f6-910f-0d0117ae8456 \n", "3 3576 7925f5e0-c8d8-41f0-ad77-eeb5511a30d7 \n", "4 2920 9457b540-f911-4869-84a2-a6b758a7739e \n", "\n", " EVENT_TIMESTAMP ENTITY_TYPE \n", "0 2022-08-04T08:04:08Z user \n", "1 2023-04-06T07:47:08Z user \n", "2 2022-07-18T17:09:36Z user \n", "3 2022-05-23T16:54:39Z user \n", "4 2022-08-26T22:34:20Z user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(7488, 22)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
\n", "
|  | EVENT_LABEL |
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
\n", "
" ], "text/plain": [ " EVENT_LABEL\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 4987\n", "1 2501\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.668648\n", "1 0.331352\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n", "ipblock\n", "Train set: \n" ] }, { "data": { "text/html": [ "
\n", "
|  | ip | EVENT_LABEL | EVENT_TIMESTAMP | LABEL_TIMESTAMP | ENTITY_TYPE | EVENT_ID | ENTITY_ID | dummy_cat |
| 0 | 128.1.248.44 | 1 | 2021-11-16T04:03:42Z | 2022-06-01T20:30:04Z | user | 27dd3612-b997-4e9a-9442-eb08e0f7f923 | 068b7a8c-8d4a-49a3-ab3a-e4d905ace4cc | 1253a4fb-cbfe-4e43-bc4c-4ecbe1cf58da |
| 1 | 119.46.34.11 | 0 | 2022-04-22T04:24:50Z | 2022-06-01T20:30:04Z | user | 19474b6d-0af8-4610-b80e-485a43276e8a | 9e41adf9-fc4d-4078-a005-c9a85950c858 | 63c81521-a604-4923-9c82-0e82878042d7 |
| 2 | 186.172.135.47 | 0 | 2022-05-04T19:47:16Z | 2022-06-01T20:30:04Z | user | 0db63c1e-dd12-4b2b-a39c-254af2176a83 | fea535c1-1b52-411d-a38e-adcbec57a95d | 9ded7f8a-d6fe-414a-a2da-c5f09588ebce |
| 3 | 181.133.0.112 | 0 | 2022-02-25T04:37:01Z | 2022-06-01T20:30:04Z | user | 8be34510-e76d-4b78-bb4f-b8f721b8abe5 | 10d04fd3-db0b-4096-bfbc-4262b280ed69 | 0d8f329f-dda5-47f0-bd16-430f8745b4d5 |
| 4 | 51.4.204.17 | 0 | 2022-06-01T06:11:56Z | 2022-06-01T20:30:04Z | user | 1ae9a3e9-b410-4f36-a3bf-8b466fea97c1 | 27f40e1b-8a49-43ab-9b68-5ec24469845f | df1cb347-9712-4743-9801-d11eb415d823 |
\n", "
" ], "text/plain": [ " ip EVENT_LABEL EVENT_TIMESTAMP LABEL_TIMESTAMP \\\n", "0 128.1.248.44 1 2021-11-16T04:03:42Z 2022-06-01T20:30:04Z \n", "1 119.46.34.11 0 2022-04-22T04:24:50Z 2022-06-01T20:30:04Z \n", "2 186.172.135.47 0 2022-05-04T19:47:16Z 2022-06-01T20:30:04Z \n", "3 181.133.0.112 0 2022-02-25T04:37:01Z 2022-06-01T20:30:04Z \n", "4 51.4.204.17 0 2022-06-01T06:11:56Z 2022-06-01T20:30:04Z \n", "\n", " ENTITY_TYPE EVENT_ID \\\n", "0 user 27dd3612-b997-4e9a-9442-eb08e0f7f923 \n", "1 user 19474b6d-0af8-4610-b80e-485a43276e8a \n", "2 user 0db63c1e-dd12-4b2b-a39c-254af2176a83 \n", "3 user 8be34510-e76d-4b78-bb4f-b8f721b8abe5 \n", "4 user 1ae9a3e9-b410-4f36-a3bf-8b466fea97c1 \n", "\n", " ENTITY_ID dummy_cat \n", "0 068b7a8c-8d4a-49a3-ab3a-e4d905ace4cc 1253a4fb-cbfe-4e43-bc4c-4ecbe1cf58da \n", "1 9e41adf9-fc4d-4078-a005-c9a85950c858 63c81521-a604-4923-9c82-0e82878042d7 \n", "2 fea535c1-1b52-411d-a38e-adcbec57a95d 9ded7f8a-d6fe-414a-a2da-c5f09588ebce \n", "3 10d04fd3-db0b-4096-bfbc-4262b280ed69 0d8f329f-dda5-47f0-bd16-430f8745b4d5 \n", "4 27f40e1b-8a49-43ab-9b68-5ec24469845f df1cb347-9712-4743-9801-d11eb415d823 " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "8\n", "(172000, 8)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ipEVENT_TIMESTAMPENTITY_TYPEEVENT_IDENTITY_IDdummy_cat
01.10.226.562022-03-25T13:09:37Zuserc6bcbff7-7c2e-4780-8007-ed29cea7535b1f18a08e-c6ab-4d32-a210-3aed5533c2723a966adb-01d2-487a-b6ac-c4075be9aff0
11.116.89.2512021-12-19T04:06:53Zuser3a1e89c3-9bd1-4f32-8b7e-69c97e2fba9224625027-898a-4ed0-8e95-485b3fd3966353e2a419-4652-4011-a9fb-6de2a4a1cfcd
21.117.176.1862021-10-02T02:10:34Zuser1a634e22-b87c-4981-ad66-587a82bfa6e82767ff1e-a6d1-4f3e-8f99-dbd2d47152c5d48e48ca-fcf2-4594-8d37-6511b8ff51c7
31.117.207.862021-10-31T12:18:58Zuserdc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c37cd5f610-28c1-451a-a86c-75ce1d6b03853ae02507-69ef-46ba-b435-2abced9560fa
41.13.17.1842022-03-08T14:20:40Zuser59012ee9-8f28-48f2-a133-10cd558ed31998f6bf66-a87e-4f59-a2d3-e5f67302361176ed0fbf-70be-4702-9c3e-5abfba2dd536
\n", "
" ], "text/plain": [ " ip EVENT_TIMESTAMP ENTITY_TYPE \\\n", "0 1.10.226.56 2022-03-25T13:09:37Z user \n", "1 1.116.89.251 2021-12-19T04:06:53Z user \n", "2 1.117.176.186 2021-10-02T02:10:34Z user \n", "3 1.117.207.86 2021-10-31T12:18:58Z user \n", "4 1.13.17.184 2022-03-08T14:20:40Z user \n", "\n", " EVENT_ID ENTITY_ID \\\n", "0 c6bcbff7-7c2e-4780-8007-ed29cea7535b 1f18a08e-c6ab-4d32-a210-3aed5533c272 \n", "1 3a1e89c3-9bd1-4f32-8b7e-69c97e2fba92 24625027-898a-4ed0-8e95-485b3fd39663 \n", "2 1a634e22-b87c-4981-ad66-587a82bfa6e8 2767ff1e-a6d1-4f3e-8f99-dbd2d47152c5 \n", "3 dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3 7cd5f610-28c1-451a-a86c-75ce1d6b0385 \n", "4 59012ee9-8f28-48f2-a133-10cd558ed319 98f6bf66-a87e-4f59-a2d3-e5f673023611 \n", "\n", " dummy_cat \n", "0 3a966adb-01d2-487a-b6ac-c4075be9aff0 \n", "1 53e2a419-4652-4011-a9fb-6de2a4a1cfcd \n", "2 d48e48ca-fcf2-4594-8d37-6511b8ff51c7 \n", "3 3ae02507-69ef-46ba-b435-2abced9560fa \n", "4 76ed0fbf-70be-4702-9c3e-5abfba2dd536 " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(43000, 6)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EVENT_LABELEVENT_ID
01c6bcbff7-7c2e-4780-8007-ed29cea7535b
113a1e89c3-9bd1-4f32-8b7e-69c97e2fba92
211a634e22-b87c-4981-ad66-587a82bfa6e8
31dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3
4159012ee9-8f28-48f2-a133-10cd558ed319
\n", "
" ], "text/plain": [ " EVENT_LABEL EVENT_ID\n", "0 1 c6bcbff7-7c2e-4780-8007-ed29cea7535b\n", "1 1 3a1e89c3-9bd1-4f32-8b7e-69c97e2fba92\n", "2 1 1a634e22-b87c-4981-ad66-587a82bfa6e8\n", "3 1 dc3e7d0c-1bfd-49d6-ba7d-e9e6e325e8c3\n", "4 1 59012ee9-8f28-48f2-a133-10cd558ed319" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 40003\n", "1 2997\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.930215\n", "1 0.069785\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n" ] } ], "source": [ "# for key, val in KAGGLE_CONFIGS.items():\n", "for key in all_keys:\n", " obj = FraudDatasetBenchmark(key=key, )\n", " print(obj.key)\n", " print('Train set: ')\n", " display(obj.train.head())\n", " print(len(obj.train.columns))\n", " print(obj.train.shape)\n", " print('Test set: ')\n", " display(obj.test.head())\n", " print(obj.test.shape)\n", " print('Test scores')\n", " display(obj.test_labels.head())\n", " print(obj.test_labels['EVENT_LABEL'].value_counts())\n", " print(obj.train['EVENT_LABEL'].value_counts(normalize=True))\n", " print('=========','\\n')\n", "\n", "# KEY= f'public/official-dataset-names/{val[\"name\"]}/train.csv'\n", "# _s3_upload(obj.train)\n", "\n", "\n", "# KEY= f'public/official-dataset-names/{val[\"name\"]}/test.csv'\n", "# _s3_upload(obj.test)\n", "\n", "\n", "# KEY= f'public/official-dataset-names/{val[\"name\"]}/test_labels.csv'\n", "# _s3_upload(obj.test_labels)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Without random values in missing columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameter settings\n", "\n", "- load_pre_downloaded: False\n", "- delete_downloaded: True\n", "- add_random_values_if_real_na = ```\n", "{\n", "\"EVENT_TIMESTAMP\": False,\n", "\"LABEL_TIMESTAMP\": False,\n", "\"ENTITY_ID\": False,\n", "\"ENTITY_TYPE\": False,\n", "\"EVENT_ID\": False\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": 
{}, "outputs": [], "source": [ "all_keys = ['ccfraud']" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "ccfraud\n", "Train set: \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18v19v20v21v22v23v24v25v26v27v28amountEVENT_LABELEVENT_TIMESTAMP
0-1.3598071336738-0.07278117330984972.536346737969141.37815522427443-0.3383207699425180.4623877777622920.2395985540612570.09869790126105070.3637869696112130.0907941719789316-0.551599533260813-0.617800855762348-0.991389847235408-0.3111693536998791.46817697209427-0.4704005252594780.2079712419292420.02579058019855910.4039929602557330.251412098239705-0.0183067779441530.277837575558899-0.1104739101887670.06692807491467310.128539358273528-0.1891148438888240.133558376740387-0.0210530534538215149.6202021-09-01T00:00:00Z
11.191857111314860.266150712059630.166480113353210.4481540784609110.0600176492822243-0.0823608088155687-0.07880298333231130.0851016549148104-0.255425128109186-0.1669744140046141.612726661054791.065235311372870.48909501589608-0.1437722964415190.6355580932582080.463917041022171-0.114804663102346-0.183361270123994-0.145783041325259-0.0690831352230203-0.225775248033138-0.6386719527718510.101288021253234-0.3398464755291270.1671704044181430.125894532368176-0.008983099143228130.01472416919249272.6902021-09-01T00:00:00Z
2-1.35835406159823-1.340163074736091.773209342631190.379779593034328-0.5031981333181931.800499380792630.7914609564504220.247675786588991-1.514654322605830.2076428652166960.6245014594248950.0660836852688310.717292731410831-0.1659459227635542.34586494901581-2.890083194442311.10996937869599-0.121359313195888-2.261857095304140.5249797252244040.2479981534697540.7716794019172290.909412262347719-0.689280956490685-0.327641833735251-0.139096571514147-0.0553527940384261-0.0597518405929204378.6602021-09-01T00:01:00Z
3-0.966271711572087-0.1852260080828981.79299333957872-0.863291275036453-0.01030887960308231.247203167524860.237608939771780.377435874652262-1.38702406270197-0.0549519224713749-0.2264872638354010.1782282258773030.507756869957169-0.28792374549456-0.631418117709045-1.0596472454325-0.6840927863454791.96577500349538-1.2326219700892-0.208037781160366-0.1083004520355450.00527359678253453-0.190320518742841-1.175575331863210.647376034602038-0.2219288444584070.06272284872930330.0614576285006353123.502021-09-01T00:01:00Z
4-1.158233093495230.8777367548484511.5487178465110.403033933955121-0.4071933773116530.09592146246842560.592940745385545-0.2705326771922820.8177393082352940.753074431976354-0.8228428779463630.538195550149951.3458515932154-1.119669834717310.175121130008994-0.451449182813529-0.237033239362776-0.03819478703528420.8034869249601750.408542360392758-0.009430697132329190.79827849458971-0.1374580796190630.141266983824769-0.2060095876197560.5022922241815690.2194222295133480.21515314749920669.9902021-09-01T00:02:00Z
\n", "
" ], "text/plain": [ " v1 v2 v3 \\\n", "0 -1.3598071336738 -0.0727811733098497 2.53634673796914 \n", "1 1.19185711131486 0.26615071205963 0.16648011335321 \n", "2 -1.35835406159823 -1.34016307473609 1.77320934263119 \n", "3 -0.966271711572087 -0.185226008082898 1.79299333957872 \n", "4 -1.15823309349523 0.877736754848451 1.548717846511 \n", "\n", " v4 v5 v6 \\\n", "0 1.37815522427443 -0.338320769942518 0.462387777762292 \n", "1 0.448154078460911 0.0600176492822243 -0.0823608088155687 \n", "2 0.379779593034328 -0.503198133318193 1.80049938079263 \n", "3 -0.863291275036453 -0.0103088796030823 1.24720316752486 \n", "4 0.403033933955121 -0.407193377311653 0.0959214624684256 \n", "\n", " v7 v8 v9 \\\n", "0 0.239598554061257 0.0986979012610507 0.363786969611213 \n", "1 -0.0788029833323113 0.0851016549148104 -0.255425128109186 \n", "2 0.791460956450422 0.247675786588991 -1.51465432260583 \n", "3 0.23760893977178 0.377435874652262 -1.38702406270197 \n", "4 0.592940745385545 -0.270532677192282 0.817739308235294 \n", "\n", " v10 v11 v12 \\\n", "0 0.0907941719789316 -0.551599533260813 -0.617800855762348 \n", "1 -0.166974414004614 1.61272666105479 1.06523531137287 \n", "2 0.207642865216696 0.624501459424895 0.066083685268831 \n", "3 -0.0549519224713749 -0.226487263835401 0.178228225877303 \n", "4 0.753074431976354 -0.822842877946363 0.53819555014995 \n", "\n", " v13 v14 v15 \\\n", "0 -0.991389847235408 -0.311169353699879 1.46817697209427 \n", "1 0.48909501589608 -0.143772296441519 0.635558093258208 \n", "2 0.717292731410831 -0.165945922763554 2.34586494901581 \n", "3 0.507756869957169 -0.28792374549456 -0.631418117709045 \n", "4 1.3458515932154 -1.11966983471731 0.175121130008994 \n", "\n", " v16 v17 v18 \\\n", "0 -0.470400525259478 0.207971241929242 0.0257905801985591 \n", "1 0.463917041022171 -0.114804663102346 -0.183361270123994 \n", "2 -2.89008319444231 1.10996937869599 -0.121359313195888 \n", "3 -1.0596472454325 -0.684092786345479 1.96577500349538 \n", "4 
-0.451449182813529 -0.237033239362776 -0.0381947870352842 \n", "\n", " v19 v20 v21 \\\n", "0 0.403992960255733 0.251412098239705 -0.018306777944153 \n", "1 -0.145783041325259 -0.0690831352230203 -0.225775248033138 \n", "2 -2.26185709530414 0.524979725224404 0.247998153469754 \n", "3 -1.2326219700892 -0.208037781160366 -0.108300452035545 \n", "4 0.803486924960175 0.408542360392758 -0.00943069713232919 \n", "\n", " v22 v23 v24 \\\n", "0 0.277837575558899 -0.110473910188767 0.0669280749146731 \n", "1 -0.638671952771851 0.101288021253234 -0.339846475529127 \n", "2 0.771679401917229 0.909412262347719 -0.689280956490685 \n", "3 0.00527359678253453 -0.190320518742841 -1.17557533186321 \n", "4 0.79827849458971 -0.137458079619063 0.141266983824769 \n", "\n", " v25 v26 v27 \\\n", "0 0.128539358273528 -0.189114843888824 0.133558376740387 \n", "1 0.167170404418143 0.125894532368176 -0.00898309914322813 \n", "2 -0.327641833735251 -0.139096571514147 -0.0553527940384261 \n", "3 0.647376034602038 -0.221928844458407 0.0627228487293033 \n", "4 -0.206009587619756 0.502292224181569 0.219422229513348 \n", "\n", " v28 amount EVENT_LABEL EVENT_TIMESTAMP \n", "0 -0.0210530534538215 149.62 0 2021-09-01T00:00:00Z \n", "1 0.0147241691924927 2.69 0 2021-09-01T00:00:00Z \n", "2 -0.0597518405929204 378.66 0 2021-09-01T00:01:00Z \n", "3 0.0614576285006353 123.5 0 2021-09-01T00:01:00Z \n", "4 0.215153147499206 69.99 0 2021-09-01T00:02:00Z " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "31\n", "(227845, 31)\n", "Test set: \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
v1v2v3v4v5v6v7v8v9v10v11v12v13v14v15v16v17v18v19v20v21v22v23v24v25v26v27v28amountEVENT_TIMESTAMP
2278451.91402682161454-0.490067987909997-0.3261113125151180.604710739174721-0.8501359998436-0.736318677031096-0.524057962475328-0.08861410663619871.091125104722480.093484357816225-0.8923046258561070.0272205159068718-0.2437902096187210.03177400671891870.9006238971137910.536032161644219-0.6484080940971690.183072340001028-0.48632249422331-0.139578763352220.2109584288786520.6393378790540970.1475225519882980.0736542664022496-0.3183782466012460.350612262707235-0.0238434747433154-0.0371393315055126502021-12-10T20:48:00Z
2278462.15269624649984-0.036160786158066-2.231810980498030.09176584355839190.537612206488446-1.368102509726440.613326738349479-0.4552519548496990.291813590043350.253161344559488-1.50188197076942-0.870607641524177-1.441737564993720.9887566262010740.496349234837293-0.0686989613348823-0.454073497932566-0.2990952627365510.267443131415241-0.2757779147503610.01715335553399630.0632416225359206-0.0345611249491173-0.6268662126269120.2492131294139170.773930519516097-0.137114784582898-0.090610608842072714.952021-12-10T20:49:00Z
227847-4.034795167172752.30507905571504-1.46169292457709-0.729887055238227-1.5287503399573-1.22567909778369-0.8933536794978681.622521993695541.29199841774415-0.0409558359937061-0.9714252876975120.5747436956304580.155656078919204-0.7290549978893850.4774389479996591.061718515692520.934694753675360.403768792198479-0.494929851777981-0.0810925858921718-0.392556502541116-0.787599062515760.343467795972994-0.09033139998409350.248286972151669-0.2385238453424240.26648354183946-0.06223616346916547.72021-12-10T20:49:00Z
227848-1.668741068625831.168054717603640.249642461553748-1.268497489250320.785922573014156-0.6639585621667290.8594329736168950.0681106263347446-0.1441830449273180.04328808412879750.5420137360600611.002024504690610.4007595957434330.136412487776037-1.289649024488790.276827961550432-0.868491702025561-0.366839507131127-0.187391599008302-0.0335233340620367-0.247543775399679-0.592536769878023-0.286693549546811-0.378855664973759-0.07742890416387050.0676084004301294-0.27896200360197-0.06419266909925776.992021-12-10T20:49:00Z
227849-0.550678353341949-0.429004102182237-1.29189255347072-0.414409226593379-0.2922285386713120.0718429392350582.42606795091335-0.2127297582230820.412374372851086-1.93996940549555-1.81011838293809-1.22351031687552-1.32491464932768-1.46239178995552-0.311640557598380.5067077603782570.7399325846385770.8924220172046590.1950425290371030.7911267477152840.00303193944814891-0.6457829788587530.877016475964068-1.22852893747944-0.0362812174160739-0.110609895882901-0.09838031352719810.0959849443846813460.712021-12-10T20:50:00Z
\n", "
" ], "text/plain": [ " v1 v2 v3 \\\n", "227845 1.91402682161454 -0.490067987909997 -0.326111312515118 \n", "227846 2.15269624649984 -0.036160786158066 -2.23181098049803 \n", "227847 -4.03479516717275 2.30507905571504 -1.46169292457709 \n", "227848 -1.66874106862583 1.16805471760364 0.249642461553748 \n", "227849 -0.550678353341949 -0.429004102182237 -1.29189255347072 \n", "\n", " v4 v5 v6 \\\n", "227845 0.604710739174721 -0.8501359998436 -0.736318677031096 \n", "227846 0.0917658435583919 0.537612206488446 -1.36810250972644 \n", "227847 -0.729887055238227 -1.5287503399573 -1.22567909778369 \n", "227848 -1.26849748925032 0.785922573014156 -0.663958562166729 \n", "227849 -0.414409226593379 -0.292228538671312 0.071842939235058 \n", "\n", " v7 v8 v9 \\\n", "227845 -0.524057962475328 -0.0886141066361987 1.09112510472248 \n", "227846 0.613326738349479 -0.455251954849699 0.29181359004335 \n", "227847 -0.893353679497868 1.62252199369554 1.29199841774415 \n", "227848 0.859432973616895 0.0681106263347446 -0.144183044927318 \n", "227849 2.42606795091335 -0.212729758223082 0.412374372851086 \n", "\n", " v10 v11 v12 \\\n", "227845 0.093484357816225 -0.892304625856107 0.0272205159068718 \n", "227846 0.253161344559488 -1.50188197076942 -0.870607641524177 \n", "227847 -0.0409558359937061 -0.971425287697512 0.574743695630458 \n", "227848 0.0432880841287975 0.542013736060061 1.00202450469061 \n", "227849 -1.93996940549555 -1.81011838293809 -1.22351031687552 \n", "\n", " v13 v14 v15 \\\n", "227845 -0.243790209618721 0.0317740067189187 0.900623897113791 \n", "227846 -1.44173756499372 0.988756626201074 0.496349234837293 \n", "227847 0.155656078919204 -0.729054997889385 0.477438947999659 \n", "227848 0.400759595743433 0.136412487776037 -1.28964902448879 \n", "227849 -1.32491464932768 -1.46239178995552 -0.31164055759838 \n", "\n", " v16 v17 v18 \\\n", "227845 0.536032161644219 -0.648408094097169 0.183072340001028 \n", "227846 -0.0686989613348823 -0.454073497932566 -0.299095262736551 \n", 
"227847 1.06171851569252 0.93469475367536 0.403768792198479 \n", "227848 0.276827961550432 -0.868491702025561 -0.366839507131127 \n", "227849 0.506707760378257 0.739932584638577 0.892422017204659 \n", "\n", " v19 v20 v21 \\\n", "227845 -0.48632249422331 -0.13957876335222 0.210958428878652 \n", "227846 0.267443131415241 -0.275777914750361 0.0171533555339963 \n", "227847 -0.494929851777981 -0.0810925858921718 -0.392556502541116 \n", "227848 -0.187391599008302 -0.0335233340620367 -0.247543775399679 \n", "227849 0.195042529037103 0.791126747715284 0.00303193944814891 \n", "\n", " v22 v23 v24 \\\n", "227845 0.639337879054097 0.147522551988298 0.0736542664022496 \n", "227846 0.0632416225359206 -0.0345611249491173 -0.626866212626912 \n", "227847 -0.78759906251576 0.343467795972994 -0.0903313999840935 \n", "227848 -0.592536769878023 -0.286693549546811 -0.378855664973759 \n", "227849 -0.645782978858753 0.877016475964068 -1.22852893747944 \n", "\n", " v25 v26 v27 \\\n", "227845 -0.318378246601246 0.350612262707235 -0.0238434747433154 \n", "227846 0.249213129413917 0.773930519516097 -0.137114784582898 \n", "227847 0.248286972151669 -0.238523845342424 0.26648354183946 \n", "227848 -0.0774289041638705 0.0676084004301294 -0.27896200360197 \n", "227849 -0.0362812174160739 -0.110609895882901 -0.0983803135271981 \n", "\n", " v28 amount EVENT_TIMESTAMP \n", "227845 -0.0371393315055126 50 2021-12-10T20:48:00Z \n", "227846 -0.0906106088420727 14.95 2021-12-10T20:49:00Z \n", "227847 -0.0622361634691654 7.7 2021-12-10T20:49:00Z \n", "227848 -0.0641926690992577 6.99 2021-12-10T20:49:00Z \n", "227849 0.0959849443846813 460.71 2021-12-10T20:50:00Z " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "(56962, 30)\n", "Test scores\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EVENT_LABEL
2278450
2278460
2278470
2278480
2278490
\n", "
" ], "text/plain": [ " EVENT_LABEL\n", "227845 0\n", "227846 0\n", "227847 0\n", "227848 0\n", "227849 0" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0 56887\n", "1 75\n", "Name: EVENT_LABEL, dtype: int64\n", "0 0.99817\n", "1 0.00183\n", "Name: EVENT_LABEL, dtype: float64\n", "========= \n", "\n" ] } ], "source": [ "for key in all_keys:\n", " obj = FraudDatasetBenchmark(key=key, \n", " add_random_values_if_real_na = { \"EVENT_TIMESTAMP\": False, \n", " \"LABEL_TIMESTAMP\": False,\n", " \"ENTITY_ID\": False,\n", " \"ENTITY_TYPE\": False,\n", " \"EVENT_ID\": False})\n", " print(obj.key)\n", " print('Train set: ')\n", " display(obj.train.head())\n", " print(len(obj.train.columns))\n", " print(obj.train.shape)\n", " print('Test set: ')\n", " display(obj.test.head())\n", " print(obj.test.shape)\n", " print('Test scores')\n", " display(obj.test_labels.head())\n", " print(obj.test_labels['EVENT_LABEL'].value_counts())\n", " print(obj.train['EVENT_LABEL'].value_counts(normalize=True))\n", " print('=========','\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Persisting downloaded data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Important**: If you are running multiple experiments, download from Kaggle multiple times might exceed account level API call limits. 
Persisting the downloaded dataset is therefore recommended in such scenarios." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### First, download the data without deleting it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameter settings\n", "\n", "- load_pre_downloaded: False\n", "- delete_downloaded: False\n", "- add_random_values_if_real_na = ```\n", "{\n", "\"EVENT_TIMESTAMP\": False,\n", "\"LABEL_TIMESTAMP\": False,\n", "\"ENTITY_ID\": False,\n", "\"ENTITY_TYPE\": True,\n", "\"EVENT_ID\": True\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "all_keys = ['twitterbot']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data download location /Users/pringrov/Documents/git/fraud-dataset-benchmark/scripts/examples/tmp\n", "twitterbot\n", "Train set: \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
unnamed: 0created_atdefault_profiledefault_profile_imagedescriptionfavourites_countfollowers_countfriends_countgeo_enabledEVENT_IDlanglocationprofile_background_image_urlprofile_image_urlscreen_namestatuses_countverifiedaverage_tweets_per_dayaccount_age_daysEVENT_LABELENTITY_TYPE
20963209632013-05-27 21:22:15TrueFalseWHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery.3237423952823True1463172686enMount Morris, MIhttp://abs.twimg.com/images/themes/theme1/bg.pnghttp://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpgblerdwords11448False4.33626400user
633163312009-09-14 18:58:36FalseFalseComedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo686643507891528False74231747enLos Angeles - Always a Texanhttp://abs.twimg.com/images/themes/theme9/bg.gifhttp://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpgJennyJohnsonHi518732True4.69439910user
17209172092010-06-06 16:27:08TrueFalseNaN7454657False152688783NaNAbu Dhabihttp://abs.twimg.com/images/themes/theme1/bg.pnghttp://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.pngAbnerJosh161False0.04337261user
23964239642010-06-22 21:56:09FalseFalseInformation and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom.151755881991True158502985enRegina, SK Canadahttp://abs.twimg.com/images/themes/theme1/bg.pnghttp://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpgGlobalRegina103379True27.86537100user
30569305692009-03-10 02:26:45FalseFalseDetritus26161118405657False23544268nounknownhttp://abs.twimg.com/images/themes/theme11/bg.gifhttp://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpgOfficialKat4980True1.19141800user
\n", "
" ], "text/plain": [ " unnamed: 0 created_at default_profile default_profile_image \\\n", "20963 20963 2013-05-27 21:22:15 True False \n", "6331 6331 2009-09-14 18:58:36 False False \n", "17209 17209 2010-06-06 16:27:08 True False \n", "23964 23964 2010-06-22 21:56:09 False False \n", "30569 30569 2009-03-10 02:26:45 False False \n", "\n", " description \\\n", "20963 WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery. \n", "6331 Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo \n", "17209 NaN \n", "23964 Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom. \n", "30569 Detritus \n", "\n", " favourites_count followers_count friends_count geo_enabled EVENT_ID \\\n", "20963 32374 2395 2823 True 1463172686 \n", "6331 68664 350789 1528 False 74231747 \n", "17209 74 54 657 False 152688783 \n", "23964 1517 55881 991 True 158502985 \n", "30569 2616 1118405 657 False 23544268 \n", "\n", " lang location \\\n", "20963 en Mount Morris, MI \n", "6331 en Los Angeles - Always a Texan \n", "17209 NaN Abu Dhabi \n", "23964 en Regina, SK Canada \n", "30569 no unknown \n", "\n", " profile_background_image_url \\\n", "20963 http://abs.twimg.com/images/themes/theme1/bg.png \n", "6331 http://abs.twimg.com/images/themes/theme9/bg.gif \n", "17209 http://abs.twimg.com/images/themes/theme1/bg.png \n", "23964 http://abs.twimg.com/images/themes/theme1/bg.png \n", "30569 http://abs.twimg.com/images/themes/theme11/bg.gif \n", "\n", " profile_image_url \\\n", "20963 http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg \n", "6331 http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg \n", "17209 
http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png \n", "23964 http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg \n", "30569 http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg \n", "\n", " screen_name statuses_count verified average_tweets_per_day \\\n", "20963 blerdwords 11448 False 4.336 \n", "6331 JennyJohnsonHi5 18732 True 4.694 \n", "17209 AbnerJosh 161 False 0.043 \n", "23964 GlobalRegina 103379 True 27.865 \n", "30569 OfficialKat 4980 True 1.191 \n", "\n", " account_age_days EVENT_LABEL ENTITY_TYPE \n", "20963 2640 0 user \n", "6331 3991 0 user \n", "17209 3726 1 user \n", "23964 3710 0 user \n", "30569 4180 0 user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "21\n", "(29950, 21)\n", "========= \n", "\n" ] } ], "source": [ "for key in all_keys:\n", " obj = FraudDatasetBenchmark(key=key, \n", " delete_downloaded=False,\n", " add_random_values_if_real_na = { \"EVENT_TIMESTAMP\": False, \"LABEL_TIMESTAMP\": False, \"ENTITY_ID\": False, \"ENTITY_TYPE\": True })\n", " print(obj.key)\n", " print('Train set: ')\n", " display(obj.train.head())\n", " print(len(obj.train.columns))\n", " print(obj.train.shape)\n", " print('=========','\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now load from previously downloaded data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameter settings\n", "\n", "- load_pre_downloaded: True\n", "- delete_downloaded: False\n", "- add_random_values_if_real_na = ```\n", "{\n", "\"EVENT_TIMESTAMP\": False,\n", "\"LABEL_TIMESTAMP\": False,\n", "\"ENTITY_ID\": False,\n", "\"ENTITY_TYPE\": True,\n", "\"EVENT_ID\": True\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "all_keys = ['twitterbot']" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "twitterbot\n", "Train set: \n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " unnamed: 0 created_at default_profile default_profile_image \\\n", "20963 20963 2013-05-27 21:22:15 True False \n", "6331 6331 2009-09-14 18:58:36 False False \n", "17209 17209 2010-06-06 16:27:08 True False \n", "23964 23964 2010-06-22 21:56:09 False False \n", "30569 30569 2009-03-10 02:26:45 False False \n", "\n", " description \\\n", "20963 WHO is Kyle Tyrone Ferguson? Father, Husband, Friend, Nerd, Teacher, Black man... Into Superheroes, MTG, Deck building games, tv, movies, and other geekery. \n", "6331 Comedian, writer, former TV news producer, drunk historian, Twitter Queen, asshole and owner of Dewey. @doinitpodcast host Bookings: smark@wmeagency.com @Cameo \n", "17209 NaN \n", "23964 Information and updates from southern Saskatchewan from the hardworking journalists and photographers who make up the Global Regina newsroom. \n", "30569 Detritus \n", "\n", " favourites_count followers_count friends_count geo_enabled EVENT_ID \\\n", "20963 32374 2395 2823 True 1463172686 \n", "6331 68664 350789 1528 False 74231747 \n", "17209 74 54 657 False 152688783 \n", "23964 1517 55881 991 True 158502985 \n", "30569 2616 1118405 657 False 23544268 \n", "\n", " lang location \\\n", "20963 en Mount Morris, MI \n", "6331 en Los Angeles - Always a Texan \n", "17209 NaN Abu Dhabi \n", "23964 en Regina, SK Canada \n", "30569 no unknown \n", "\n", " profile_background_image_url \\\n", "20963 http://abs.twimg.com/images/themes/theme1/bg.png \n", "6331 http://abs.twimg.com/images/themes/theme9/bg.gif \n", "17209 http://abs.twimg.com/images/themes/theme1/bg.png \n", "23964 http://abs.twimg.com/images/themes/theme1/bg.png \n", "30569 http://abs.twimg.com/images/themes/theme11/bg.gif \n", "\n", " profile_image_url \\\n", "20963 http://pbs.twimg.com/profile_images/1022468674169569280/7tpDfAwa_normal.jpg \n", "6331 http://pbs.twimg.com/profile_images/772455794243571712/bGBBHx0N_normal.jpg \n", "17209 
http://pbs.twimg.com/profile_images/968278541/For-Twitter_normal.png \n", "23964 http://pbs.twimg.com/profile_images/722495430097903616/dKLfuc1-_normal.jpg \n", "30569 http://pbs.twimg.com/profile_images/1281416121212493826/HVjvkjRz_normal.jpg \n", "\n", " screen_name statuses_count verified average_tweets_per_day \\\n", "20963 blerdwords 11448 False 4.336 \n", "6331 JennyJohnsonHi5 18732 True 4.694 \n", "17209 AbnerJosh 161 False 0.043 \n", "23964 GlobalRegina 103379 True 27.865 \n", "30569 OfficialKat 4980 True 1.191 \n", "\n", " account_age_days EVENT_LABEL ENTITY_TYPE \n", "20963 2640 0 user \n", "6331 3991 0 user \n", "17209 3726 1 user \n", "23964 3710 0 user \n", "30569 4180 0 user " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "21\n", "(29950, 21)\n", "========= \n", "\n" ] } ], "source": [ "for key in all_keys:\n", " obj = FraudDatasetBenchmark(key=key, \n", " load_pre_downloaded=True,\n", " delete_downloaded=False,\n", " add_random_values_if_real_na = { \"EVENT_TIMESTAMP\": False, \n", " \"LABEL_TIMESTAMP\": False,\n", " \"ENTITY_ID\": False,\n", " \"ENTITY_TYPE\": True,\n", " \"EVENT_ID\": True})\n", " print(obj.key)\n", " print('Train set: ')\n", " display(obj.train.head())\n", " print(len(obj.train.columns))\n", " print(obj.train.shape)\n", " print('=========','\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# End" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: scripts/reproducibility/afd/README.md ================================================ ## Steps to reproduce AFD models 
Amazon Fraud Detector (AFD) models can be run either through the AWS Console or through API calls. In this folder, we provide scripts that make API calls to create model artifacts and then score the model on test data. The high-level steps to train and deploy a model are:

![afd steps](../../../images/afd_steps.png)

You can use the provided scripts to replicate the performance shown in the benchmark.

1. Set up AWS credentials in your terminal for the AWS account where you want to run AFD and store the data. You can use environment variables as described [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html).
2. Use the [template data-loader notebook](../../examples/Test_FDB_Loader.ipynb) to upload the benchmark data to S3. (AFD requires the data to be stored in S3 and requires an S3 path.)
3. Create AFD resources, including entities, event types, and the model. Update the values of `IAM_ROLE`, `BUCKET`, `KEY` and `MODEL_NAME` in `create_afd_resources.py`, then run the following:
    ```
    python create_afd_resources.py configs/{dataset-you-want-to-use}
    ```
    You can set `MODEL_TYPE` to **ONLINE_FRAUD_INSIGHTS** or **TRANSACTION_FRAUD_INSIGHTS** to run the corresponding model. This initiates automatic model training. Wait about 1 hour for the model to train; you can check its status in your console.
4. Create a detector and use it to score the test data. Update the values of `IAM_ROLE`, `BUCKET`, `TEST_PATH`, `TEST_LABELS_PATH` and `MODEL_NAME` in `score_afd_model.py`, then run the following:
    ```
    python score_afd_model.py
    ```
    This prints performance metrics in the terminal and also saves them to the S3 location you provide in the script.

After model training completes, the AFD console shows performance metrics like the following (trained on `ieeecis` with ONLINE_FRAUD_INSIGHTS).
![ieee ofi sample](../../../images/ieee_ofi_sample.png)

**For a deeper dive into how Amazon Fraud Detector works, see the [technical guide](https://d1.awsstatic.com/fraud-detector/afd-technical-guide-detecting-new-account-fraud.pdf).**

================================================ FILE: scripts/reproducibility/afd/configs/CreditCardFraudDetection.json ================================================
{ "dataset": "Credit Card Fraud Detection", "variable_mappings": [ { "variable_name": "v1", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v2", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v3", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v4", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v5", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v6", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v7", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v8", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v9", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v10", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v11", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v12", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v13", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v14", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v15", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v16", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v17", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v18", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v19", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v20", 
"variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v21", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v22", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v23", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v24", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v25", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v26", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v27", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v28", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "amount", "variable_type": "NUMERIC", "data_type": "FLOAT" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/FakeJobPostingPrediction.json ================================================ { "dataset": "Fake Job Posting Prediction", "variable_mappings": [ { "variable_name": "title", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "location", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "department", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "salary_range", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "company_profile", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "description", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "requirements", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "benefits", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "telecommuting", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "has_company_logo", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": 
"has_questions", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "employment_type", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "required_experience", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "required_education", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "industry", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "function", "variable_type": "CATEGORICAL", "data_type": "STRING" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/Fraudecommerce.json ================================================ { "dataset": "Fraud ecommerce", "variable_mappings": [ { "variable_name": "purchase_value", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "source", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "browser", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "age", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "ip_address", "variable_type": "IP_ADDRESS", "data_type": "FLOAT" }, { "variable_name": "time_since_signup", "variable_type": "NUMERIC", "data_type": "FLOAT" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/IEEECISFraudDetection.json ================================================ { "dataset": "IEEE-CIS Fraud Detection", "variable_mappings": [ { "variable_name": "transactionamt", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "productcd", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "card1", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "card2", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "card3", 
"variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "card5", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "card6", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "addr1", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "dist1", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "p_emaildomain", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "r_emaildomain", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "c1", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c2", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c4", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c5", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c6", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c7", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c8", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c9", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c10", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c11", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c12", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c13", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "c14", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v62", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v70", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v76", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v78", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v82", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v91", 
"variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v127", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v130", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v139", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v160", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v165", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v187", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v203", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v207", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v209", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v210", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v221", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v234", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v257", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v258", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v261", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v264", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v266", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v267", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v271", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v274", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v277", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v283", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v285", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v289", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v291", "variable_type": 
"NUMERIC", "data_type": "FLOAT" }, { "variable_name": "v294", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_01", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_02", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_05", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_06", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_09", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_13", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_17", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_19", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "id_20", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "devicetype", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "deviceinfo", "variable_type": "CATEGORICAL", "data_type": "STRING" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/IPBlocklist.json ================================================ { "dataset": "IP-BlockList", "variable_mappings": [ { "variable_name": "ip", "variable_type": "IP_ADDRESS", "data_type": "STRING" }, { "variable_name": "dummy_cat", "variable_type": "CATEGORICAL", "data_type": "STRING" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/MaliciousURL.json ================================================ { "dataset": "Malicious URLs Dataset", "variable_mappings": [ { "variable_name": "url", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "dummy_cat", "variable_type": "CATEGORICAL", "data_type": "STRING" } ], "label_mappings": { "FRAUD": [ "malignant" ], "LEGIT": [ "benign" ] } } 
================================================ FILE: scripts/reproducibility/afd/configs/SimulatedCreditCardTransactionsSparkov.json ================================================ { "dataset": "Simulated Credit Card Transactions generated using Sparkov", "variable_mappings": [ { "variable_name": "cc_num", "variable_type": "CARD_BIN", "data_type": "INTEGER" }, { "variable_name": "category", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "amt", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "first", "variable_type": "BILLING_NAME", "data_type": "STRING" }, { "variable_name": "last", "variable_type": "BILLING_NAME", "data_type": "STRING" }, { "variable_name": "gender", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "street", "variable_type": "BILLING_ADDRESS_L1", "data_type": "STRING" }, { "variable_name": "city", "variable_type": "BILLING_CITY", "data_type": "STRING" }, { "variable_name": "state", "variable_type": "BILLING_STATE", "data_type": "STRING" }, { "variable_name": "zip", "variable_type": "BILLING_ZIP", "data_type": "STRING" }, { "variable_name": "lat", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "long", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "city_pop", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "job", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "dob", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "merch_lat", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "merch_long", "variable_type": "NUMERIC", "data_type": "FLOAT" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/TwitterBotAccounts.json ================================================ { "dataset": "Twitter Bots Accounts", "variable_mappings": 
[ { "variable_name": "default_profile", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "default_profile_image", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "description", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "favourites_count", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "followers_count", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "friends_count", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "geo_enabled", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "lang", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "location", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "profile_background_image_url", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "profile_image_url", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "screen_name", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "statuses_count", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "verified", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "average_tweets_per_day", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "account_age_days", "variable_type": "NUMERIC", "data_type": "FLOAT" } ], "label_mappings": { "FRAUD": [ "bot" ], "LEGIT": [ "human" ] } } ================================================ FILE: scripts/reproducibility/afd/configs/VehicleLoanDefaultPrediction.json ================================================ { "dataset": "Vehicle Loan Default Prediction", "variable_mappings": [ { "variable_name": "disbursed_amount", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "asset_cost", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { 
"variable_name": "ltv", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "branch_id", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "supplier_id", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "manufacturer_id", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "current_pincode_id", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "date_of_birth", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "employment_type", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "state_id", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "employee_code_id", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "mobileno_avl_flag", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "aadhar_flag", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "pan_flag", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "voterid_flag", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "driving_flag", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "passport_flag", "variable_type": "CATEGORICAL", "data_type": "STRING" }, { "variable_name": "perform_cns_score", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "perform_cns_score_description", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "pri_no_of_accts", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "pri_active_accts", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "pri_overdue_accts", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "pri_current_balance", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "pri_sanctioned_amount", 
"variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "pri_disbursed_amount", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_no_of_accts", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_active_accts", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_overdue_accts", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_current_balance", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_sanctioned_amount", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_disbursed_amount", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "primary_instal_amt", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "sec_instal_amt", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "new_accts_in_last_six_months", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "delinquent_accts_in_last_six_months", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "average_acct_age", "variable_type": "FREE_FORM_TEXT", "data_type": "STRING" }, { "variable_name": "credit_history_length", "variable_type": "NUMERIC", "data_type": "FLOAT" }, { "variable_name": "no_of_inquiries", "variable_type": "NUMERIC", "data_type": "FLOAT" } ], "label_mappings": { "FRAUD": [ "1" ], "LEGIT": [ "0" ] } } ================================================ FILE: scripts/reproducibility/afd/create_afd_resources.py ================================================ # TO BE UPDATED BY USER IAM_ROLE = "" BUCKET = "" KEY = "" MODEL_NAME = "" # lower case alphanumeric only, only _ allowed as delimiter MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS" # or TRANSACTION_FRAUD_INSIGHTS import os import time import json import boto3 import click import string import random import logging import pandas as pd MODEL_DESC = "Benchmarking model" EVENT_DESC = "Event for benchmarking 
model" ENTITY_TYPE = "user" # this is provided in the dummy data. Will need to change if using different data ENTITY_DESC = "Entity for benchmarking model" BATCH_PREDICTION_JOB = DETECTOR_NAME = EVENT_TYPE = MODEL_NAME # Others are kept same as model name # boto3 connections client = boto3.client('frauddetector') s3 = boto3.client('s3') @click.command() @click.argument("config", type=click.Path(exists=True)) def afd_train_model_demo(config): ############################################# ##### Setup ##### with open(config, "r") as f: config_file = json.load(f) EVENT_VARIABLES = [variable["variable_name"] for variable in config_file["variable_mappings"]] EVENT_LABELS = [v for k,v in config_file["label_mappings"].items()] EVENT_LABELS = [item for sublist in EVENT_LABELS for item in sublist] # flattening list of lists # Variable mappings of demo data in this use case. Important to teach this to customer click.echo(f'{pd.DataFrame(config_file["variable_mappings"])}') click.echo(f'{pd.DataFrame(config_file["label_mappings"])}') S3_DATA_PATH = "s3://" + os.path.join(BUCKET, KEY) ############################################# ##### Create event variables and labels ##### # -- create variable -- for variable in config_file["variable_mappings"]: DEFAULT_VALUE = '0.0' if variable["data_type"] == "FLOAT" else '' try: resp = client.get_variables(name = variable["variable_name"]) click.echo("{0} exists, data type: {1}".format(variable["variable_name"], resp['variables'][0]['dataType'])) except: click.echo("Creating variable: {0}".format(variable["variable_name"])) resp = client.create_variable( name = variable["variable_name"], dataType = variable["data_type"], dataSource ='EVENT', defaultValue = DEFAULT_VALUE, description = variable["variable_name"], variableType = variable["variable_type"]) # Putting FRAUD for f in config_file["label_mappings"]["FRAUD"]: response = client.put_label( name = f, description = "FRAUD") # Putting LEGIT for f in 
config_file["label_mappings"]["LEGIT"]: response = client.put_label( name = f, description = "LEGIT") ############################################# ##### Define Entity and Event Types ##### # -- create entity type -- try: response = client.get_entity_types(name = ENTITY_TYPE) click.echo("-- entity type exists --") click.echo(response) except: response = client.put_entity_type( name = ENTITY_TYPE, description = ENTITY_DESC ) click.echo("-- create entity type --") click.echo(response) # -- create event type -- try: response = client.get_event_types(name = EVENT_TYPE) click.echo("\n-- event type exists --") click.echo(response) except: response = client.put_event_type ( name = EVENT_TYPE, eventVariables = EVENT_VARIABLES, labels = EVENT_LABELS, entityTypes = [ENTITY_TYPE]) click.echo("\n-- create event type --") click.echo(response) ############################################# ##### Batch import training file for TFI ##### if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS": try: response = client.create_batch_import_job( jobId = BATCH_PREDICTION_JOB, inputPath = S3_DATA_PATH, outputPath = "s3://" + BUCKET, eventTypeName = EVENT_TYPE, iamRoleArn = IAM_ROLE ) except Exception: pass # -- wait until batch import is finished -- print("--- waiting until batch import is finished ") stime = time.time() while True: response = client.get_batch_import_jobs(jobId=BATCH_PREDICTION_JOB) if 'IN_PROGRESS' in response['batchImports'][0]['status']: print(f"current progress: {(time.time() - stime)/60:{3}.{3}} minutes") time.sleep(60) # sleep for 1 minute else: print("Batch Impoort status : " + response['batchImports'][0]['status']) break etime = time.time() print(f"Elapsed time: {(etime - stime)/60:{3}.{3}} minutes \n" ) print(response) ############################################# ##### Create and train your model ##### try: response = client.create_model( description = MODEL_DESC, eventTypeName = EVENT_TYPE, modelId = MODEL_NAME, modelType = MODEL_TYPE) click.echo("-- initalize model --") 
        click.echo(response)
    except Exception:
        pass

    # -- initialized the model, it's now ready to train --

    # -- first define training_data_schema for model to use --
    if MODEL_TYPE == "TRANSACTION_FRAUD_INSIGHTS":
        training_data_schema = {
            'modelVariables' : EVENT_VARIABLES,
            'labelSchema' : {
                'labelMapper' : config_file["label_mappings"],
                'unlabeledEventsTreatment': 'IGNORE'
            }
        }
        response = client.create_model_version(
            modelId = MODEL_NAME,
            modelType = MODEL_TYPE,
            trainingDataSource = 'INGESTED_EVENTS',
            trainingDataSchema = training_data_schema,
            ingestedEventsDetail = {  # This needs to be changed
                'ingestedEventsTimeWindow': {
                    'startTime': '2020-12-10T00:00:00Z',  # '2021-08-28T00:00:00Z',
                    'endTime': '2022-06-07T00:00:00Z'  # '2022-05-10T00:00:00Z'
                }
            }
        )
    else:
        training_data_schema = {
            'modelVariables' : EVENT_VARIABLES,
            'labelSchema' : {
                'labelMapper' : config_file["label_mappings"]
            }
        }
        response = client.create_model_version(
            modelId = MODEL_NAME,
            modelType = MODEL_TYPE,
            trainingDataSource = 'EXTERNAL_EVENTS',
            trainingDataSchema = training_data_schema,
            externalEventsDetail = {
                'dataLocation' : S3_DATA_PATH,
                'dataAccessRoleArn': IAM_ROLE
            }
        )

    model_version = response['modelVersionNumber']
    click.echo("-- model training --")
    click.echo(response)


if __name__=="__main__":
    afd_train_model_demo()


================================================
FILE: scripts/reproducibility/afd/score_afd_model.py
================================================
# TO BE UPDATED BY USER
IAM_ROLE = ""
BUCKET = ""
TEST_PATH = ""
TEST_LABELS_PATH = ""
MODEL_NAME = ""  # lower case alphanumeric only, only _ allowed as delimiter
MODEL_TYPE = "ONLINE_FRAUD_INSIGHTS"  # or TRANSACTION_FRAUD_INSIGHTS

import os
import ast
import time
import json
import boto3
import click
import string
import random
import logging
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc

# boto3 connections
client = boto3.client('frauddetector')
s3 = boto3.client('s3')

BATCH_PREDICTION_JOB = DETECTOR_NAME = 
EVENT_TYPE = MODEL_NAME
model_version = '1.0'
DETECTOR_DESC = "Benchmarking detector"


def create_outcomes(outcomes):
    """ Create Fraud Detector Outcomes
    """
    for outcome in outcomes:
        print("creating outcome variable: {0} ".format(outcome))
        response = client.put_outcome(name = outcome, description = outcome)


def create_rules(score_cuts, outcomes):
    """ Creating rules

    Arguments:
        score_cuts - list of score cuts to create rules
        outcomes   - list of outcomes associated with the rules

    Returns:
        a rule list to be used when creating the detector
    """
    if len(score_cuts)+1 != len(outcomes):
        logging.error('Your score cuts and outcomes are not matched.')

    rule_list = []
    for i in range(len(outcomes)):
        # rule expression
        if i < (len(outcomes)-1):
            rule = "${0}_insightscore > {1}".format(MODEL_NAME, score_cuts[i])
        else:
            rule = "${0}_insightscore <= {1}".format(MODEL_NAME, score_cuts[i-1])

        # append to rule_list (used when creating the detector)
        rule_id = "rules{0}_{1}".format(i, MODEL_NAME[:9])
        rule_list.append({
            "ruleId": rule_id,
            "ruleVersion" : '1',
            "detectorId" : DETECTOR_NAME
        })

        # create rules
        print("creating rule: {0}: IF {1} THEN {2}".format(rule_id, rule, outcomes[i]))
        try:
            response = client.create_rule(
                ruleId = rule_id,
                detectorId = DETECTOR_NAME,
                expression = rule,
                language = 'DETECTORPL',
                outcomes = [outcomes[i]]
            )
        except Exception:
            print("this rule already exists in this detector")

    return rule_list


def ast_with_nan(x):
    # parse a Python literal from a string; fall back to NaN if it cannot be parsed
    try:
        return ast.literal_eval(x)
    except Exception:
        return np.nan


def afd_train_model_demo():

    # -- activate the model version --
    try:
        response = client.update_model_version_status(
            modelId = MODEL_NAME,
            modelType = MODEL_TYPE,
            modelVersionNumber = model_version,
            status = 'ACTIVE'
        )
        print("-- activating model --")
        print(response)
    except Exception:
        print("First train the model")

    # -- wait until model is active --
    print("--- waiting until model status is active ")
    stime = time.time()
    while True:
        response = client.get_model_version(modelId=MODEL_NAME, modelType = MODEL_TYPE, modelVersionNumber = 
model_version)
        if response['status'] != 'ACTIVE':
            print(response['status'])
            print(f"current progress: {(time.time() - stime)/60:3.3} minutes")
            time.sleep(60)  # sleep for 1 minute
        if response['status'] == 'ACTIVE':
            print("Model status : " + response['status'])
            break

    etime = time.time()
    print("Elapsed time : %s" % (etime - stime) + " seconds \n")
    print(response)

    # -- put detector, initializes your detector --
    response = client.put_detector(
        detectorId = DETECTOR_NAME,
        description = DETECTOR_DESC,
        eventTypeName = EVENT_TYPE
    )

    # -- decide what threshold and corresponding outcome you want to add --
    # here, we create two simple rules by cutting the score at [750], with two outcomes ['fraud', 'approve']
    # it will create 2 rules:
    #    score > 750: fraud
    #    score <= 750: approve
    score_cuts = [750]  # recommended to fine tune this based on your business use case
    outcomes = ['fraud', 'approve']  # recommended to define this based on your business use case

    # -- create outcomes --
    print(" -- create outcomes --")
    create_outcomes(outcomes)

    # -- create rules --
    print(" -- create rules --")
    rule_list = create_rules(score_cuts, outcomes)

    # -- create detector version --
    # keep the response so the print below reports this call, not put_detector
    response = client.create_detector_version(
        detectorId = DETECTOR_NAME,
        rules = rule_list,
        modelVersions = [{"modelId": MODEL_NAME,
                          "modelType": MODEL_TYPE,
                          "modelVersionNumber": model_version}],
        # there are 2 options for ruleExecutionMode:
        #   'ALL_MATCHED' - return all matched rules' outcome
        #   'FIRST_MATCHED' - return first matched rule's outcome
        ruleExecutionMode = 'FIRST_MATCHED'
    )
    print("\n -- detector created -- ")
    print(response)

    response = client.update_detector_version_status(
        detectorId = DETECTOR_NAME,
        detectorVersionId = '1',
        status = 'ACTIVE'
    )
    print("\n -- detector activated -- ")
    print(response)

    # -- wait until detector is active --
    print("\n --- waiting until detector status is active ")
    stime = time.time()
    while True:
        response = client.describe_detector(
            detectorId = DETECTOR_NAME,
        )
        if 
response['detectorVersionSummaries'][0]['status'] != 'ACTIVE':
            print(response['detectorVersionSummaries'][0]['status'])
            print(f"current progress: {(time.time() - stime)/60:3.3} minutes")
            time.sleep(60)
        if response['detectorVersionSummaries'][0]['status'] == 'ACTIVE':
            break

    etime = time.time()
    print("Elapsed time : %s" % (etime - stime) + " seconds \n")
    print(response)

    # -- create detector evaluation --
    try:
        client.create_batch_prediction_job(
            jobId = BATCH_PREDICTION_JOB,
            inputPath = os.path.join('s3://', BUCKET, TEST_PATH),
            outputPath = os.path.join('s3://', BUCKET),
            eventTypeName = EVENT_TYPE,
            detectorName = DETECTOR_NAME,
            detectorVersion = '1',
            iamRoleArn = IAM_ROLE)
    except Exception as e:
        print(e)
        print("batch prediction job may already exist")

    # -- wait until batch prediction job is completed --
    print("\n --- waiting until batch prediction job is completed ")
    stime = time.time()
    while True:
        response = client.get_batch_prediction_jobs(jobId=BATCH_PREDICTION_JOB)
        response = response['batchPredictions'][0]
        if (response['status'] != 'COMPLETE') and (response['status'] != 'FAILED'):
            print(f"current progress: {(time.time() - stime)/60:3.3} minutes")
            time.sleep(60)
        if response['status'] == 'COMPLETE':
            break
        if response['status'] == 'FAILED':
            # stop here; otherwise a failed job would be polled forever without sleeping
            raise RuntimeError("batch prediction job failed")

    etime = time.time()
    print("Elapsed time : %s" % (etime - stime) + " seconds \n")
    print(response)

    # -- get batch prediction job result --
    contents = s3.list_objects_v2(Bucket=BUCKET, Prefix=TEST_PATH)['Contents']
    print(contents)
    S3_SCORE_PATH = sorted([c['Key'] for c in contents if c['Key'].endswith('output.csv')])[-1]
    print(S3_SCORE_PATH)

    # -- get test performance --
    # Predictions
    print(os.path.join('s3://', BUCKET, S3_SCORE_PATH))
    predictions = pd.read_csv(os.path.join('s3://', BUCKET, S3_SCORE_PATH))
    # drop rows without scores, then copy so the new column is added to an independent frame
    predictions = predictions[~predictions.MODEL_SCORES.isna()].copy()
    predictions['scores'] = predictions['MODEL_SCORES'].\
        apply(lambda x: ast_with_nan(x)).\
        apply(lambda x: x.get(MODEL_NAME))

    # Labels
    labels = pd.read_csv(os.path.join('s3://', 
BUCKET, TEST_LABELS_PATH))
    # labels['EVENT_LABEL'] = labels['EVENT_LABEL'].map({'benign': 0, 'malignant': 1})

    predictions = predictions.merge(labels, on='EVENT_ID', how='left')
    print('Test size: ', predictions.shape)
    fpr, tpr, threshold = roc_curve(predictions['EVENT_LABEL'], predictions['scores'])
    test_auc = auc(fpr, tpr)
    print('AUC: ', test_auc)

    test_metrics = {}
    test_metrics['auc'] = test_auc
    test_metrics['fpr'] = list(fpr)
    test_metrics['tpr'] = list(tpr)
    test_metrics['threshold'] = list(threshold)

    # -- put test metrics in s3 --
    s3.put_object(
        Body=json.dumps(test_metrics),
        Bucket=BUCKET,
        Key='test_metrics.json')
    print("\n -- test metrics saved -- ")


if __name__ == "__main__":
    afd_train_model_demo()


================================================
FILE: scripts/reproducibility/autogluon/README.md
================================================
- benchmark_ag.py: a script for autogluon benchmarking
- example-ag-ieeecis.ipynb: an example notebook using benchmark_ag.py

Note that autogluon is not perfectly reproducible: some underlying models are not deterministically seeded, so you might see slightly different results than in the paper.
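
For orientation, benchmark_ag.py reports two test metrics: AUC and TPR at 1% FPR. The sketch below illustrates how a TPR value at a target FPR can be read off ROC points by linear interpolation, using only the standard library. `tpr_at_fpr` is a hypothetical helper written for this note; it is not the repository's `get_recall` from benchmark_utils.py, whose implementation may differ.

```python
from bisect import bisect_left


def tpr_at_fpr(fpr, tpr, fpr_target=0.01):
    """Interpolate TPR at fpr_target from ROC points sorted by increasing FPR.

    Hypothetical helper for illustration only (not benchmark_utils.get_recall).
    """
    if len(fpr) != len(tpr) or not fpr:
        raise ValueError("fpr and tpr must be non-empty and of equal length")
    # Clamp to the ends of the curve when the target lies outside it.
    if fpr_target <= fpr[0]:
        return tpr[0]
    if fpr_target >= fpr[-1]:
        return tpr[-1]
    # First index whose FPR is >= target; the point before it is strictly below.
    i = bisect_left(fpr, fpr_target)
    x0, x1 = fpr[i - 1], fpr[i]
    y0, y1 = tpr[i - 1], tpr[i]
    return y0 + (y1 - y0) * (fpr_target - x0) / (x1 - x0)


# Toy ROC curve with three points, not real benchmark output.
example = tpr_at_fpr([0.0, 0.02, 1.0], [0.0, 0.5, 1.0], fpr_target=0.01)
```

Because results vary slightly between autogluon runs, comparing this kind of metric across a few repeated runs may give a better sense of the spread than a single number.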
================================================
FILE: scripts/reproducibility/autogluon/benchmark_ag.py
================================================
import pandas as pd
import os
import gc
import joblib
import datetime
import matplotlib as mpl
from sklearn.metrics import roc_auc_score, roc_curve

mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)

import sys
sys.path.append('../')
from benchmark_utils import load_data, get_recall

from autogluon.tabular import TabularPredictor


def run_ag(dataset, base_path, time_limit=3600, presets=None, hyperparameters=None, feature_metadata='infer', verbosity=2):

    gc.collect()

    features, df_train, df_test = load_data(dataset, base_path)

    dateTimeObj = datetime.datetime.now()
    timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S")

    suffix = (f"_{presets}" if presets is not None else "") \
        + (f"_{hyperparameters}" if hyperparameters is not None else "") \
        + ("_feature_metadata" if feature_metadata != 'infer' else "")
    folder = f"ag-{timestampStr}" \
        + suffix

    predictor = TabularPredictor(label='EVENT_LABEL', eval_metric='roc_auc',
                                 path=f"{base_path}/{dataset}/AutogluonModels/{folder}/",
                                 verbosity=verbosity)

    predictor.fit(df_train[features + ['EVENT_LABEL'] ],
                  time_limit=time_limit,
                  presets=presets,
                  hyperparameters=hyperparameters,
                  feature_metadata=feature_metadata)

    leaderboard = predictor.leaderboard(df_test[features + ['EVENT_LABEL'] ])
    leaderboard_file = "leaderboard" \
        + suffix \
        + ".csv"
    leaderboard.to_csv(f"{base_path}/{dataset}/{leaderboard_file}", index=False)

    df_pred = predictor.predict_proba(df_test[ features ], 
as_multiclass=False)
    auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred)
    logger.info(f"auc on test data: {auc}")

    pos_label = predictor.positive_class
    fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred, pos_label=pos_label)

    y_true = df_test['EVENT_LABEL']
    y_true = (y_true==pos_label)

    recall = get_recall(fpr, tpr, fpr_target=0.01)
    logger.info(f"tpr@1%fpr on test data: {recall}")

    test_metrics_ag_bq = {
        "labels": df_test['EVENT_LABEL'],
        "pred_prob": df_pred,
        "auc": auc,
        "tpr@1%fpr": recall,
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds
    }

    metrics_file = "test_metrics_ag" \
        + suffix \
        + ".joblib"
    joblib.dump(test_metrics_ag_bq, f"{base_path}/{dataset}/{metrics_file}")


================================================
FILE: scripts/reproducibility/autogluon/example-ag-ieeecis.ipynb
================================================
{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "7d350d0d", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "1d6a8c41", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.core.display import display, HTML\n", "from IPython.display import clear_output\n", "display(HTML(\"\"))" ] }, { "cell_type": "markdown", "id": "611127d9", "metadata": {}, "source": [ "## Step 1: pip install required packages if not installed already" ] }, { "cell_type": "code", "execution_count": 6, "id": "321cb018", "metadata": {}, "outputs": [], "source": [ "# !pip install autogluon\n", "import benchmark_ag\n", "from benchmark_ag import load_data, run_ag" ] }, { "cell_type": "markdown", "id": "1d191102", "metadata": {}, "source": [ "## Step 2: download data using fdb\n", "Example: https://github.com/amazon-research/fraud-dataset-benchmark/blob/main/scripts/examples/Test_FDB_Loader.ipynb" ] }, { "cell_type": "code", 
"execution_count": 7, "id": "33fd8a7b", "metadata": {}, "outputs": [], "source": [ "# This is where datasets are stored: {BASE_PATH}/{dataset}/\n", "BASE_PATH = \"/home/ec2-user/SageMaker/official-dataset-names\"\n", "dataset = \"IEEE-CIS Fraud Detection\"" ] }, { "cell_type": "markdown", "id": "c4bca656", "metadata": {}, "source": [ "Make sure three files are downloaded:\n", "1. {BASE_PATH}/{dataset}/train.csv\n", "2. {BASE_PATH}/{dataset}/test.csv\n", "3. {BASE_PATH}/{dataset}/test_labels.csv" ] }, { "cell_type": "markdown", "id": "52b3fcdb", "metadata": {}, "source": [ "## Step 3: look at data" ] }, { "cell_type": "code", "execution_count": 8, "id": "b07893d7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n", "INFO: benchmark_utils.py: (313060, 194)\n", "INFO: benchmark_utils.py: (27330, 71)\n", "INFO: benchmark_utils.py: (29527, 2)\n", "INFO: benchmark_utils.py: (27329, 72)\n", "INFO: benchmark_utils.py: 67\n", "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n" ] } ], "source": [ "features, df_train, df_test = load_data(dataset, BASE_PATH)" ] }, { "cell_type": "code", "execution_count": 9, "id": "8cad73e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[HTML table: first 5 rows of df_train; the same rows and columns appear in the text/plain output of this cell]\n", "
" ], "text/plain": [ " EVENT_LABEL transactionamt productcd card1 card2 card3 card5 card6 addr1 addr2 dist1 dist2 p_emaildomain r_emaildomain c1 c2 c4 c5 c6 c7 c8 c9 c10 \\\n", "0 0 68.500 W 13926.000 NaN 150.000 142.000 credit 315.000 87.000 19.000 NaN NaN NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000 \n", "1 0 29.000 W 2755.000 404.000 150.000 102.000 credit 325.000 87.000 NaN NaN gmail.com NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 \n", "2 0 59.000 W 4663.000 490.000 150.000 166.000 debit 330.000 87.000 287.000 NaN outlook.com NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000 \n", "3 0 50.000 W 18132.000 567.000 150.000 117.000 debit 476.000 87.000 NaN NaN yahoo.com NaN 2.000 5.000 0.000 0.000 4.000 0.000 0.000 1.000 0.000 \n", "4 0 50.000 H 4497.000 514.000 150.000 102.000 credit 420.000 87.000 NaN NaN gmail.com NaN 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 \n", "\n", " c11 c12 c13 c14 d1 d2 d3 d4 d5 d10 d11 d15 m1 m2 m3 m4 m6 m7 m8 m9 v1 v3 v4 v6 v8 v11 v13 v14 v17 v20 v23 v26 \\\n", "0 2.000 0.000 1.000 1.000 14.000 NaN 13.000 NaN NaN 12.000 12.000 -1.000 T T T M2 T NaN NaN NaN 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 \n", "1 1.000 0.000 1.000 1.000 0.000 NaN NaN -1.000 NaN -1.000 NaN -1.000 NaN NaN NaN M0 T NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 1.000 1.000 1.000 \n", "2 1.000 0.000 1.000 1.000 0.000 NaN NaN -1.001 NaN -1.001 313.999 313.999 T T T M0 F F F F 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 \n", "3 1.000 0.000 25.000 1.000 112.000 112.000 0.000 92.999 0.000 82.999 NaN 109.999 NaN NaN NaN M0 F NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 1.000 0.000 1.000 1.000 1.000 \n", "4 1.000 0.000 1.000 1.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "\n", " v27 v30 v36 v37 v40 v41 v44 v47 v48 v54 v56 v59 v62 v65 v67 v68 v70 v76 v78 v80 v82 v86 v88 v89 v91 v107 v108 
v111 v115 v117 v120 v121 \\\n", "0 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "1 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "2 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "3 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "\n", " v123 v124 v127 v129 v130 v136 v138 v139 v142 v147 v156 v160 v162 v165 v166 v169 v171 v173 v175 v176 v178 v180 v182 v185 v187 v188 v198 v203 v205 v207 \\\n", "0 1.000 1.000 117.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 1.000 1.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2 1.000 1.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "3 1.000 1.000 1758.000 0.000 354.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "4 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 169690.800 0.000 5155.000 2840.000 0.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000 \n", "\n", " v209 v210 v215 v218 v220 v221 v223 v224 v226 v228 v229 v234 v235 v238 v240 v250 v252 v253 
v257 v258 v260 v261 v264 v266 v267 v271 v274 v277 v281 v283 v284 v285 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 0.000 \n", "1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 0.000 \n", "2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 0.000 \n", "3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 0.000 0.000 10.000 \n", "4 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 \n", "\n", " v286 v289 v291 v294 v296 v297 v301 v303 v305 v307 v309 v310 v314 v320 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_15 id_16 \\\n", "0 0.000 0.000 1.000 1.000 0.000 0.000 0.000 0.000 1.000 117.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "3 0.000 0.000 1.000 38.000 0.000 0.000 0.000 0.000 1.000 1758.000 0.000 354.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "4 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 70787.000 NaN NaN NaN NaN NaN NaN 100.000 NotFound NaN New NotFound \n", "\n", " id_17 id_18 id_19 id_20 id_28 id_29 id_31 id_35 id_36 id_37 id_38 devicetype deviceinfo ENTITY_ID EVENT_TIMESTAMP ENTITY_TYPE \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13926.0_315.0_-13.0 2021-01-02 
00:00:00 user \n", "1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2755.0_325.0_1.0 2021-01-02 00:00:01 user \n", "2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4663.0_330.0_1.0 2021-01-02 00:01:09 user \n", "3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18132.0_476.0_-111.0 2021-01-02 00:01:39 user \n", "4 166.000 NaN 542.000 144.000 New NotFound samsung browser 6.2 T F T T mobile SAMSUNG SM-G892A Build/NRD90M 4497.0_420.0_1.0 2021-01-02 00:01:46 user \n", "\n", " EVENT_ID LABEL_TIMESTAMP \n", "0 2987000.000 2022-01-01T20:30:04Z \n", "1 2987001.000 2022-01-01T20:30:04Z \n", "2 2987002.000 2022-01-01T20:30:04Z \n", "3 2987003.000 2022-01-01T20:30:04Z \n", "4 2987004.000 2022-01-01T20:30:04Z " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head()" ] }, { "cell_type": "code", "execution_count": 10, "id": "6400b0c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " transactionamt productcd card1 card2 card3 card5 card6 addr1 dist1 p_emaildomain r_emaildomain c1 c2 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 v62 \\\n", "0 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "1 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "2 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "3 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "4 31.950 W 9500.000 321.000 150.000 226.000 debit 204.000 74.000 NaN NaN 3.000 3.000 0.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 6.000 3.000 1.000 \n", "\n", " v70 v76 v78 v82 v91 v127 v130 v139 v160 v165 v187 v203 v207 v209 v210 v221 v234 v257 v258 v261 v264 v266 v267 v271 v274 v277 v283 \\\n", "0 0.000 NaN NaN NaN NaN 109411.000 2301.000 0.000 2401.000 66104.000 1.000 103183.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "1 0.000 NaN NaN NaN NaN 109536.000 2301.000 0.000 2401.000 66229.000 1.000 103308.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "2 0.000 NaN NaN NaN NaN 109661.000 2301.000 0.000 2401.000 66354.000 1.000 103433.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "3 0.000 NaN NaN NaN NaN 109786.000 2301.000 0.000 2401.000 66479.000 1.000 103558.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "4 1.000 1.000 2.000 1.000 1.000 27.950 27.950 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 \n", "\n", " 
 v285 v289 v291 v294 id_01 id_02 id_05 id_06 id_09 id_13 id_17 id_19 id_20 devicetype deviceinfo EVENT_TIMESTAMP ENTITY_ID ENTITY_TYPE EVENT_ID EVENT_LABEL \n", "0 26.000 1.000 2.000 926.000 -10.000 1411.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:11:15 15775.0_330.0_129.0 user 3548013.000 0 \n", "1 26.000 1.000 2.000 927.000 -10.000 693.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:11:29 15775.0_330.0_129.0 user 3548014.000 0 \n", "2 26.000 1.000 2.000 928.000 -10.000 1116.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:11:45 15775.0_330.0_129.0 user 3548015.000 0 \n", "3 26.000 1.000 2.000 929.000 -10.000 1589.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:12:00 15775.0_330.0_129.0 user 3548016.000 0 \n", "4 1.000 1.000 1.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2021-06-21 23:12:11 9500.0_204.0_150.0 user 3548017.000 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_test.head()" ] }, { "cell_type": "markdown", "id": "cd88bef4", "metadata": {}, "source": [ "## Step 3: Run AutoGluon" ] }, { "cell_type": "markdown", "id": "4c153785", "metadata": {}, "source": [ "1. The function run_ag below also saves a leaderboard file (leaderboard_xxx.csv) and a test metrics file (test_metrics_xxx.joblib) into {BASE_PATH}/{dataset}/\n", "2. 
AutoGluon models are saved at {BASE_PATH}/{dataset}/AutogluonModels" ] }, { "cell_type": "code", "execution_count": 13, "id": "54b6b65a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n", "INFO: benchmark_utils.py: (313060, 194)\n", "INFO: benchmark_utils.py: (27330, 71)\n", "INFO: benchmark_utils.py: (29527, 2)\n", "INFO: benchmark_utils.py: (27329, 72)\n", "INFO: benchmark_utils.py: 67\n", "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n", "INFO: autogluon.tabular.predictor.predictor: Presets specified: ['best_quality']\n", "INFO: autogluon.tabular.learner.default_learner: Beginning AutoGluon training ... 
Time limit = 3600s\n", "INFO: autogluon.tabular.learner.default_learner: AutoGluon will save models to \"/home/ec2-user/SageMaker/official-dataset-names/IEEE-CIS Fraud Detection/AutogluonModels/ag-20220615_135015_best_quality/\"\n", "INFO: autogluon.tabular.learner.default_learner: AutoGluon Version: 0.4.2\n", "INFO: autogluon.tabular.learner.default_learner: Python Version: 3.7.10\n", "INFO: autogluon.tabular.learner.default_learner: Operating System: Linux\n", "INFO: autogluon.tabular.learner.default_learner: Train Data Rows: 313060\n", "INFO: autogluon.tabular.learner.default_learner: Train Data Columns: 67\n", "INFO: autogluon.tabular.learner.default_learner: Label Column: EVENT_LABEL\n", "INFO: autogluon.tabular.learner.default_learner: Preprocessing data ...\n", "Level 25: autogluon.core.utils.utils: AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n", "INFO: autogluon.core.utils.utils: \t2 unique label values: [0, 1]\n", "Level 25: autogluon.core.utils.utils: \tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n", "INFO: autogluon.core.data.label_cleaner: Selected class <--> label mapping: class 1 = 1, class 0 = 0\n", "INFO: autogluon.tabular.learner.default_learner: Using Feature Generators to preprocess the data ...\n", "INFO: autogluon.features.generators.abstract: Fitting AutoMLPipelineFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \tAvailable Memory: 502950.08 MB\n", "INFO: autogluon.features.generators.abstract: \tTrain Data (Original) Memory Usage: 248.99 MB (0.0% of available memory)\n", "INFO: autogluon.features.generators.abstract: \tInferring data type of each feature based on column values. 
Set feature_metadata_in to manually specify special dtypes of the features.\n", "INFO: autogluon.features.generators.abstract: \tStage 1 Generators:\n", "INFO: autogluon.features.generators.abstract: \t\tFitting AsTypeFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \tStage 2 Generators:\n", "INFO: autogluon.features.generators.abstract: \t\tFitting FillNaFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \tStage 3 Generators:\n", "INFO: autogluon.features.generators.abstract: \t\tFitting IdentityFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \t\tFitting CategoryFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \tStage 4 Generators:\n", "INFO: autogluon.features.generators.abstract: \t\tFitting DropUniqueFeatureGenerator...\n", "INFO: autogluon.features.generators.abstract: \tTypes of features in original data (raw dtype, special dtypes):\n", "INFO: autogluon.common.features.feature_metadata: \t\t('float', []) : 61 | ['transactionamt', 'card1', 'card2', 'card3', 'card5', ...]\n", "INFO: autogluon.common.features.feature_metadata: \t\t('object', []) : 6 | ['productcd', 'card6', 'p_emaildomain', 'r_emaildomain', 'devicetype', ...]\n", "INFO: autogluon.features.generators.abstract: \tTypes of features in processed data (raw dtype, special dtypes):\n", "INFO: autogluon.common.features.feature_metadata: \t\t('category', []) : 6 | ['productcd', 'card6', 'p_emaildomain', 'r_emaildomain', 'devicetype', ...]\n", "INFO: autogluon.common.features.feature_metadata: \t\t('float', []) : 61 | ['transactionamt', 'card1', 'card2', 'card3', 'card5', ...]\n", "INFO: autogluon.features.generators.abstract: \t2.6s = Fit runtime\n", "INFO: autogluon.features.generators.abstract: \t67 features in original data used to generate 67 features in processed data.\n", "INFO: 
autogluon.features.generators.abstract: \tTrain Data (Processed) Memory Usage: 154.97 MB (0.0% of available memory)\n", "INFO: autogluon.tabular.learner.default_learner: Data preprocessing and feature engineering runtime = 2.94s ...\n", "Level 25: autogluon.core.trainer.abstract_trainer: AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'\n", "Level 25: autogluon.core.trainer.abstract_trainer: \tThis metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()\n", "INFO: autogluon.core.trainer.abstract_trainer: \tTo change this, specify the eval_metric parameter of Predictor()\n", "INFO: autogluon.core.trainer.abstract_trainer: AutoGluon will fit 2 stack levels (L1 to L2) ...\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting 13 L1 models ...\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 2397.44s of the 3597.05s of remaining time.\n", "WARNING: autogluon.core.models.ensemble.bagged_ensemble_model: \tNot enough time to generate out-of-fold predictions for model. Estimated time required was 3589.28s compared to 3115.73s of available time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping KNeighborsUnif_BAG_L1.\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 2390.48s of the 3590.09s of remaining time.\n", "WARNING: autogluon.core.models.ensemble.bagged_ensemble_model: \tNot enough time to generate out-of-fold predictions for model. Estimated time required was 3399.45s compared to 3106.63s of available time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping KNeighborsDist_BAG_L1.\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMXT_BAG_L1 ... 
Training model for up to 2383.77s of the 3583.38s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9629\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t177.58s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t36.28s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBM_BAG_L1 ... Training model for up to 2196.14s of the 3395.75s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.969\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t97.61s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t17.24s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 2095.39s of the 3295.0s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9456\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t15.42s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t24.02s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 2054.81s of the 3254.42s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9474\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t13.73s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t26.66s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: CatBoost_BAG_L1 ... 
Training model for up to 2012.05s of the 3211.67s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9563\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t1615.45s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t3.48s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesGini_BAG_L1 ... Training model for up to 394.91s of the 1594.53s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9468\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t13.38s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t35.27s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesEntr_BAG_L1 ... Training model for up to 344.0s of the 1543.61s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9505\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t15.11s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t41.22s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 285.31s of the 1484.93s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9086\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t166.51s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t6.71s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: XGBoost_BAG_L1 ... 
Training model for up to 116.32s of the 1315.93s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9652\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t94.35s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t8.73s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetTorch_BAG_L1 ... Training model for up to 19.33s of the 1218.94s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping NeuralNetTorch_BAG_L1.\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMLarge_BAG_L1 ... Training model for up to 3.46s of the 1203.08s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping LightGBMLarge_BAG_L1.\n", "INFO: autogluon.core.trainer.abstract_trainer: Completed 1/20 k-fold bagging repeats ...\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 1197.96s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9719\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t92.94s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.09s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting 11 L2 models ...\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMXT_BAG_L2 ... 
Training model for up to 1104.9s of the 1104.8s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9747\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t21.16s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t2.79s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBM_BAG_L2 ... Training model for up to 1080.72s of the 1080.63s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9752\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t11.64s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t1.43s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestGini_BAG_L2 ... Training model for up to 1067.58s of the 1067.49s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9632\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t16.3s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t21.33s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: RandomForestEntr_BAG_L2 ... Training model for up to 1029.08s of the 1028.98s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9649\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t14.04s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t20.67s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: CatBoost_BAG_L2 ... 
Training model for up to 993.29s of the 993.19s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9762\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t393.78s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t1.04s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesGini_BAG_L2 ... Training model for up to 598.03s of the 597.92s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9641\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t14.12s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t37.16s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: ExtraTreesEntr_BAG_L2 ... Training model for up to 545.02s of the 544.92s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9641\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t14.5s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t37.31s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetFastAI_BAG_L2 ... Training model for up to 491.61s of the 491.51s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9736\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t289.96s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t6.78s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: XGBoost_BAG_L2 ... 
Training model for up to 198.91s of the 198.81s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9752\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t25.26s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t3.99s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: NeuralNetTorch_BAG_L2 ... Training model for up to 171.76s of the 171.67s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \tTime limit exceeded... Skipping NeuralNetTorch_BAG_L2.\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: LightGBMLarge_BAG_L2 ... Training model for up to 148.85s of the 148.75s of remaining time.\n", "INFO: autogluon.core.models.ensemble.bagged_ensemble_model: \tFitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9753\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t44.1s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t2.06s\t = Validation runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: Completed 1/20 k-fold bagging repeats ...\n", "INFO: autogluon.core.trainer.abstract_trainer: Fitting model: WeightedEnsemble_L3 ... 
Training model for up to 360.0s of the 95.08s of remaining time.\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.9767\t = Validation score (roc_auc)\n", "INFO: autogluon.core.trainer.abstract_trainer: \t110.29s\t = Training runtime\n", "INFO: autogluon.core.trainer.abstract_trainer: \t0.13s\t = Validation runtime\n", "INFO: autogluon.tabular.learner.default_learner: AutoGluon training complete, total runtime = 3615.62s ... Best model: \"WeightedEnsemble_L3\"\n", "INFO: autogluon.tabular.predictor.predictor: TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"/home/ec2-user/SageMaker/official-dataset-names/IEEE-CIS Fraud Detection/AutogluonModels/ag-20220615_135015_best_quality/\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order\n", "0 WeightedEnsemble_L2 0.889 0.972 17.168 130.219 491.324 0.008 0.094 92.936 2 True 10\n", "1 ExtraTreesGini_BAG_L1 0.879 0.947 1.356 35.271 13.385 1.356 35.271 13.385 1 True 6\n", "2 RandomForestEntr_BAG_L1 0.878 0.947 0.948 26.657 13.727 0.948 26.657 13.727 1 True 4\n", "3 ExtraTreesEntr_BAG_L1 0.878 0.950 1.526 41.223 15.112 1.526 41.223 15.112 1 True 7\n", "4 WeightedEnsemble_L3 0.876 0.977 36.877 330.085 3131.503 0.023 0.126 110.288 3 True 21\n", "5 RandomForestGini_BAG_L2 0.873 0.963 26.468 220.941 2225.452 0.544 21.332 16.300 2 True 13\n", "6 RandomForestGini_BAG_L1 0.872 0.946 0.740 24.016 15.415 0.740 24.016 15.415 1 True 3\n", "7 RandomForestEntr_BAG_L2 0.871 0.965 26.435 220.275 2223.195 0.511 20.666 14.044 2 True 14\n", "8 ExtraTreesGini_BAG_L2 0.871 0.964 26.592 236.773 2223.269 0.669 37.164 14.118 2 True 16\n", "9 CatBoost_BAG_L2 0.868 0.976 26.427 200.651 2602.927 0.504 1.042 393.775 2 True 15\n", "10 ExtraTreesEntr_BAG_L2 0.868 0.964 26.599 236.924 2223.651 0.675 37.315 14.499 2 True 17\n", "11 CatBoost_BAG_L1 0.865 
0.956 1.264 3.485 1615.448 1.264 3.485 1615.448 1 True 5\n", "12 LightGBM_BAG_L2 0.864 0.975 26.580 201.041 2220.792 0.656 1.432 11.641 2 True 12\n", "13 XGBoost_BAG_L2 0.864 0.975 28.676 203.598 2234.414 2.753 3.989 25.263 2 True 19\n", "14 LightGBMXT_BAG_L2 0.862 0.975 26.927 202.401 2230.314 1.004 2.792 21.163 2 True 11\n", "15 LightGBMLarge_BAG_L2 0.860 0.975 26.780 201.669 2253.252 0.857 2.060 44.101 2 True 20\n", "16 NeuralNetFastAI_BAG_L2 0.859 0.974 30.341 206.390 2499.116 4.417 6.781 289.964 2 True 18\n", "17 XGBoost_BAG_L1 0.857 0.965 3.367 8.731 94.352 3.367 8.731 94.352 1 True 9\n", "18 LightGBMXT_BAG_L1 0.853 0.963 7.588 36.276 177.582 7.588 36.276 177.582 1 True 1\n", "19 LightGBM_BAG_L1 0.851 0.969 3.731 17.236 97.615 3.731 17.236 97.615 1 True 2\n", "20 NeuralNetFastAI_BAG_L1 0.837 0.909 5.404 6.713 166.515 5.404 6.713 166.515 1 True 8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO: benchmark_ag.py: auc on test data: 0.8761825926835967\n", "INFO: benchmark_ag.py: tpr@1%fpr on test data: 0.4408502772643253\n" ] } ], "source": [ "run_ag(dataset, BASE_PATH, time_limit=3600, presets='best_quality')" ] }, { "cell_type": "code", "execution_count": null, "id": "0f720c57", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_mxnet_latest_p37", "language": "python", "name": "conda_mxnet_latest_p37" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: scripts/reproducibility/autosklearn/README.md ================================================ ## Steps to reproduce Auto-sklearn models 1. Load and save the datasets locally using [FDB Loader](../../examples/Test_FDB_Loader.ipynb). 
Keep note of `{DATASET_PATH}`, the local directory that contains the `train.csv`, `test.csv` and `test_labels.csv` files produced by the FDB loader.

2. Run `benchmark_autosklearn.py` as follows:
```
python3 benchmark_autosklearn.py {DATASET_PATH}
```

3. After a successful run, the script saves its results in `{DATASET_PATH}`. The evaluation metrics on `test.csv` are saved in `test_metrics_autosklearn.joblib`.

*Note: Python 3.7+ is needed to run the version of auto-sklearn used here and to reproduce the results. Like other AutoML frameworks, auto-sklearn is not perfectly reproducible because some underlying models are not deterministically seeded. However, the variations in results are within acceptable error margins.*

================================================
FILE: scripts/reproducibility/autosklearn/benchmark_autosklearn.py
================================================
import json
import joblib
import datetime
import numpy as np
import pandas as pd
import os, sys, shutil

from autosklearn.metrics import roc_auc, log_loss
from autosklearn.classification import AutoSklearnClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from pandas.api.types import is_numeric_dtype, is_string_dtype

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)

# Logging configuration handed to AutoSklearnClassifier via `logging_config`.
logging_config = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'simple': {
            'format': '%(levelname)-8s %(name)-15s %(message)s'
        }
    },
    'handlers':{
        'console_handler': {
            'class': 'logging.StreamHandler',
            'formatter': 'simple'
        },
        'file_handler': {
            'class':'logging.FileHandler',
            'mode': 'a',
            'encoding': 'utf-8',
            'filename':'main.log',
            'formatter': 'simple'
        },
        'spec_handler':{
            'class':'logging.FileHandler',
            'filename':'dummy_autosklearn.log',
            'formatter': 'simple'
        },
        'distributed_logfile':{
            'filename':'distributed.log',
            'class': 'logging.FileHandler',
            'formatter': 'simple',
            'level': 'DEBUG'
        }
    },
    'loggers': {
        '': {
            'level': 'INFO',
            'handlers':['file_handler', 'console_handler']
        },
        'autosklearn': {
            'level': 'INFO',
            'propagate': False,
            'handlers': ['spec_handler']
        },
        'smac': {
            'level': 'INFO',
            'propagate': False,
            'handlers': ['spec_handler']
        },
        'EnsembleBuilder': {
            'level': 'INFO',
            'propagate': False,
            'handlers': ['spec_handler']
        },
    },
}


def load_data(dataset_path):
    logger.info(dataset_path)
    df_train = pd.read_csv(f"{dataset_path}/train.csv", lineterminator='\n')
    logger.info(df_train.shape)
    df_test = pd.read_csv(f"{dataset_path}/test.csv")
    logger.info(df_test.shape)
    df_test_labels = pd.read_csv(f"{dataset_path}/test_labels.csv")
    logger.info(df_test_labels.shape)
    # Attach the held-out labels to the test events.
    df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID")
    logger.info(df_test.shape)

    # Metadata columns are never used as model features.
    features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID")
    features = [x for x in df_test.columns if x not in features_to_exclude]
    logger.info(len(features))
    logger.info(features)

    return features, df_train, df_test


def get_recall(fpr, tpr, fpr_target=0.01):
    # Linearly interpolate the ROC curve to read off TPR at the target FPR.
    return np.interp(fpr_target, fpr, tpr)


def run_autosklearn(dataset_path):
    features, df_train, df_test = load_data(dataset_path)

    dateTimeObj = datetime.datetime.now()
    timestampStr = dateTimeObj.strftime("%Y%m%d_%H%M%S")

    numeric_features = [f for f in features if is_numeric_dtype(df_train[f])]
    categorical_features = [f for f in features if f not in numeric_features]
    logger.info(f'categorical: {categorical_features}')
    logger.info(f'numeric: {numeric_features}')

    # Map the two label values to 0/1. Assigning the result back avoids a chained
    # in-place replace, which newer pandas versions warn about or ignore.
    labels = sorted(df_train['EVENT_LABEL'].unique())
    df_train['EVENT_LABEL'] = df_train['EVENT_LABEL'].replace({labels[0]: 0, labels[1]: 1})
    df_test['EVENT_LABEL'] = df_test['EVENT_LABEL'].replace({labels[0]: 0, labels[1]: 1})

    for df in [df_train, df_test]:
        df[categorical_features] = df[categorical_features].fillna('')
        df[categorical_features] = df[categorical_features].astype('category')

    out_dir = f"{dataset_path}/AutoSklearnModels/"
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)

    automl = AutoSklearnClassifier(
        metric=roc_auc,
        scoring_functions=[roc_auc, log_loss],
        tmp_folder=out_dir,
        # for debugging
        delete_tmp_folder_after_terminate=False,
        logging_config=logging_config,
        n_jobs=-1,
        memory_limit=None
    )

    assert len(categorical_features) + len(numeric_features) == len(features)

    logger.info('Fitting')
    automl.fit(df_train[features], df_train['EVENT_LABEL'])
    joblib.dump(automl, f"{dataset_path}/automl.joblib")

    cv = pd.DataFrame(automl.cv_results_)
    cv.to_csv(f"{dataset_path}/cv_results_autosklearn.csv", index=False)

    df_pred = automl.predict_proba(df_test[features])[:,1]
    auc_score = roc_auc_score(df_test['EVENT_LABEL'], df_pred)
    logger.info(f"auc on test data: {auc_score}")

    fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'], df_pred)
    recall = get_recall(fpr, tpr, fpr_target=0.01)
    logger.info(f"tpr@1%fpr on test data: {recall}")

    test_metrics = {
        "labels": df_test['EVENT_LABEL'],
        "pred_prob": df_pred,
        "auc": auc_score,
        "tpr@1%fpr": recall,
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds
    }
    joblib.dump(test_metrics, f"{dataset_path}/test_metrics_autosklearn.joblib")


if __name__ == "__main__":
    args = sys.argv
    logger.info(args)
    run_autosklearn(args[1])

================================================
FILE: scripts/reproducibility/benchmark_utils.py
================================================
import numpy as np
import pandas as pd
import os
import matplotlib as mpl

mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)


def load_data(dataset, base_path):
    logger.info(dataset)
    df_train = pd.read_csv(f"{base_path}/{dataset}/train.csv", lineterminator='\n')
    logger.info(df_train.shape)
    df_test = pd.read_csv(f"{base_path}/{dataset}/test.csv")
    logger.info(df_test.shape)
    df_test_labels = pd.read_csv(f"{base_path}/{dataset}/test_labels.csv")
    logger.info(df_test_labels.shape)
    # Attach the held-out labels to the test events.
    df_test = df_test.merge(df_test_labels, how="inner", on="EVENT_ID")
    logger.info(df_test.shape)

    # Metadata columns are never used as model features.
    features_to_exclude = ("EVENT_LABEL", "EVENT_TIMESTAMP", "LABEL_TIMESTAMP", "ENTITY_TYPE", "ENTITY_ID", "EVENT_ID")
    features = [x for x in df_test.columns if x not in features_to_exclude]
    logger.info(len(features))
    logger.info(features)

    return features, df_train, df_test


def get_recall(fpr, tpr, fpr_target=0.01):
    # Linearly interpolate the ROC curve to read off TPR at the target FPR.
    return np.interp(fpr_target, fpr, tpr)

================================================
FILE: scripts/reproducibility/h2o/README.md
================================================
- benchmark_h2o.py: a script for h2o benchmarking
- example-h2o-ieeecis.ipynb: an example notebook using benchmark_h2o.py

Note that h2o is not perfectly reproducible because some underlying models are not deterministically seeded, so you might see slightly different results from those in the paper.
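
The benchmark scripts report `tpr@1%fpr` through `get_recall` from `benchmark_utils.py`, which interpolates the ROC curve at a target false positive rate. The sketch below illustrates that interpolation in isolation. The `fpr` and `tpr` arrays are hypothetical values chosen for illustration, not results from any benchmark run; in the scripts they come from `sklearn.metrics.roc_curve`.

```python
import numpy as np


def get_recall(fpr, tpr, fpr_target=0.01):
    # Same one-liner as in benchmark_utils.py: linear interpolation of TPR at fpr_target.
    return np.interp(fpr_target, fpr, tpr)


# Hypothetical ROC points (assumption, for illustration only).
# np.interp expects the x-coordinates (fpr) to be non-decreasing.
fpr = np.array([0.0, 0.005, 0.02, 1.0])
tpr = np.array([0.0, 0.30, 0.60, 1.0])

# The target 0.01 lies one third of the way from 0.005 to 0.02,
# so the result lies one third of the way from 0.30 to 0.60.
recall_at_1pct = get_recall(fpr, tpr)
print(f"tpr@1%fpr: {recall_at_1pct:.3f}")  # approximately 0.400
```

Because the value is interpolated between neighbouring ROC points, it can differ slightly from the TPR at any single observed threshold.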
================================================
FILE: scripts/reproducibility/h2o/benchmark_h2o.py
================================================
import pandas as pd
import os
import gc
import joblib
import matplotlib as mpl
from sklearn.metrics import roc_auc_score, roc_curve

mpl.rcParams['figure.dpi'] = 150
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import logging
FORMAT = "%(levelname)s: %(name)s: %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
logging.basicConfig(level=logging.WARN, format=FORMAT, datefmt=DATE_FORMAT)
logger = logging.getLogger(os.path.basename(__file__))
logger.setLevel(logging.DEBUG)

import sys
sys.path.append('../')
from benchmark_utils import load_data, get_recall

import h2o
from h2o.automl import H2OAutoML


def run_h2o(dataset, base_path, connect_url=None, time_limit=None, include_algos=None, exclude_algos=None, verbosity="info", seed=10):
    if connect_url is not None:
        _ = h2o.connect(url=connect_url, https=True, verbose=True)
        h2o.cluster().show_status(True)
    else:
        h2o.init()

    gc.collect()

    features, df_train, df_test = load_data(dataset, base_path)

    df_train_h2o = h2o.H2OFrame(df_train)
    feature_types_h2o = {k:df_train_h2o.types[k] for k in df_train_h2o.types if k in features}
    # force test schema the same as train schema, otherwise predict will throw errors
    df_test_h2o = h2o.H2OFrame(df_test, column_types=feature_types_h2o)

    df_train_h2o['EVENT_LABEL'] = df_train_h2o['EVENT_LABEL'].asfactor()
    df_test_h2o['EVENT_LABEL'] = df_test_h2o['EVENT_LABEL'].asfactor()

    aml = H2OAutoML(max_runtime_secs = time_limit, seed = seed,
                    include_algos=include_algos, exclude_algos=exclude_algos,
                    export_checkpoints_dir=f"{base_path}/{dataset}/H2OModels/",
                    verbosity=verbosity)

    # use validation error in the leaderboard to avoid leakage when calling aml.predict
    aml.train(x = features, y = 'EVENT_LABEL',
              training_frame = df_train_h2o,
              )

    lb = aml.leaderboard
    # lb.head(rows=lb.nrows)
    h2o.h2o.download_csv(lb, f"{base_path}/{dataset}/leaderboard_h2o.csv")

    lb_2 = h2o.automl.get_leaderboard(aml, extra_columns = "ALL")
    h2o.h2o.download_csv(lb_2, f"{base_path}/{dataset}/leaderboard_h2o_full.csv")

    # Get training timing info
    info = aml.training_info
    joblib.dump(info, f"{base_path}/{dataset}/training_info.joblib")

    df_pred_h2o = aml.predict(df_test_h2o[features])
    pos_label = df_test_h2o['EVENT_LABEL'].levels()[0][-1]  # levels are ordered alphabetically
    pos_label2 = 'p'+pos_label if pos_label=='1' else pos_label
    df_pred_h2o = (h2o.as_list(df_pred_h2o[pos_label2]))[pos_label2]

    auc = roc_auc_score(df_test['EVENT_LABEL'], df_pred_h2o)
    logger.info(f"auc on test data: {auc}")

    fpr, tpr, thresholds = roc_curve(df_test['EVENT_LABEL'].astype(str), df_pred_h2o, pos_label=pos_label)
    y_true = df_test['EVENT_LABEL']
    y_true = (y_true.astype(str)==pos_label)
    recall = get_recall(fpr, tpr, fpr_target=0.01)
    logger.info(f"tpr@1%fpr on test data: {recall}")

    test_metrics_h2o = {
        "pos_label": pos_label,
        "labels": df_test['EVENT_LABEL'],
        "pred_prob": df_pred_h2o,
        "auc": auc,
        "tpr@1%fpr": recall,
        "fpr": fpr,
        "tpr": tpr,
        "thresholds": thresholds
    }
    joblib.dump(test_metrics_h2o, f"{base_path}/{dataset}/test_metrics_h2o.joblib")

    h2o.cluster().shutdown(prompt=False)

================================================
FILE: scripts/reproducibility/h2o/example-h2o-ieeecis.ipynb
================================================
{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "afc2eecf", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "id": "f00a81aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.core.display import display, HTML\n", "from IPython.display import clear_output\n", "display(HTML(\"\"))" ] }, {
"cell_type": "code", "execution_count": 14, "id": "11759d10", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "id": "2baa2261", "metadata": {}, "source": [ "## Step 1: pip install required packages if not installed already" ] }, { "cell_type": "code", "execution_count": 3, "id": "1efdc80c", "metadata": {}, "outputs": [], "source": [ "# !pip install h2o\n", "import benchmark_h2o\n", "from benchmark_h2o import load_data, run_h2o" ] }, { "cell_type": "markdown", "id": "6b1e0a24", "metadata": {}, "source": [ "## Step 2: download data using fdb\n", "Example: https://github.com/amazon-research/fraud-dataset-benchmark/blob/main/scripts/examples/Test_FDB_Loader.ipynb" ] }, { "cell_type": "code", "execution_count": 4, "id": "0a34e883", "metadata": {}, "outputs": [], "source": [ "# This is where datasets are stored: {BASE_PATH}/{dataset}/\n", "BASE_PATH = \"/home/ec2-user/SageMaker/official-dataset-names\"\n", "dataset = \"IEEE-CIS Fraud Detection\"" ] }, { "cell_type": "markdown", "id": "8aed893e", "metadata": {}, "source": [ "Make sure three files are downloaded:\n", "1. {BASE_PATH}/{dataset}/train.csv\n", "2. {BASE_PATH}/{dataset}/test.csv\n", "3. 
{BASE_PATH}/{dataset}/test_labels.csv" ] }, { "cell_type": "markdown", "id": "b02d77e0", "metadata": {}, "source": [ "## Step 3: look at data" ] }, { "cell_type": "code", "execution_count": 5, "id": "9dfd0df9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n", "INFO: benchmark_utils.py: (313060, 194)\n", "INFO: benchmark_utils.py: (27330, 71)\n", "INFO: benchmark_utils.py: (29527, 2)\n", "INFO: benchmark_utils.py: (27329, 72)\n", "INFO: benchmark_utils.py: 67\n", "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n" ] } ], "source": [ "features, df_train, df_test = load_data(dataset, BASE_PATH)" ] }, { "cell_type": "code", "execution_count": 6, "id": "eebaa1d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
(HTML rendering of the df_train.head() table omitted; the same five rows appear in the text/plain output of this cell)
" ], "text/plain": [ " EVENT_LABEL transactionamt productcd card1 card2 card3 card5 card6 addr1 addr2 dist1 dist2 p_emaildomain r_emaildomain c1 c2 c4 c5 c6 c7 c8 c9 c10 \\\n", "0 0 68.500 W 13926.000 NaN 150.000 142.000 credit 315.000 87.000 19.000 NaN NaN NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000 \n", "1 0 29.000 W 2755.000 404.000 150.000 102.000 credit 325.000 87.000 NaN NaN gmail.com NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 \n", "2 0 59.000 W 4663.000 490.000 150.000 166.000 debit 330.000 87.000 287.000 NaN outlook.com NaN 1.000 1.000 0.000 0.000 1.000 0.000 0.000 1.000 0.000 \n", "3 0 50.000 W 18132.000 567.000 150.000 117.000 debit 476.000 87.000 NaN NaN yahoo.com NaN 2.000 5.000 0.000 0.000 4.000 0.000 0.000 1.000 0.000 \n", "4 0 50.000 H 4497.000 514.000 150.000 102.000 credit 420.000 87.000 NaN NaN gmail.com NaN 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 \n", "\n", " c11 c12 c13 c14 d1 d2 d3 d4 d5 d10 d11 d15 m1 m2 m3 m4 m6 m7 m8 m9 v1 v3 v4 v6 v8 v11 v13 v14 v17 v20 v23 v26 \\\n", "0 2.000 0.000 1.000 1.000 14.000 NaN 13.000 NaN NaN 12.000 12.000 -1.000 T T T M2 T NaN NaN NaN 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 \n", "1 1.000 0.000 1.000 1.000 0.000 NaN NaN -1.000 NaN -1.000 NaN -1.000 NaN NaN NaN M0 T NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 1.000 1.000 1.000 \n", "2 1.000 0.000 1.000 1.000 0.000 NaN NaN -1.001 NaN -1.001 313.999 313.999 T T T M0 F F F F 1.000 1.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 \n", "3 1.000 0.000 25.000 1.000 112.000 112.000 0.000 92.999 0.000 82.999 NaN 109.999 NaN NaN NaN M0 F NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 1.000 0.000 1.000 1.000 1.000 \n", "4 1.000 0.000 1.000 1.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "\n", " v27 v30 v36 v37 v40 v41 v44 v47 v48 v54 v56 v59 v62 v65 v67 v68 v70 v76 v78 v80 v82 v86 v88 v89 v91 v107 v108 
v111 v115 v117 v120 v121 \\\n", "0 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "1 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "2 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "3 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 1.000 1.000 1.000 1.000 1.000 1.000 \n", "\n", " v123 v124 v127 v129 v130 v136 v138 v139 v142 v147 v156 v160 v162 v165 v166 v169 v171 v173 v175 v176 v178 v180 v182 v185 v187 v188 v198 v203 v205 v207 \\\n", "0 1.000 1.000 117.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 1.000 1.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2 1.000 1.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "3 1.000 1.000 1758.000 0.000 354.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "4 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 169690.800 0.000 5155.000 2840.000 0.000 1.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 1.000 1.000 0.000 0.000 0.000 \n", "\n", " v209 v210 v215 v218 v220 v221 v223 v224 v226 v228 v229 v234 v235 v238 v240 v250 v252 v253 
v257 v258 v260 v261 v264 v266 v267 v271 v274 v277 v281 v283 v284 v285 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 0.000 \n", "1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 0.000 \n", "2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 1.000 0.000 0.000 \n", "3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000 0.000 0.000 10.000 \n", "4 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 \n", "\n", " v286 v289 v291 v294 v296 v297 v301 v303 v305 v307 v309 v310 v314 v320 id_01 id_02 id_03 id_04 id_05 id_06 id_09 id_10 id_11 id_12 id_13 id_15 id_16 \\\n", "0 0.000 0.000 1.000 1.000 0.000 0.000 0.000 0.000 1.000 117.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "2 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "3 0.000 0.000 1.000 38.000 0.000 0.000 0.000 0.000 1.000 1758.000 0.000 354.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "4 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 70787.000 NaN NaN NaN NaN NaN NaN 100.000 NotFound NaN New NotFound \n", "\n", " id_17 id_18 id_19 id_20 id_28 id_29 id_31 id_35 id_36 id_37 id_38 devicetype deviceinfo ENTITY_ID EVENT_TIMESTAMP ENTITY_TYPE \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13926.0_315.0_-13.0 2021-01-02 
00:00:00 user \n", "1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2755.0_325.0_1.0 2021-01-02 00:00:01 user \n", "2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4663.0_330.0_1.0 2021-01-02 00:01:09 user \n", "3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18132.0_476.0_-111.0 2021-01-02 00:01:39 user \n", "4 166.000 NaN 542.000 144.000 New NotFound samsung browser 6.2 T F T T mobile SAMSUNG SM-G892A Build/NRD90M 4497.0_420.0_1.0 2021-01-02 00:01:46 user \n", "\n", " EVENT_ID LABEL_TIMESTAMP \n", "0 2987000.000 2022-01-01T20:30:04Z \n", "1 2987001.000 2022-01-01T20:30:04Z \n", "2 2987002.000 2022-01-01T20:30:04Z \n", "3 2987003.000 2022-01-01T20:30:04Z \n", "4 2987004.000 2022-01-01T20:30:04Z " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "c89c46e9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " transactionamt productcd card1 card2 card3 card5 card6 addr1 dist1 p_emaildomain r_emaildomain c1 c2 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 v62 \\\n", "0 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "1 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "2 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "3 125.000 S 15775.000 481.000 150.000 102.000 credit 330.000 NaN NaN yahoo.com 5.000 3.000 3.000 0.000 0.000 0.000 8.000 0.000 3.000 5.000 0.000 61.000 5.000 0.000 \n", "4 31.950 W 9500.000 321.000 150.000 226.000 debit 204.000 74.000 NaN NaN 3.000 3.000 0.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 6.000 3.000 1.000 \n", "\n", " v70 v76 v78 v82 v91 v127 v130 v139 v160 v165 v187 v203 v207 v209 v210 v221 v234 v257 v258 v261 v264 v266 v267 v271 v274 v277 v283 \\\n", "0 0.000 NaN NaN NaN NaN 109411.000 2301.000 0.000 2401.000 66104.000 1.000 103183.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "1 0.000 NaN NaN NaN NaN 109536.000 2301.000 0.000 2401.000 66229.000 1.000 103308.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "2 0.000 NaN NaN NaN NaN 109661.000 2301.000 0.000 2401.000 66354.000 1.000 103433.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "3 0.000 NaN NaN NaN NaN 109786.000 2301.000 0.000 2401.000 66479.000 1.000 103558.000 877.000 1961.000 465.000 0.000 73.000 NaN NaN NaN NaN NaN NaN 0.000 NaN NaN 1.000 \n", "4 1.000 1.000 2.000 1.000 1.000 27.950 27.950 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.000 \n", "\n", " 
v285 v289 v291 v294 id_01 id_02 id_05 id_06 id_09 id_13 id_17 id_19 id_20 devicetype deviceinfo EVENT_TIMESTAMP ENTITY_ID ENTITY_TYPE EVENT_ID EVENT_LABEL \n", "0 26.000 1.000 2.000 926.000 -10.000 1411.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:11:15 15775.0_330.0_129.0 user 3548013.000 0 \n", "1 26.000 1.000 2.000 927.000 -10.000 693.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:11:29 15775.0_330.0_129.0 user 3548014.000 0 \n", "2 26.000 1.000 2.000 928.000 -10.000 1116.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:11:45 15775.0_330.0_129.0 user 3548015.000 0 \n", "3 26.000 1.000 2.000 929.000 -10.000 1589.000 6.000 0.000 0.000 52.000 166.000 633.000 533.000 desktop Windows 2021-06-21 23:12:00 15775.0_330.0_129.0 user 3548016.000 0 \n", "4 1.000 1.000 1.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2021-06-21 23:12:11 9500.0_204.0_150.0 user 3548017.000 0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_test.head()" ] }, { "cell_type": "markdown", "id": "71fd76ac", "metadata": {}, "source": [ "## Step 3: run H2O" ] }, { "cell_type": "markdown", "id": "7393d373", "metadata": {}, "source": [ "1. The function run_h2o below also saves a leaderboard file (leaderboard_xxx.csv) and a test metrics file (test_metrics_xxx.joblib) into {BASE_PATH}/{dataset}/, respectively\n", "2. 
H2O models are saved at {BASE_PATH}/{dataset}/H2OModels" ] }, { "cell_type": "code", "execution_count": 8, "id": "f851663a", "metadata": {}, "outputs": [], "source": [ "# H2OStartupError: Your java is not supported: java version \"1.7.0_261\"; OpenJDK Runtime Environment (amzn-2.6.22.1.84.amzn1-x86_64 u261-b02); OpenJDK 64-Bit Server VM (build 24.261-b02, mixed mode)\n", "# If you see this error above, you may need to run the following instructions:\n", "\n", "# !sudo yum install -y java-1.8.0-openjdk.x86_64\n", "# !sudo /usr/sbin/alternatives --set java /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java\n", "# !sudo /usr/sbin/alternatives --set javac /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/javac" ] }, { "cell_type": "markdown", "id": "d259b767", "metadata": {}, "source": [ "https://swiftotter.com/technical/amazon-aws-jenkins-2-60-1-java-8-update#/" ] }, { "cell_type": "code", "execution_count": 9, "id": "63ec8755", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "openjdk version \"1.8.0_312\"\n", "OpenJDK Runtime Environment (build 1.8.0_312-b07)\n", "OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)\n" ] } ], "source": [ "!java -version" ] }, { "cell_type": "code", "execution_count": 13, "id": "9dc0646d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'3.36.1.2'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import h2o\n", "h2o.__version__" ] }, { "cell_type": "code", "execution_count": 21, "id": "7c060159", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking whether there is an H2O instance running at http://localhost:54321 ..... 
not found.\n", "Attempting to start a local H2O server...\n", " Java Version: openjdk version \"1.8.0_312\"; OpenJDK Runtime Environment (build 1.8.0_312-b07); OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)\n", " Starting server from /home/ec2-user/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar\n", " Ice root: /tmp/tmpag6zcv5a\n", " JVM stdout: /tmp/tmpag6zcv5a/h2o_ec2_user_started_from_python.out\n", " JVM stderr: /tmp/tmpag6zcv5a/h2o_ec2_user_started_from_python.err\n", " Server is running at http://127.0.0.1:54321\n", "Connecting to H2O server at http://127.0.0.1:54321 ... successful.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "-------------------------- -------------------------------\n", "H2O_cluster_uptime: 02 secs\n", "H2O_cluster_timezone: UTC\n", "H2O_data_parsing_timezone: UTC\n", "H2O_cluster_version: 3.36.1.2\n", "H2O_cluster_version_age: 20 days\n", "H2O_cluster_name: H2O_from_python_ec2_user_t9z3ig\n", "H2O_cluster_total_nodes: 1\n", "H2O_cluster_free_memory: 26.64 Gb\n", "H2O_cluster_total_cores: 64\n", "H2O_cluster_allowed_cores: 64\n", "H2O_cluster_status: locked, healthy\n", "H2O_connection_url: http://127.0.0.1:54321\n", "H2O_connection_proxy: {\"http\": null, \"https\": null}\n", "H2O_internal_security: False\n", "Python_version: 3.7.10 final\n", "-------------------------- -------------------------------" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "INFO: benchmark_utils.py: IEEE-CIS Fraud Detection\n", "INFO: benchmark_utils.py: (313060, 194)\n", "INFO: benchmark_utils.py: (27330, 71)\n", "INFO: benchmark_utils.py: (29527, 2)\n", "INFO: benchmark_utils.py: (27329, 72)\n", "INFO: benchmark_utils.py: 67\n", "INFO: benchmark_utils.py: ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1', 'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11', 'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160', 'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264', 'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02', 'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n", "Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%\n", "AutoML progress: |\n", 
"17:56:29.266: Project: AutoML_1_20220615_175629\n", "17:56:29.267: 5-fold cross-validation will be used.\n", "17:56:29.270: Setting stopping tolerance adaptively based on the training frame: 0.0017872537194430643\n", "17:56:29.271: Build control seed: 10\n", "17:56:29.274: training frame: Frame key: AutoML_1_20220615_175629_training_py_11_sid_9519 cols: 194 rows: 313060 chunks: 68 size: 137888486 checksum: -8673498857111412012\n", "17:56:29.274: validation frame: NULL\n", "17:56:29.274: leaderboard frame: NULL\n", "17:56:29.275: blending frame: NULL\n", "17:56:29.275: response column: EVENT_LABEL\n", "17:56:29.275: fold column: null\n", "17:56:29.275: weights column: null\n", "17:56:29.289: Loading execution steps: [{XGBoost : [def_2 (1g, 10w), def_1 (2g, 10w), def_3 (3g, 10w), grid_1 (4g, 90w), lr_search (6g, 30w)]}, {GLM : [def_1 (1g, 10w)]}, {DRF : [def_1 (2g, 10w), XRT (3g, 10w)]}, {GBM : [def_5 (1g, 10w), def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w), def_1 (3g, 10w), grid_1 (4g, 60w), lr_annealing (6g, 10w)]}, {DeepLearning : [def_1 (3g, 10w), grid_1 (4g, 30w), grid_2 (5g, 30w), grid_3 (5g, 30w)]}, {completion : [resume_best_grids (10g, 60w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w), best_of_family_2 (2g, 5w), best_of_family_3 (3g, 5w), best_of_family_4 (4g, 5w), best_of_family_5 (5g, 5w), all_2 (2g, 10w), all_3 (3g, 10w), all_4 (4g, 10w), all_5 (5g, 10w), monotonic (6g, 10w), best_of_family_gbm (6g, 10w), all_gbm (7g, 10w), best_of_family_xglm (8g, 10w), all_xglm (8g, 10w), best_of_family (10g, 10w), best_N (10g, 10w)]}]\n", "17:56:29.317: AutoML job created: 2022.06.15 17:56:29.241\n", "17:56:29.318: AutoML build started: 2022.06.15 17:56:29.317\n", "17:56:29.402: AutoML: starting XGBoost_1_AutoML_1_20220615_175629 model training\n", "\n", "████████████████\n", "18:10:12.338: New leader: XGBoost_1_AutoML_1_20220615_175629, auc: 0.9547716353973954\n", "18:10:12.344: AutoML: starting GLM_1_AutoML_1_20220615_175629 model training\n", "\n", "██\n", 
"18:12:16.9: AutoML: starting GBM_1_AutoML_1_20220615_175629 model training\n", "\n", "███\n", "18:15:33.456: New leader: GBM_1_AutoML_1_20220615_175629, auc: 0.9552574031901246\n", "18:15:33.476: AutoML: starting StackedEnsemble_BestOfFamily_1_AutoML_1_20220615_175629 model training\n", "\n", "\n", "18:15:37.718: New leader: StackedEnsemble_BestOfFamily_1_AutoML_1_20220615_175629, auc: 0.9629154248601777\n", "18:15:37.722: AutoML: starting XGBoost_2_AutoML_1_20220615_175629 model training\n", "\n", "███████\n", "18:22:03.32: AutoML: starting DRF_1_AutoML_1_20220615_175629 model training\n", "\n", "█\n", "18:23:30.486: AutoML: starting GBM_2_AutoML_1_20220615_175629 model training\n", "\n", "██\n", "18:25:25.304: AutoML: starting GBM_3_AutoML_1_20220615_175629 model training\n", "\n", "██\n", "18:27:17.200: AutoML: starting GBM_4_AutoML_1_20220615_175629 model training\n", "\n", "██\n", "18:29:30.748: AutoML: starting StackedEnsemble_BestOfFamily_2_AutoML_1_20220615_175629 model training\n", "\n", "\n", "18:29:34.124: New leader: StackedEnsemble_BestOfFamily_2_AutoML_1_20220615_175629, auc: 0.9629166327028895\n", "18:29:34.130: AutoML: starting StackedEnsemble_AllModels_1_AutoML_1_20220615_175629 model training\n", "\n", "\n", "18:29:37.696: New leader: StackedEnsemble_AllModels_1_AutoML_1_20220615_175629, auc: 0.9641251732621091\n", "18:29:37.699: AutoML: starting XGBoost_3_AutoML_1_20220615_175629 model training\n", "\n", "████\n", "18:33:45.401: AutoML: starting XRT_1_AutoML_1_20220615_175629 model training\n", "\n", "██\n", "18:35:01.402: AutoML: starting GBM_5_AutoML_1_20220615_175629 model training\n", "\n", "██\n", "18:37:04.114: AutoML: starting DeepLearning_1_AutoML_1_20220615_175629 model training\n", "\n", "████████\n", "18:45:23.506: AutoML: starting StackedEnsemble_BestOfFamily_3_AutoML_1_20220615_175629 model training\n", "\n", "\n", "18:45:26.972: AutoML: starting StackedEnsemble_AllModels_2_AutoML_1_20220615_175629 model training\n", "\n", "\n", 
"18:45:30.973: AutoML: starting XGBoost_grid_1_AutoML_1_20220615_175629 hyperparameter search\n", "\n", "██████\n", "18:50:41.172: AutoML: starting GBM_grid_1_AutoML_1_20220615_175629 hyperparameter search\n", "\n", "███\n", "18:54:01.180: AutoML: starting DeepLearning_grid_1_AutoML_1_20220615_175629 hyperparameter search\n", "\n", "███| (done) 100%\n", "\n", "18:56:30.999: Actual modeling steps: [{XGBoost : [def_2 (1g, 10w)]}, {GLM : [def_1 (1g, 10w)]}, {GBM : [def_5 (1g, 10w)]}, {StackedEnsemble : [best_of_family_1 (1g, 5w)]}, {XGBoost : [def_1 (2g, 10w)]}, {DRF : [def_1 (2g, 10w)]}, {GBM : [def_2 (2g, 10w), def_3 (2g, 10w), def_4 (2g, 10w)]}, {StackedEnsemble : [best_of_family_2 (2g, 5w), all_2 (2g, 10w)]}, {XGBoost : [def_3 (3g, 10w)]}, {DRF : [XRT (3g, 10w)]}, {GBM : [def_1 (3g, 10w)]}, {DeepLearning : [def_1 (3g, 10w)]}, {StackedEnsemble : [best_of_family_3 (3g, 5w), all_3 (3g, 10w)]}, {XGBoost : [grid_1 (4g, 90w)]}, {GBM : [grid_1 (4g, 60w)]}, {DeepLearning : [grid_1 (4g, 30w)]}]\n", "18:56:30.999: AutoML build stopped: 2022.06.15 18:56:30.999\n", "18:56:30.999: AutoML build done: built 16 models\n", "18:56:30.999: AutoML duration: 1:00:01.682\n", "\n", "stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO: benchmark_h2o.py: auc on test data: 0.8816474193300993\n", "INFO: benchmark_h2o.py: tpr@1%fpr on test data: 0.43807763401109057\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "H2O session _sid_9519 closed.\n" ] } ], "source": [ "run_h2o(dataset, BASE_PATH, time_limit=3600)" ] }, { "cell_type": "code", "execution_count": null, "id": "b0a81d69", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "70efd958", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_mxnet_latest_p37", "language": "python", "name": "conda_mxnet_latest_p37" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: scripts/reproducibility/label-noise/benchmark_experiments.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": null, "id": "c77e5eb5", "metadata": {}, "outputs": [], "source": [ "#! pip install humanize\n", "#! pip install catboost" ] }, { "cell_type": "markdown", "id": "f8bd366d", "metadata": {}, "source": [ "# Label noise\n", "\n", "\n", "## Problem statement \n", "Have some binary classification task, traditionally assume data of the form X,y\n", "\n", "In reality, some of the labels may be incorrect, distinguish\n", "```\n", "y - true label\n", "y* - observed, possibly incorrect label\n", "```\n", "\n", "This can obviously affect model training and validation. It would also affect the benchmarking process (comparing performance on noisy data doesn't tell you about performance on actual data).\n", "\n", "## Types of noise\n", "\n", "Can be completely independent:\n", "`p(y* != y | x, y) = p(y* != y)`\n", "\n", "class-dependent, depends on y:\n", "`p(y* != y | x, y) = p(y* != y | y)`\n", "\n", "feature-dependent, depends on x:\n", "`p(y* != y | x, y) = p(y* != y | x, y)`\n", "\n", "In fraud modeling, higher likelihood of `(y*, y) = (0, 1)` than reverse.\n", "(missed fraud, label maturity, intentional data poisoning, etc.)\n", "\n", "\"feature-dependent\" is probably most realistic in fraud but fewer removal techniques and also harder to synthetically generate. 
We will work with \"boundary-consistent\" noise, where examples closer to some decision boundary (the score from a model trained on clean data) are more likely to be mislabeled; this is implemented in scikit-clean.\n", "\n", "## Literature/packages\n", "\n", "Many methods in the literature to address this; can build loss functions that are robust to noise, can try to identify and filter (remove) or clean (flip label) examples identified as noisy.\n", "\n", "Some packages exist, including CleanLab and scikit-clean. Can also hand-code an ensemble method. Most of these are model-agnostic." ] }, { "cell_type": "markdown", "id": "9b172deb", "metadata": {}, "source": [ "## CleanLab\n", "\n", "well-established, state-of-the-art, open-source package with some theoretical guarantees\n", "\n", "score all examples with y* = 1, determine average score t_1\n", "now score all examples with y* = 0. Any that score above t_1 are marked as noise\n", "\n", "can wrap any (sklearn-compatible) model with this process. \n", "\n", "## scikit-clean \n", "\n", "library of several different approaches including filtering as well as noise generation. Is similarly designed to be model-agnostic but doesn't always do a great job (doesn't handle unencoded categorical features well). Some of its methods can also be *very* slow relative to others\n", "\n", "## micro-models\n", "\n", "slice up training data, train a model on each slice, let models vote on whether to remove data. 
Can use majority (more than half of models \"misclassify\" example), consensus (all models misclassify) or any other threshold.\n", "\n", "## experiment design\n", "\n", "take 7 of the datasets - `['ieeecis', 'ccfraud', 'fraudecom', 'sparknov', 'fakejob', 'vehicleloan', 'twitterbot']`\n", "* drop the IP and malurl datasets as they are difficult to work with \"out of the box\"\n", "* use numerical and categorical features, target-encode categorical features (drop text and enrichable features)\n", "\n", "add boundary-consistent noise `n` to training data (flipping both classes).\n", "\n", "values: `n in [0, 0.1, 0.2, 0.3, 0.4, 0.5]`\n", " \n", "target encoding is done after noise is added\n", " \n", "CatBoost used as base classifier in all cases (default settings apart from `iterations=100` and `verbose=False`)\n", "\n", "compare following methods for cleaning training data\n", "* baseline (no cleaning done)\n", "* CleanLab\n", "* scikit-clean MCS \n", "* micro-model majority voting (hand-built)\n", "* micro-model consensus voting (hand-built)\n", "\n", "measure AUC on (clean) test data\n", "\n", "repeat process 5 times for each experiment (start with clean data, add random noise, filter noise back out, train classifier, etc.), compute mean and std. dev of AUC for each\n", "\n", "CleanLab usually winds up being the best, but not uniformly. 
Baseline is sometimes the best for zero noise (as expected), and sometimes MCS or micro-model majority will come out ahead" ] }, { "cell_type": "code", "execution_count": null, "id": "846f161f", "metadata": {}, "outputs": [], "source": [ "# basic imports\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import warnings\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import humanize\n", "import pickle\n", "\n", "# basics from sklearn\n", "from sklearn.metrics import roc_auc_score\n", "from category_encoders.target_encoder import TargetEncoder\n", "\n", "# noise generation\n", "from skclean.simulate_noise import flip_labels_cc, BCNoise\n", "\n", "# base classifiers\n", "from catboost import CatBoostClassifier\n", "\n", "# cleaning methods/helpers\n", "from cleanlab.classification import CleanLearning\n", "from micro_models import MicroModelCleaner\n", "from skclean.pipeline import Pipeline\n", "from skclean.handlers import Filter\n", "from skclean.detectors import MCS\n", "\n", "# dataset loader\n", "from load_fdb_datasets import prepare_noisy_dataset, dataset_stats" ] }, { "cell_type": "code", "execution_count": null, "id": "85117ba5", "metadata": {}, "outputs": [], "source": [ "# wrapper definitions for the various types of cleaning methods we will use. 
\n", "# Each one wraps a model_class (in our case catboost, but could use xgboost, etc.)\n", "# the resulting model can then take noisy data in its .fit() method and clean it before training\n", "\n", "def baseline_model(model_class, params):\n", "    return model_class(**params)\n", "\n", "def cleanlab_model(model_class, params, pulearning=False):\n", "    if pulearning:\n", "        return CleanLearning(model_class(**params), pulearning=pulearning)\n", "    else:\n", "        return CleanLearning(model_class(**params))\n", "\n", "def micromodels(model_class, pulearning, num_clfs, threshold, params):\n", "    return MicroModelCleaner(model_class, pulearning=pulearning, num_clfs=num_clfs, threshold=threshold, **params)\n", "\n", "def skclean_MCS(model_class, params):\n", "    skclean_pipeline = Pipeline([\n", "        ('detector', MCS(classifier=model_class(**params))),\n", "        ('handler', Filter(model_class(**params)))\n", "    ])\n", "    return skclean_pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "bd6bcd08", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# some high-level parameters, \n", "# the number of runs for each experiment (determine mean/std. 
dev)\n", "num_samples = 5\n", "# whether to use target encoding on categorical features\n", "target_encoding = True\n", "# whether to save intermediate results to disk (in case of failure etc.)\n", "save_results = True\n", "\n", "# we will be creating a lot of classifiers, let's use the same parameters for each\n", "model_config_dict = {\n", "    'catboost': {\n", "        'model_class': CatBoostClassifier,\n", "        'default_params': {\n", "            'verbose': False,\n", "            'iterations': 100\n", "        }\n", "    }\n", "}\n", "\n", "# all of our experiments will use catboost and boundary-consistent noise\n", "base_model_type = 'catboost'\n", "noise_type = 'boundary-consistent'\n", "model_class = model_config_dict[base_model_type]['model_class']\n", "\n", "# the set of experimental parameters, we will iterate over all these datasets\n", "keys = ['ieeecis', 'sparknov', 'ccfraud', 'fraudecom', 'fakejob', 'vehicleloan', 'twitterbot']\n", "# all these cleaning methods\n", "clf_types = ['baseline', 'skclean_MCS', 'cleanlab', 'micromodels_majority', 'micromodels_consensus']\n", "# all these noise levels\n", "noise_amounts = [0, 0.1, 0.2, 0.3, 0.4, 0.5]\n", "# and we will let cleaning methods know that noise can happen for either class\n", "pulearning = None\n", "\n", "# a little bit of setup for saving intermediate results to disk\n", "# (the paths are defined unconditionally because the main loop builds them even when save_results is False)\n", "results_file_path = './results'\n", "results_file_name = '{}_noise_benchmark_results.pkl'\n", "if save_results:\n", "    # create the results directory if it does not exist yet\n", "    os.makedirs(results_file_path, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "ef2e3bd8", "metadata": {}, "outputs": [], "source": [ "# initialize results dict, we will index results by dataset/noise_amount/cleaning_method\n", "results = {}\n", "\n", "# main experimental loop \n", "for key in keys:\n", "    # check to see if we have already run this experiment and saved to disk\n", "    full_result_path = os.path.join(results_file_path, results_file_name.format(key))\n", "    if 
os.path.exists(full_result_path) and save_results:\n", " with open(full_result_path, 'rb') as results_file:\n", " results[key] = pickle.load(results_file)\n", " # otherwise start from scratch\n", " else:\n", " # initialize sub-results\n", " results[key] = {}\n", " model_params = model_config_dict[base_model_type]['default_params']\n", " \n", " for noise_amount in noise_amounts:\n", " print(f\"\\n =={key}_{noise_amount}== \\n\")\n", " \n", " # initialize sub-sub-results\n", " results[key][noise_amount] = {}\n", "\n", " # these are the cleaning classifiers we will use\n", " clfs = {\n", " 'baseline': baseline_model(model_class, model_params),\n", " 'skclean_MCS': skclean_MCS(model_class, model_params),\n", " 'cleanlab': cleanlab_model(model_class, model_params, pulearning),\n", " 'micromodels_majority': micromodels(model_class, pulearning=pulearning,\n", " num_clfs=8, threshold=0.5, params=model_params),\n", " 'micromodels_consensus': micromodels(model_class, pulearning=pulearning,\n", " num_clfs=8, threshold=1, params=model_params),\n", "\n", " }\n", " print('generating datasets')\n", " # preparing a dataset has some overhead, we want to do this five times for each dataset/noise level\n", " # we will save a little bit of time by doing this in advance and using same set of five\n", " # for each cleaning method\n", " datasets = [prepare_noisy_dataset(key, noise_type, noise_amount, split=1, target_encoding=target_encoding) \n", " for i in range(num_samples)]\n", " \n", " # now for each cleaning method, train a \"clean\" model on noisy training data, then determine\n", " # auc on clean test data and record the results. Do this five times for each cleaning method\n", " # to determine mean/std. 
dev\n", " for clf_type in clfs:\n", " print(f\"testing {clf_type}\")\n", " auc = []\n", " try:\n", " for i in range(num_samples):\n", " # grab the dataset we need for this run and extract metadata and subsets\n", " dataset = datasets[i]\n", " features, cat_features, label = dataset['features'], dataset['cat_features'], dataset['label']\n", " train, test = dataset['train'], dataset['test']\n", " X_tr, y_tr = train[features], train[label].values.reshape(-1)\n", " X_ts, y_ts = test[features], test[label].values.reshape(-1)\n", " clf = clfs[clf_type]\n", " # fit the \"clean\" classifier on noisy training data\n", " clf.fit(X_tr, y_tr)\n", " # make predictions on clean test data and calculate AUC\n", " y_pred = clf.predict_proba(X_ts)[:, 1]\n", " auc.append(roc_auc_score(y_ts, y_pred))\n", " print(f\"{clf_type} auc: {auc}\", end=\"\\r\", flush=True)\n", " # store mean/std. dev for this run in the results dict\n", " results[key][noise_amount][clf_type] = (np.mean(auc), np.std(auc), auc)\n", " print('\\n{} auc: {:.2f} ± {:.4f}\\n'.format(clf_type,\n", " *results[key][noise_amount][clf_type][:2]))\n", " # if this run failed for some reason, handle it gracefully\n", " except Exception as e:\n", " results[key][noise_amount][clf_type] = (0, 0, [0] * num_samples)\n", " print(e)\n", " \n", " # if we are saving intermediate results to disk, do so now\n", " if save_results:\n", " with open(full_result_path, 'wb') as results_file:\n", " pickle.dump(results[key], results_file)" ] }, { "cell_type": "code", "execution_count": null, "id": "7a8a4509", "metadata": { "scrolled": false }, "outputs": [], "source": [ "# a couple of helper functions to analyze/summarize results\n", "\n", "def highlight_max(s, props=''):\n", " return np.where(s == np.nanmax(s.values), props, '')\n", "\n", "def record_places(places, scores):\n", " scores = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}\n", " last_score, last_stddev, last_placement = (2, 0, 1)\n", " for i, clf 
in enumerate(scores.keys()): \n", " if scores[clf][0] + scores[clf][1] >= last_score:\n", " placement = last_placement \n", " else:\n", " placement = i+1\n", " last_score, last_stddev = scores[clf] \n", " last_placement = i+1\n", " places[clf][placement] += 1 " ] }, { "cell_type": "code", "execution_count": null, "id": "7fa49c8e", "metadata": { "scrolled": false }, "outputs": [], "source": [ "# create dataframe of results for each experiment, also process results into dict for keeping track of \n", "# 1st/2nd/etc. place, as well as a dict for plotting later\n", "\n", "places = {clf:{p:0 for p in range(1,len(clf_types)+1)} for clf in clf_types}\n", "plots = {key:{clf:[[],[]] for clf in clf_types} for key in keys}\n", " \n", "for key in results.keys():\n", " print(f\"\\n =={key}==\\n\")\n", " rows = pd.Index([clf_type for clf_type in clf_types])\n", " columns = pd.MultiIndex.from_product([noise_amounts, ['mean','std_dev']], names=['type 2 noise', 'auc'])\n", " df = pd.DataFrame(index=rows, columns=columns)\n", " \n", " for noise_amount in noise_amounts:\n", " scores = {}\n", " for clf_type in clf_types:\n", " auc = results[key][noise_amount][clf_type] \n", " df.loc[clf_type, (noise_amount, 'mean')] = auc[0] \n", " df.loc[clf_type, (noise_amount, 'std_dev')] = auc[1]\n", " scores[clf_type] = (auc[0], auc[1])\n", "\n", " plots[key][clf_type][0].append(noise_amount)\n", " plots[key][clf_type][1].append(auc[0])\n", " record_places(places, scores)\n", " display(df.style.set_caption(f\"{key}\")\n", " .format({(n,'mean'): \"{:.2f}\" for n in noise_amounts})\n", " .format({(n,'std_dev'): \"{:.4f}\" for n in noise_amounts})\n", " .apply(highlight_max, props='font-weight:bold;background-color:lightblue', axis=0,\n", " subset=[[n,'mean'] for n in noise_amounts]))" ] }, { "cell_type": "code", "execution_count": null, "id": "8cb8dbd8", "metadata": {}, "outputs": [], "source": [ "# produce \"race results\" (i.e. how many first place, second place, etc. 
finishes)\n", "\n", "race_results = pd.DataFrame.from_dict(places).rename(index=lambda x : humanize.ordinal(x))\n", "race_results['totals'] = race_results.sum(axis=1)\n", "display(race_results)\n", "print(race_results.to_latex())" ] }, { "cell_type": "code", "execution_count": null, "id": "602877ee", "metadata": { "scrolled": false }, "outputs": [], "source": [ "# finally, we can plot the results of individual experiments\n", "\n", "colors = ['black','purple','green','red','orange']\n", "linestyles = ['-','--',':']\n", "ylims = {\n", " 'boundary-consistent': {\n", " 'ieeecis':[0.5,0.9],\n", " 'sparknov':[0.5,1],\n", " 'ccfraud':[0.25,1],\n", " 'fraudecom':[0.48,0.52],\n", " 'fakejob':[0.5,1],\n", " 'vehicleloan':[0.57,0.66],\n", " 'twitterbot':[0.7,0.95]\n", " },\n", " 'class-conditional': {\n", " 'ieeecis':[0.7,0.9],\n", " 'sparknov':[0.7,1],\n", " 'ccfraud':[0.8,1],\n", " 'fraudecom':[0.48,0.52],\n", " 'fakejob':[0.7,1],\n", " 'vehicleloan':[0.5,0.7],\n", " 'twitterbot':[0.8,0.95]\n", " }\n", "}\n", "\n", "x_labels = {\n", " 'boundary-consistent':'Boundary-Consistent Noise Level',\n", " 'class-conditional':'Class-Conditional Type 2 Noise Level'\n", "}\n", "\n", "legends = {\n", " 'boundary-consistent':'Cleaning Method',\n", " 'class-conditional':'Type 1 Noise, Cleaning Method'\n", "}\n", "\n", "# failed runs are stored with a mean AUC of 0; map them to None so they are left out of the plot\n", "def fix_failures(x):\n", " if x == 0:\n", " return None\n", " else:\n", " return x\n", "\n", "def labels(noise_type, noise_amount, clf_type):\n", " if noise_type == 'boundary-consistent':\n", " return '{}'.format(clf_type)\n", " elif noise_type == 'class-conditional':\n", " return '{}, {}'.format(noise_amount, clf_type)\n", "\n", "for key in results.keys():\n", " plt.figure(figsize=(10,10))\n", " \n", " for c, clf_type in enumerate(clf_types):\n", " a = plots[key][clf_type]\n", " plt.plot(a[0],[fix_failures(v) for v in a[1]],\n", " label=labels(noise_type, noise_amount, clf_type),\n", " color=colors[c],\n", " linestyle=linestyles[0])\n", " plt.title(key)\n", " 
plt.xlabel(x_labels[noise_type])\n", " plt.ylabel('Test AUC')\n", " plt.ylim(ylims[noise_type][key])\n", " plt.legend(title=legends[noise_type])\n", " plt.savefig(f\"./figures/label_noise_{key}.png\")\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "b891c49a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: scripts/reproducibility/label-noise/feature_dict.py ================================================ feature_dict = { 'ieeecis': { 'transactionamt': 'numeric', 'productcd': 'categorical', 'card1': 'numeric', 'card2': 'numeric', 'card3': 'numeric', 'card5': 'numeric', 'card6': 'categorical', 'addr1': 'numeric', 'dist1': 'numeric', 'p_emaildomain': 'categorical', 'r_emaildomain': 'categorical', 'c1': 'numeric', 'c2': 'numeric', 'c4': 'numeric', 'c5': 'numeric', 'c6': 'numeric', 'c7': 'numeric', 'c8': 'numeric', 'c9': 'numeric', 'c10': 'numeric', 'c11': 'numeric', 'c12': 'numeric', 'c13': 'numeric', 'c14': 'numeric', 'v62': 'numeric', 'v70': 'numeric', 'v76': 'numeric', 'v78': 'numeric', 'v82': 'numeric', 'v91': 'numeric', 'v127': 'numeric', 'v130': 'numeric', 'v139': 'numeric', 'v160': 'numeric', 'v165': 'numeric', 'v187': 'numeric', 'v203': 'numeric', 'v207': 'numeric', 'v209': 'numeric', 'v210': 'numeric', 'v221': 'numeric', 'v234': 'numeric', 'v257': 'numeric', 'v258': 'numeric', 'v261': 'numeric', 'v264': 'numeric', 'v266': 'numeric', 'v267': 'numeric', 'v271': 'numeric', 'v274': 'numeric', 'v277': 'numeric', 'v283': 'numeric', 'v285': 'numeric', 'v289': 'numeric', 'v291': 'numeric', 'v294': 
'numeric', 'id_01': 'numeric', 'id_02': 'numeric', 'id_05': 'numeric', 'id_06': 'numeric', 'id_09': 'numeric', 'id_13': 'numeric', 'id_17': 'numeric', 'id_19': 'numeric', 'id_20': 'numeric', 'devicetype': 'categorical', 'deviceinfo': 'categorical' }, 'ccfraud': { 'v1': 'numeric', 'v2': 'numeric', 'v3': 'numeric', 'v4': 'numeric', 'v5': 'numeric', 'v6': 'numeric', 'v7': 'numeric', 'v8': 'numeric', 'v9': 'numeric', 'v10': 'numeric', 'v11': 'numeric', 'v12': 'numeric', 'v13': 'numeric', 'v14': 'numeric', 'v15': 'numeric', 'v16': 'numeric', 'v17': 'numeric', 'v18': 'numeric', 'v19': 'numeric', 'v20': 'numeric', 'v21': 'numeric', 'v22': 'numeric', 'v23': 'numeric', 'v24': 'numeric', 'v25': 'numeric', 'v26': 'numeric', 'v27': 'numeric', 'v28': 'numeric', 'amount': 'numeric' }, 'fraudecom': { 'purchase_value': 'numeric', 'source': 'categorical', 'browser': 'categorical', 'age': 'numeric', 'ip_address': 'enrichable', 'time_since_signup': 'numeric' }, 'sparknov': { 'cc_num': 'categorical', 'category': 'categorical', 'amt': 'numeric', 'first': 'categorical', 'last': 'categorical', 'gender': 'categorical', 'street': 'categorical', 'city': 'categorical', 'state': 'categorical', 'zip': 'categorical', 'lat': 'numeric', 'long': 'numeric', 'city_pop': 'numeric', 'job': 'categorical', 'dob': 'text', 'merch_lat': 'numeric', 'merch_long': 'numeric' }, 'twitterbot': { 'created_at' : 'text', 'default_profile': 'categorical', 'default_profile_image': 'categorical', 'description': 'text', 'favourites_count': 'numeric', 'followers_count': 'numeric', 'friends_count': 'numeric', 'geo_enabled': 'categorical', 'lang': 'categorical', 'location': 'categorical', 'profile_background_image_url': 'text', 'profile_image_url': 'text', 'screen_name': 'text', 'statuses_count': 'numeric', 'verified': 'categorical', 'average_tweets_per_day': 'numeric', 'account_age_days': 'numeric' }, 'fakejob': { 'title': 'categorical', 'location': 'categorical', 'department': 'categorical', 'salary_range': 'text', 
'company_profile': 'text', 'description': 'text', 'requirements': 'text', 'benefits': 'text', 'telecommuting': 'categorical', 'has_company_logo': 'categorical', 'has_questions': 'categorical', 'employment_type': 'categorical', 'required_experience': 'categorical', 'required_education': 'categorical', 'industry': 'categorical', 'function': 'categorical' }, 'vehicleloan': { 'disbursed_amount': 'numeric', 'asset_cost': 'numeric', 'ltv': 'numeric', 'branch_id': 'categorical', 'supplier_id': 'categorical', 'manufacturer_id': 'categorical', 'current_pincode_id': 'categorical', 'date_of_birth': 'text', 'employment_type': 'categorical', 'state_id': 'categorical', 'employee_code_id': 'categorical', 'mobileno_avl_flag': 'categorical', 'aadhar_flag': 'categorical', 'pan_flag': 'categorical', 'voterid_flag': 'categorical', 'driving_flag': 'categorical', 'passport_flag': 'categorical', 'perform_cns_score': 'numeric', 'perform_cns_score_description': 'categorical', 'pri_no_of_accts': 'numeric', 'pri_active_accts': 'numeric', 'pri_overdue_accts': 'numeric', 'pri_current_balance': 'numeric', 'pri_sanctioned_amount': 'numeric', 'pri_disbursed_amount': 'numeric', 'sec_no_of_accts': 'numeric', 'sec_active_accts': 'numeric', 'sec_overdue_accts': 'numeric', 'sec_current_balance': 'numeric', 'sec_sanctioned_amount': 'numeric', 'sec_disbursed_amount': 'numeric', 'primary_instal_amt': 'numeric', 'sec_instal_amt': 'numeric', 'new_accts_in_last_six_months': 'numeric', 'delinquent_accts_in_last_six_months': 'numeric', 'average_acct_age': 'text', 'credit_history_length': 'text', 'no_of_inquiries': 'numeric' } } ================================================ FILE: scripts/reproducibility/label-noise/load_fdb_datasets.py ================================================ import os import re import json import pandas as pd import numpy as np import warnings from datetime import datetime from category_encoders.target_encoder import TargetEncoder from skclean.simulate_noise import flip_labels_cc, 
BCNoise from fdb.datasets import FraudDatasetBenchmark import feature_dict DATASET_PATH = './data/dataset.csv' METADATA_PATH = './data/feature_metadata.json' FD = feature_dict.feature_dict def noise_amount(df): return df[df.noise == 1].shape[0] def noise_rate(df): if df.shape[0] > 0: return noise_amount(df)/df.shape[0] else: return None def type_1_noise_amount(df): # examples with true label 0, mislabeled as 1 # here 'df.label' is the observed label, not the true one return df[(df.label==1) & (df.noise == 1)].shape[0] def type_2_noise_amount(df): # examples with true label 1, mislabeled as 0 # here 'df.label' is the observed label, not the true one return df[(df.label==0) & (df.noise == 1)].shape[0] def actual_legit_amount(df): return df[(df.label == 0) | (df.noise == 1)].shape[0] def observed_legit_amount(df): return df[df.label == 0].shape[0] def actual_fraud_amount(df): return df[((df.label == 1) & (df.noise == 0)) | ((df.label == 0) & (df.noise == 1))].shape[0] def observed_fraud_amount(df): return df[df.label == 1].shape[0] def actual_fraud_rate(df): if df.shape[0] > 0: return actual_fraud_amount(df)/df.shape[0] else: return None def observed_fraud_rate(df): if df.shape[0] > 0: return observed_fraud_amount(df)/df.shape[0] else: return None def type_1_noise_rate(df): if df.shape[0] > 0: return type_1_noise_amount(df)/actual_legit_amount(df) else: return None def type_2_noise_rate(df): if df.shape[0] > 0: return type_2_noise_amount(df)/actual_fraud_amount(df) else: return None def prepare_data_fdb(key, drop_text_enr_features=True): """ main function, gets datasets from FDB and then does some preprocessing/cleaning so they are suitable for modeling, returns data and metadata inputs: key - the FDB dataset to load drop_text_enr_features - whether we want to drop text/enrichable features this returns df - full pandas dataframe containing features, labels and metadata this includes training and test data, with a 'dataset' column to indicate which all of these 
datasets have a timestamp column (even if it is "fake") and by default data will be sorted by this column. All test > train w.r.t. this timestamp features - list of feature names cat_features - list of categorical feature names (subset of features) label - name of label column record_id - name of unique id column """ obj = FraudDatasetBenchmark(key=key) print(obj.key) # extract training and testing data (and test labels) from the return object # sort training data by event timestamp train_df = obj.train.sort_values(by='EVENT_TIMESTAMP',ignore_index=True) test_df = obj.test.reset_index(drop=True) test_labels = obj.test_labels.reset_index(drop=True) # define metadata and label column names metadata = ['EVENT_LABEL', 'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID', 'label', 'LABEL_TIMESTAMP', 'noise', 'dataset'] label = ['label'] # we maintain a feature dictionary in another file, this helps us determine which are categorical, numerical, etc. feature_dict = FD[key] raw_features = feature_dict.keys() num_features = [f for f in raw_features if feature_dict[f] == 'numeric'] cat_features = [f for f in raw_features if feature_dict[f] == 'categorical'] txt_features = [f for f in raw_features if feature_dict[f] == 'text'] enr_features = [f for f in raw_features if feature_dict[f] == 'enrichable'] # add / rename labels train_df.rename({'EVENT_LABEL':'label'}, axis=1, inplace=True) test_df['label'] = test_labels['EVENT_LABEL'] if key == 'twitterbot': train_df.loc[train_df.label == 'bot', 'label'] = 1 test_df.loc[test_df.label == 'bot', 'label'] = 1 train_df.loc[train_df.label == 'human', 'label'] = 0 test_df.loc[test_df.label == 'human', 'label'] = 0 # put train / test into single dataframe, create a 'dataset' column to keep track train_df['dataset'] = 'train' test_df['dataset'] = 'test' # create noise column - we won't generate any noise now but it may be useful to have (can also be ignored) train_df['noise'] = 0 test_df['noise'] = 0 # concatenate train/test into 
single dataframe # (remember we have 'dataset' column to separate them again if needed) df = pd.concat([train_df, test_df], axis=0, ignore_index=True) # there are a few date columns that are timestamps, we convert those to epoch # the new values are put into new columns, those column names are added to the numerical features if key == 'twitterbot': df['eng_created_at'] = df['created_at'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp()) num_features.append('eng_created_at') if key == 'sparknov': df['eng_dob'] = df['dob'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d').timestamp()) num_features.append('eng_dob') # fakejob has a salary range column, e.g. "10000 - 20000" that can be converted into two numerical columns if key == 'fakejob': def convert(x): r = re.search(r"([0-9]*)-([0-9]*)",str(x)) try: m, M = r.group(1), r.group(2) if m == '' or M == '': m, M = 0,0 except: m, M = 0,0 return m,M df['salary_min'], df['salary_max'] = zip(*df['salary_range'].map(convert)) num_features = num_features + ['salary_min','salary_max'] # vehicleloan has a timestamp column that we convert to epoch # it also has "account age" and "credit history" length cols # in form "Xyrs Ymon" that can be converted to numeric if key == 'vehicleloan': df['eng_dob'] = df['date_of_birth'].apply(lambda x : datetime.strptime(x, '%d-%m-%Y').timestamp()) def convert(x): r = re.search(r"([0-9]*)yrs ([0-9]*)mon", x) try: age = 12*float(r.group(1)) + float(r.group(2)) except: age = 0 return age df['eng_average_acct_age'] = df['average_acct_age'].apply(convert) df['eng_credit_history_length'] = df['credit_history_length'].apply(convert) num_features = num_features + ['eng_dob','eng_average_acct_age','eng_credit_history_length'] # by default we will drop any remaining text or enrichable (IP address) features as we won't use them # but you can pass in False for this if they are of interest if drop_text_enr_features: df.drop(txt_features + enr_features, axis=1, inplace=True) features 
= num_features + cat_features # cast all numeric features to float just in case they aren't for feature in num_features: df[feature] = df[feature].astype('float64') df[feature].fillna(0, inplace=True) # cast all categorical features to str in case they aren't for feature in cat_features: df[feature] = df[feature].astype(str) df[feature].fillna('', inplace=True) # rename the timestamp column df.rename({'EVENT_TIMESTAMP':'creation_date'}, axis=1, inplace=True) # cast the label to int just to be sure df['label'] = df['label'].astype('int') # name of unique id column will always be EVENT_ID record_id = 'EVENT_ID' if drop_text_enr_features: return df, features, cat_features, label, record_id else: return df, features, cat_features, txt_features, enr_features, label, record_id def add_noise(df, noise_type, noise_amount, *, time_index=None, features=None, cat_features=None, label=None): if noise_type not in ['random', 'time-dependent', 'boundary-consistent']: raise(Exception('Invalid Noise Type')) # if we want time-dependent noise it will be useful to convert timestamps into epoch def convert_to_millis(x): try: m = datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').timestamp() except: m = datetime.strptime(x, '%Y-%m-%d %H:%M:%S').timestamp() return m # random noise can be class-conditional in both directions (other types of noise cannot) # if noise_amount is passed in as [r,s] we can flip labels in both directions: # r is percent of 0s flipped to 1s # s is percent of 1s flipped to 0s # for random noise, if noise_amount is a single number, assume it is s, and that r=0 # (i.e. 
class-conditional noise where only 1s get flipped to 0s) if isinstance(noise_amount, (tuple, list)): if noise_type != 'random': raise Exception('For time-dependent and boundary-consistent noise, ' 'only a single value is allowed for noise_amount') r = noise_amount[0] s = noise_amount[1] else: r = 0 s = noise_amount # we will add noise to a *copy* of the dataframe df_copy = df.copy() if noise_type == 'time-dependent': df_copy['event_millis'] = df_copy[time_index].apply(convert_to_millis) df_copy['event_millis'] = df_copy['event_millis'] - df_copy['event_millis'].min() mislabel = df_copy[(df_copy.noise == 0) & (df_copy.label == 1)].sample(frac = s, weights=df_copy['event_millis']).index df_copy.loc[mislabel,'noise'] = 1 df_copy.loc[mislabel,'label'] = 0 else: if noise_type == 'boundary-consistent': from catboost import CatBoostClassifier warnings.filterwarnings("ignore", category=FutureWarning) target_encoder = TargetEncoder(cols=cat_features) reshaped_y = df_copy[label].values.reshape(df_copy[label].shape[0],) X = target_encoder.fit_transform(df_copy[features], reshaped_y) clf = CatBoostClassifier(verbose=False) clf.fit(X, reshaped_y) _, noisy_labels = BCNoise(clf, noise_level=s).simulate_noise(X, reshaped_y) else: lcm = np.array([[1-r,r],[s,1-s]]) noisy_labels = flip_labels_cc(df_copy.label,lcm) idx = (df_copy.label != noisy_labels) df_copy.loc[idx,'noise'] = 1 df_copy['label'] = noisy_labels return df_copy def train_valid_split(df, split=0.7, shuffle=True, sort_key='creation_date'): if shuffle: df = df.sample(frac=1).reset_index(drop=True) else: df = df.sort_values(by=sort_key, ignore_index=True) train_idx = int(round(split*df.shape[0])) train = df[:train_idx].reset_index(drop=True) valid = df[train_idx:].reset_index(drop=True) return train, valid def prepare_noisy_dataset(key, noise_type, noise_amount, split=0.7, shuffle=True, sort_key='creation_date', target_encoding=False): """ this function can be used to fetch datasets from FDB, 
starts by calling prepare_data_fdb and then adding noise input: key - name of FDB dataset noise_type - what type of noise to add noise_amount - how much noise to add split - training/validation split shuffle - whether or not to shuffle or sort before doing train/valid split sort_key - key to use to sort for train/valid split as well as weight for time-dependent noise """ # start by getting clean dataset df, features, cat_features, label, record_id = prepare_data_fdb(key) if noise_type == 'boundary-consistent': train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, time_index=sort_key, features=features, cat_features=cat_features, label=label) else: train_and_valid = add_noise(df[df.dataset == 'train'], noise_type, noise_amount, time_index=sort_key) train, valid = train_valid_split(train_and_valid, split, shuffle=shuffle, sort_key=sort_key) test = df[df.dataset == 'test'].reset_index(drop=True) train = train[features + ['noise'] + label] valid = valid[features + ['noise'] + label] test = test[features + ['noise'] + label] if target_encoding: warnings.filterwarnings("ignore", category=FutureWarning) target_encoder = TargetEncoder(cols=cat_features) reshaped_y = train[label].values.reshape(train[label].shape[0],) train.loc[:, features] = target_encoder.fit_transform(train[features], reshaped_y) valid.loc[:, features] = target_encoder.transform(valid[features]) test.loc[:, features] = target_encoder.transform(test[features]) cat_features = None dataset = { 'description': f"{key} dataset with noise type: {noise_type}, noise amount: {noise_amount} ", 'features':features, 'cat_features':cat_features, 'label':label, 'record_id':record_id, 'train':train, 'valid':valid, 'test':test, 'noise':(noise_rate(train), noise_rate(valid), noise_rate(test)), 'fraud_level':(actual_fraud_rate(train), actual_fraud_rate(valid), actual_fraud_rate(test)), 'observed_fraud_level':(observed_fraud_rate(train),observed_fraud_rate(valid),observed_fraud_rate(test)), 
'type_1_noise_rate':(type_1_noise_rate(train),type_1_noise_rate(valid),type_1_noise_rate(test)), 'type_2_noise_rate':(type_2_noise_rate(train),type_2_noise_rate(valid),type_2_noise_rate(test)) } return dataset def dataset_stats(dataset): noise = dataset['noise'] fraud_level = dataset['fraud_level'] observed_fraud_level = dataset['observed_fraud_level'] type_1_noise_rate = dataset['type_1_noise_rate'] type_2_noise_rate = dataset['type_2_noise_rate'] stats = list(zip(['train','valid','test'],noise,type_1_noise_rate,type_2_noise_rate,fraud_level,observed_fraud_level)) print(dataset['description']) for stat in stats: print('{} - total noise rate: {:.3f}, type 1 noise rate: {:.3f}, type 2 noise rate: {:.3f},\n' '(actual) fraud rate: {:.3f}, observed fraud rate: {:.3f}'.format(*stat)) ================================================ FILE: scripts/reproducibility/label-noise/micro_models.py ================================================ import logging import pandas as pd import numpy as np class MicroModelError(Exception): """ basic exception type for micro-model specific errors """ def __init__(self, error_message): # pass the message to Exception so it shows up in tracebacks and str(e) super().__init__(error_message) logging.error(error_message) class MicroModel: """ Basic wrapper for the model to be used in ensemble noise removal, ModelClass can be anything that implements fit and predict_proba. Mainly used by MicroModelEnsemble, users will probably not call this directly """ def __init__(self, ModelClass, *args, **kwargs): """ initialization of the class, ModelClass should be a *class* not an object e.g. 
CatBoostClassifier, not CatBoostClassifier() """ self.clf = ModelClass(*args, **kwargs) self.thresh = None def set_thresh(self, thresh): # can set a threshold to be used in model predictions self.thresh = thresh def fit(self, x, y, *args, **kwargs): # pass-through method to call model.fit() self.clf.fit(x, y.values.ravel(), *args, **kwargs) def predict_proba(self, x, *args, **kwargs): # pass-through method to call model.predict_proba() if 'predict_proba' in dir(self.clf): return self.clf.predict_proba(x, *args, **kwargs) else: raise (MicroModelError('ModelClass must implement predict_proba')) def predict(self, x): # make predictions, using either defined threshold (if set) or default value of 0.5 if self.thresh is not None: t = self.thresh else: t = 0.5 scores = self.predict_proba(x)[:, 1] preds = [int(s > t) for s in scores] return scores, preds class MicroModelEnsemble: """ Ensemble of micro-models used to remove noise """ def __init__(self, ModelClass, num_clfs=16, score_type='preds_avg', *args, **kwargs): """ initialization of the class, ModelClass should be a *class* not an object e.g. CatBoostClassifier, not CatBoostClassifier() params: ModelClass - base class to use, needs to implement fit and predict_proba num_clfs - number of classifiers to use in cleaning ensemble score_type - means of computing anomaly score from micro-model scores args/kwargs - any other parameters to pass to model constructor, e.g. 
cat_features or iterations for CatBoost """ self.score_type = score_type if type(num_clfs) is not int or num_clfs <= 0: raise (MicroModelError('num_clfs must be a positive integer')) self.ModelClass = ModelClass # one classifier that will be trained over entire dataset self.big_clf = MicroModel(ModelClass=ModelClass, *args, **kwargs) # micro-models to later be trained over slices self.num_clfs = num_clfs self.clfs = [] for i in range(num_clfs): self.clfs.append(MicroModel(ModelClass=ModelClass, *args, **kwargs)) self.thresholds = {} def fit(self, x, y, *args, **kwargs): # assumption that data is already shuffled or sorted (by date or other appropriate key) # according to the usecase if not isinstance(y, pd.DataFrame): y = pd.DataFrame(y) # fit one classifier on all the data self.big_clf.fit(x, y, *args, **kwargs) # now fit individual models on slices of data stride = round(x.shape[0] / self.num_clfs) for i, clf in enumerate(self.clfs): idx = slice(i * stride, min((i + 1) * stride, x.shape[0])) x_i = x.iloc[idx, :] y_i = y.iloc[idx, :] clf.fit(x_i, y_i, *args, **kwargs) def predict_proba(self, x, *args, **kwargs): # output is the mean of the (binary) predictions of all models in the ensemble # e.g. 
the fraction of models that voted 1 for the example results = pd.DataFrame(index=np.arange(x.shape[0])) if self.score_type == 'preds_avg': for i, clf in enumerate(self.clfs): _, results[i] = clf.predict(x, *args, **kwargs) elif self.score_type == 'score_avg': for i, clf in enumerate(self.clfs): results[i] = clf.predict_proba(x, *args, **kwargs)[:, 1] scores = results.mean(axis=1, numeric_only=True) return scores def predict(self, x, threshold=0.5, *args, **kwargs): # compare output of predict_proba to a threshold in order to make a binary prediction, default is 0.5 scores = self.predict_proba(x) preds = np.array([int(s >= threshold) for s in scores]) return scores, preds def filter_noise(self, x, y, pulearning=True, threshold=0.5): # compare ensemble predictions to observed labels and return the examples that are NOT considered noise # i.e. this is noise REMOVAL # pulearning=True means a class-conditional assumption is being made, # there are no examples of true 0s mislabeled as 1s scores, susp = self.predict(x, threshold) if pulearning: conf = ((y == 1) | ((y == 0) & (susp == 0))) else: conf = (((y == 1) & (scores > 1 - threshold)) | ((y == 0) & (scores < threshold))) return x[conf].reset_index(drop=True), y[conf] def clean_noise(self, x, y, pulearning=True, threshold=0.5): # compare ensemble predictions to observed labels and return all examples with corrected labels # i.e. 
this is noise CLEANING # pulearning=True means a class-conditional assumption is being made, # there are no examples of true 0s mislabeled as 1s x = x.copy() y = y.copy() _, susp = self.predict(x, threshold) # flip all the probable 1s to actual 1s probable_1 = (y == 0) & (susp == 1) y[probable_1] = 1 if not pulearning: # if there are both types of noise, flip probable 0s to actual 0s probable_0 = (y == 1) & (susp == 0) y[probable_0] = 0 return x, y class MicroModelCleaner: """ This class performs the entire model training process end-to-end - given a dataset it will first train an ensemble, then remove noise, then train a final model on the clean data """ def __init__(self, ModelClass, strategy='filter', pulearning=True, num_clfs=16, threshold=0.5, *args, **kwargs): """ initialization of the class, ModelClass should be a *class* not an object e.g. CatBoostClassifier, not CatBoostClassifier() params: ModelClass - base class to use, needs to implement fit and predict_proba strategy - whether to remove noise ('filter') or flip labels ('clean') pulearning - class-conditional assumption, if True assume there are no true 0s mislabeled as 1s num_clfs - number of classifiers to use in cleaning ensemble threshold - fraction of classifiers that have to vote to remove noise (0.5 is majority voting) args/kwargs - any other parameters to pass to model constructor, e.g. 
        cat_features or iterations for CatBoost
        """
        self.detector = MicroModelEnsemble(ModelClass, num_clfs, *args, **kwargs)
        self.clf = ModelClass(*args, **kwargs)
        if strategy.lower() not in ['filter', 'clean']:
            raise (MicroModelError('strategy must be filter or clean'))
        self.strategy = strategy.lower()
        self.pulearning = pulearning
        self.threshold = threshold

    def fit(self, x, y, *args, **kwargs):
        # first train the Ensemble to deal with the noise
        self.detector.fit(x, y, *args, **kwargs)
        if self.strategy == 'filter':
            x_clean, y_clean = self.detector.filter_noise(x, y, self.pulearning, self.threshold)
        else:
            x_clean, y_clean = self.detector.clean_noise(x, y, self.pulearning, self.threshold)
        # then train final model on clean data
        self.clf.fit(x_clean, y_clean, *args, **kwargs)

    def predict(self, x, *args, **kwargs):
        return self.clf.predict(x, *args, **kwargs)

    def predict_proba(self, x, *args, **kwargs):
        return self.clf.predict_proba(x, *args, **kwargs)


================================================
FILE: setup.py
================================================
import os
from glob import glob
from setuptools import find_packages, setup

setup(
    name='fraud_dataset_benchmark',
    version='1.0',

    # declare your packages
    packages=find_packages(where='src', exclude=('test',)),
    package_dir={'': 'src'},
    include_package_data=True,
    data_files=[('.',[
        'src/fdb/versioned_datasets/ipblock/20220607.zip',
    ])],

    # Enable build-time format checking
    check_format=False,

    # Enable type checking
    test_mypy=False,

    # Enable linting at build time
    test_flake8=False,

    # exclude_package_data={
    #     '': glob('fdb/*/__pycache__', recursive=True),
    # }
)


================================================
FILE: src/__init__.py
================================================


================================================
FILE: src/fdb/__init__.py
================================================


================================================
FILE: src/fdb/datasets.py
================================================
from abc import abstractmethod, ABC
from fdb.preprocessing import *  # also brings `np` (numpy) into this namespace
from fdb.preprocessing_objects import load_data
from sklearn.metrics import roc_auc_score, roc_curve, auc


class FraudDatasetBenchmark(ABC):
    def __init__(
        self,
        key,
        load_pre_downloaded=False,
        delete_downloaded=True,
        add_random_values_if_real_na = {
            "EVENT_TIMESTAMP": True,
            "LABEL_TIMESTAMP": True,
            "ENTITY_ID": True,
            "ENTITY_TYPE": True,
            "EVENT_ID": True
        }):
        self.key = key
        self.obj = load_data(self.key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na)

    @property
    def train(self):
        return self.obj.train

    @property
    def test(self):
        return self.obj.test

    @property
    def test_labels(self):
        return self.obj.test_labels

    def eval(self, y_pred):
        """
        Method to evaluate predictions against the test set
        """
        roc_score = roc_auc_score(self.test_labels['EVENT_LABEL'], y_pred)
        fpr, tpr, thres = roc_curve(self.test_labels['EVENT_LABEL'], y_pred)
        tpr_1fpr = np.interp(0.01, fpr, tpr)  # true positive rate at 1% false positive rate
        metrics = {'roc_score': roc_score, 'tpr_1fpr': tpr_1fpr}
        return metrics


================================================
FILE: src/fdb/kaggle_configs.py
================================================
KAGGLE_CONFIGS = {
    "fakejob": {
        "owner": "shivamb",
        "dataset": "real-or-fake-fake-jobposting-prediction",
        "filename": 'fake_job_postings.csv',
        "name": "Real / Fake Job Posting Prediction",
        "type": "datasets",
        "version": 1
    },
    "vehicleloan": {
        "owner": "avikpaul4u",
        "dataset": "vehicle-loan-default-prediction",
        "filename": 'train.csv',
        "name": "Vehicle Loan Default Prediction",
        "type": "datasets",
        "version": 4
    },
    "malurl": {
        "owner": "sid321axn",
        "dataset": "malicious-urls-dataset",
        "filename": 'malicious_phish.csv',
        "name": "Malicious URLs Dataset",
        "type": "datasets",
        "version": 1
    },
    "ieeecis": {
        "owner": "ieee-fraud-detection",
        "name": "IEEE-CIS Fraud Detection",
        "type": "competitions",
    },
    "ccfraud": {
        "owner": "mlg-ulb",
        "dataset": "creditcardfraud",
        "filename": 'creditcard.csv',
        "name": "Credit Card Fraud Detection",
        "type": "datasets",
        "version": 3
    },
    "fraudecom": {
        "owner": "vbinh002",
        "dataset": "fraud-ecommerce",
        "filename": 'Fraud_Data.csv',
        "name": "Fraud ecommerce",
        "type": "datasets",
        "version": 1
    },
    "sparknov": {
        "owner": "kartik2112",
        "dataset": "fraud-detection",
        "name": "Simulated Credit Card Transactions generated using Sparkov",
        "type": "datasets",
        "version": 1
    },
    "twitterbot": {
        "owner": "davidmartngutirrez",
        "dataset": "twitter-bots-accounts",
        "filename": "twitter_human_bots_dataset.csv",
        "name": "Twitter Bots Accounts",
        "type": "datasets",
        "version": 2
    }
}


================================================
FILE: src/fdb/preprocessing.py
================================================
import os
import re
import shutil
import kaggle
import pkgutil
import requests
import zipfile
import numpy as np
from abc import ABC
import pandas as pd
import socket, struct
from faker import Faker
from zipfile import ZipFile
from datetime import datetime
from datetime import timedelta
from io import StringIO, BytesIO
from dateutil.relativedelta import relativedelta

from fdb.kaggle_configs import KAGGLE_CONFIGS

fake = Faker(['en_US'])

# Naming convention for the meta data columns in standardized datasets
_EVENT_TIMESTAMP = 'EVENT_TIMESTAMP'  # timestamp column
_ENTITY_TYPE = 'ENTITY_TYPE'  # afd specific requirement
_EVENT_LABEL = 'EVENT_LABEL'  # label column
_EVENT_ID = 'EVENT_ID'  # transaction/event id
_ENTITY_ID = 'ENTITY_ID'  # represents user/account id
_LABEL_TIMESTAMP = 'LABEL_TIMESTAMP'  # added in cases where entity id is meaningful

# Kaggle config related strings
_OWNER = 'owner'
_COMPETITIONS = 'competitions'
_TYPE = 'type'
_FILENAME = 'filename'
_DATASETS = 'datasets'
_DATASET = 'dataset'
_VERSION = 'version'

# Some fixed parameters
_RANDOM_STATE = 1
_CWD = os.getcwd()
_DOWNLOAD_LOCATION = os.path.join(_CWD, 'tmp')
_TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%SZ'
_DEFAULT_LABEL_TIMESTAMP = datetime.now().strftime(_TIMESTAMP_FORMAT)


class BasePreProcessor(ABC):
    def __init__(
        self,
        key =
        None,
        train_percentage = 0.8,
        timestamp_col = None,
        label_col = None,
        label_timestamp_col = None,
        event_id_col = None,
        entity_id_col = None,
        features_to_drop = [],
        load_pre_downloaded = False,
        delete_downloaded = True,
        add_random_values_if_real_na = {
            "EVENT_TIMESTAMP": True,
            "LABEL_TIMESTAMP": True,
            "ENTITY_ID": True,
            "ENTITY_TYPE": True,
            "EVENT_ID": True
        }
    ):
        self.key = key
        self.train_percentage = train_percentage
        self.features_to_drop = features_to_drop
        self.delete_downloaded = delete_downloaded

        self._timestamp_col = timestamp_col
        self._label_col = label_col
        self._label_timestamp_col = label_timestamp_col
        self._event_id_col = event_id_col
        self._entity_id_col = entity_id_col
        self._add_random_values_if_real_na = add_random_values_if_real_na

        # Simply get all required objects at the time of object creation
        if KAGGLE_CONFIGS.get(self.key) and not load_pre_downloaded:
            self.download_kaggle_data()  # download the data when an object is created
        self.load_data()
        self.preprocess()
        self.train_test_split()

    def _download_kaggle_data_from_competetions(self):
        file_name = KAGGLE_CONFIGS[self.key][_OWNER]
        kaggle.api.competition_download_files(
            competition = KAGGLE_CONFIGS[self.key][_OWNER],
            path = _DOWNLOAD_LOCATION
        )
        return file_name

    def _download_kaggle_data_from_datasets_with_given_filename(self):
        file_name = KAGGLE_CONFIGS[self.key][_FILENAME]
        response = kaggle.api.datasets_download_file(
            owner_slug = KAGGLE_CONFIGS[self.key][_OWNER],
            dataset_slug = KAGGLE_CONFIGS[self.key][_DATASET],
            file_name = file_name,
            dataset_version_number=KAGGLE_CONFIGS[self.key][_VERSION],
            _preload_content = False,
        )
        with open(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'wb') as f:
            f.write(response.data)
        return file_name

    def _download_kaggle_data_from_datasets_containing_single_file(self):
        file_name = KAGGLE_CONFIGS[self.key][_DATASET]
        kaggle.api.dataset_download_files(
            dataset = os.path.join(KAGGLE_CONFIGS[self.key][_OWNER], KAGGLE_CONFIGS[self.key][_DATASET]),
            path = _DOWNLOAD_LOCATION
        )
        return file_name

    def download_kaggle_data(self):
        """
        Download and extract the data from Kaggle. Puts the data in tmp directory within current directory.
        """
        if not os.path.exists(_DOWNLOAD_LOCATION):
            os.mkdir(_DOWNLOAD_LOCATION)
        print('Data download location', _DOWNLOAD_LOCATION)

        if KAGGLE_CONFIGS[self.key][_TYPE] == _COMPETITIONS:
            file_name = self._download_kaggle_data_from_competetions()
        elif KAGGLE_CONFIGS[self.key][_TYPE] == _DATASETS:
            # If filename is given, download single file,
            # Else download all files.
            if KAGGLE_CONFIGS[self.key].get(_FILENAME):
                file_name = self._download_kaggle_data_from_datasets_with_given_filename()
            else:
                file_name = self._download_kaggle_data_from_datasets_containing_single_file()
        else:
            raise ValueError('Type should be among competitions or datasets in config')

        with zipfile.ZipFile(os.path.join(_DOWNLOAD_LOCATION, file_name + '.zip'), 'r') as zip_ref:
            zip_ref.extractall(_DOWNLOAD_LOCATION)

    def load_data(self):
        self.df = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION, KAGGLE_CONFIGS[self.key]['filename']), dtype='object')
        # delete downloaded data after loading in memory
        if self.delete_downloaded:
            shutil.rmtree(_DOWNLOAD_LOCATION)

    @property
    def timestamp_col(self):
        return self._timestamp_col  # If timestamp not available, will create fake timestamps

    @property
    def label_col(self):
        if self._label_col is None:
            raise ValueError('Label column not specified')
        else:
            return self._label_col

    @property
    def event_id_col(self):
        return self._event_id_col  # If event id not available, will create fake event ids

    @property
    def entity_id_col(self):
        return self._entity_id_col

    def standardize_timestamp_col(self):
        if self.timestamp_col is not None:
            self.df[_EVENT_TIMESTAMP] = pd.to_datetime(self.df[self.timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT))
            self.df.drop(self.timestamp_col, axis=1, inplace=True)
        elif self.timestamp_col is None and self._add_random_values_if_real_na[_EVENT_TIMESTAMP]:
            self.df[_EVENT_TIMESTAMP] = self.df[_EVENT_LABEL].apply(
                lambda x: fake.date_time_between(
                    start_date='-1y',  # think about making it to fixed date. vs from now?
                    end_date='now',
                    tzinfo=None).strftime(_TIMESTAMP_FORMAT))

        if self._label_timestamp_col is None and self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP  # most recent date
        elif self._label_timestamp_col is not None:
            self.df[_LABEL_TIMESTAMP] = pd.to_datetime(self.df[self._label_timestamp_col]).apply(lambda x: x.strftime(_TIMESTAMP_FORMAT))
            self.df.drop(self._label_timestamp_col, axis=1, inplace=True)

    def standardize_label_col(self):
        self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
        self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].astype(int)

    def standardize_event_id_col(self):
        if self.event_id_col is not None:
            self.df.rename({self.event_id_col: _EVENT_ID}, axis=1, inplace=True)
            self.df[_EVENT_ID] = self.df[_EVENT_ID].astype(str)
        elif self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:  # add fake one if not exist
            self.df[_EVENT_ID] = self.df[_EVENT_LABEL].apply(
                lambda x: fake.uuid4())

    def standardize_entity_id_col(self):
        if self.entity_id_col is not None:
            self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True)
        elif self.entity_id_col is None and self._add_random_values_if_real_na[_ENTITY_ID]:  # add fake one if not exist
            self.df[_ENTITY_ID] = self.df[_EVENT_LABEL].apply(
                lambda x: fake.uuid4())

    def rename_features(self):
        rename_map = {}  # default is empty map that won't rename any columns
        self.df.rename(rename_map, axis=1, inplace=True)

    def subset_features(self):
        features_to_select = self.df.columns.tolist()
        self.df = self.df[features_to_select]  # all by default

    def drop_features(self):
        self.df.drop(self.features_to_drop, axis=1, inplace=True)

    def add_meta_data(self):
        if self._add_random_values_if_real_na[_ENTITY_TYPE]:
            self.df[_ENTITY_TYPE] = 'user'

    def sort_by_timestamp(self):
        self.df.sort_values(by=_EVENT_TIMESTAMP, ascending=True, inplace=True)

    def lower_case_col_names(self):
        self.df.columns = [s.lower() for s in self.df.columns]

    def preprocess(self):
        self.lower_case_col_names()
        self.standardize_label_col()
        self.standardize_event_id_col()
        self.standardize_entity_id_col()
        self.standardize_timestamp_col()
        self.add_meta_data()
        self.rename_features()
        self.subset_features()
        self.drop_features()
        if self.timestamp_col:
            self.sort_by_timestamp()

    def train_test_split(self):
        """
        Default setting is out of time with 80%-20% into training and testing respectively
        """
        if self.timestamp_col:
            split_pt = int(self.df.shape[0]*self.train_percentage)
            self.train = self.df.copy().iloc[:split_pt, :]
            self.test = self.df.copy().iloc[split_pt:, :]
        else:
            # random if no timestamp col available
            self.train = self.df.sample(frac=self.train_percentage, random_state=_RANDOM_STATE)
            self.test = self.df.copy()[~self.df.index.isin(self.train.index)]
        self.test.reset_index(drop=True, inplace=True)
        self.test_labels = self.test[[_EVENT_LABEL]]
        if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:
            self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]
        self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore")


class FakejobPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(FakejobPreProcessor, self).__init__(**kw)


class VehicleloanPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(VehicleloanPreProcessor, self).__init__(**kw)


class MalurlPreProcessor(BasePreProcessor):
    """
    This one originally has multiple classes for malignant URLs.
    We combine all malignant classes into one class to keep the benchmark binary for now
    """
    def __init__(self, **kw):
        super(MalurlPreProcessor, self).__init__(**kw)

    def standardize_label_col(self):
        self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
        binary_mapper = {
            'defacement': 1,
            'phishing': 1,
            'malware': 1,
            'benign': 0
        }
        self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper)

    def add_dummy_col(self):
        self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4())

    def preprocess(self):
        super(MalurlPreProcessor, self).preprocess()
        self.add_dummy_col()


class IEEEPreProcessor(BasePreProcessor):
    """
    Some pre-processing was done using kaggle kernels below.

    References:
    Data Source: https://www.kaggle.com/c/ieee-fraud-detection/data
    Some processing from: https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600
    Feature selection to reduce to 100: https://www.kaggle.com/code/pavelvpster/ieee-fraud-feature-selection-rfecv/notebook
    """
    def __init__(self, **kw):
        super(IEEEPreProcessor, self).__init__(**kw)

    @staticmethod
    def _dtypes_cols():

        # FIRST 53 COLUMNS
        cols = ['TransactionID', 'TransactionDT', 'TransactionAmt',
                'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
                'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain',
                'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
                'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8',
                'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4',
                'M5', 'M6', 'M7', 'M8', 'M9']

        # V COLUMNS TO LOAD DECIDED BY CORRELATION EDA
        # https://www.kaggle.com/cdeotte/eda-for-columns-v-and-id
        v = [1, 3, 4, 6, 8, 11]
        v += [13, 14, 17, 20, 23, 26, 27, 30]
        v += [36, 37, 40, 41, 44, 47, 48]
        v += [54, 56, 59, 62, 65, 67, 68, 70]
        v += [76, 78, 80, 82, 86, 88, 89, 91]

        #v += [96, 98, 99, 104] #relates to groups, no NAN
        v += [107, 108, 111, 115, 117, 120, 121, 123] # maybe group, no NAN
        v += [124, 127, 129, 130, 136] # relates to groups, no NAN

        # LOTS OF NAN
        # BELOW
        v += [138, 139, 142, 147, 156, 162] #b1
        v += [165, 160, 166] #b1
        v += [178, 176, 173, 182] #b2
        v += [187, 203, 205, 207, 215] #b2
        v += [169, 171, 175, 180, 185, 188, 198, 210, 209] #b2
        v += [218, 223, 224, 226, 228, 229, 235] #b3
        v += [240, 258, 257, 253, 252, 260, 261] #b3
        v += [264, 266, 267, 274, 277] #b3
        v += [220, 221, 234, 238, 250, 271] #b3

        v += [294, 284, 285, 286, 291, 297] # relates to groups, no NAN
        v += [303, 305, 307, 309, 310, 320] # relates to groups, no NAN
        v += [281, 283, 289, 296, 301, 314] # relates to groups, no NAN

        # COLUMNS WITH STRINGS
        str_type = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain','M1', 'M2', 'M3', 'M4','M5',
                    'M6', 'M7', 'M8', 'M9', 'id_12', 'id_15', 'id_16', 'id_23', 'id_27', 'id_28', 'id_29', 'id_30',
                    'id_31', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo']
        str_type += ['id-12', 'id-15', 'id-16', 'id-23', 'id-27', 'id-28', 'id-29', 'id-30',
                     'id-31', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38']

        cols += ['V'+str(x) for x in v]
        dtypes = {}
        for c in cols+['id_0'+str(x) for x in range(1,10)]+['id_'+str(x) for x in range(10,34)]+\
            ['id-0'+str(x) for x in range(1,10)]+['id-'+str(x) for x in range(10,34)]:
            dtypes[c] = 'float32'
        for c in str_type:
            dtypes[c] = 'category'

        return dtypes, cols

    def load_data(self):
        """
        Hard coded file names for this dataset as it contains multiple files to be combined
        """
        dtypes, cols = IEEEPreProcessor._dtypes_cols()
        self.df = pd.read_csv(
            os.path.join(_DOWNLOAD_LOCATION, 'train_transaction.csv'),
            index_col='TransactionID', dtype=dtypes, usecols=cols+['isFraud'])
        self.df_id = pd.read_csv(
            os.path.join(_DOWNLOAD_LOCATION, 'train_identity.csv'),
            index_col='TransactionID', dtype=dtypes)
        self.df = self.df.merge(self.df_id, how='left', left_index=True, right_index=True)

        # delete downloaded data after loading in memory
        if self.delete_downloaded:
            shutil.rmtree(_DOWNLOAD_LOCATION)

    def normalization(self):
        # NORMALIZE D COLUMNS
        for i in range(1,16):
            if i in [1,2,3,5,9]:
                continue
            self.df['d'+str(i)] = self.df['d'+str(i)] - self.df[self.timestamp_col]/np.float32(24*60*60)

    def standardize_entity_id_col(self):
        def _encode_CB(col1, col2, df):
            nm = col1+'_'+col2
            df[nm] = df[col1].astype(str)+'_'+df[col2].astype(str)

        _encode_CB('card1', 'addr1', self.df)
        self.df['day'] = self.df[self.timestamp_col] / (24*60*60)
        self.df[_ENTITY_ID] = self.df['card1_addr1'].astype(str) + '_' + np.floor(self.df['day'] - self.df['d1']).astype(str)

    @staticmethod
    def _add_seconds(x):
        init_time = '2021-01-01T00:00:00Z'
        dt_format = _TIMESTAMP_FORMAT
        init_time = datetime.strptime(init_time, dt_format)  # start date from last 18 months
        final_time = init_time + timedelta(seconds=x)
        return final_time.strftime(_TIMESTAMP_FORMAT)

    def standardize_timestamp_col(self):
        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: IEEEPreProcessor._add_seconds(x))
        self.df.drop(self.timestamp_col, axis=1, inplace=True)

        if self._add_random_values_if_real_na["LABEL_TIMESTAMP"]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP  # most recent date

    def subset_features(self):
        features_to_select = \
            ['transactionamt', 'productcd', 'card1', 'card2', 'card3', 'card5', 'card6', 'addr1', 'dist1',
             'p_emaildomain', 'r_emaildomain', 'c1', 'c2', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11',
             'c12', 'c13', 'c14', 'v62', 'v70', 'v76', 'v78', 'v82', 'v91', 'v127', 'v130', 'v139', 'v160',
             'v165', 'v187', 'v203', 'v207', 'v209', 'v210', 'v221', 'v234', 'v257', 'v258', 'v261', 'v264',
             'v266', 'v267', 'v271', 'v274', 'v277', 'v283', 'v285', 'v289', 'v291', 'v294', 'id_01', 'id_02',
             'id_05', 'id_06', 'id_09', 'id_13', 'id_17', 'id_19', 'id_20', 'devicetype', 'deviceinfo',
             'EVENT_TIMESTAMP', 'ENTITY_ID', 'ENTITY_TYPE', 'EVENT_ID', 'EVENT_LABEL', 'LABEL_TIMESTAMP']

        self.df = self.df.loc[:, self.df.columns.isin(features_to_select)]

    def preprocess(self):
        self.lower_case_col_names()
        self.normalization()  # normalize D columns
        self.standardize_label_col()
        self.standardize_event_id_col()
        self.standardize_entity_id_col()
        self.standardize_timestamp_col()
        self.add_meta_data()
        self.rename_features()
        self.subset_features()
        if self.timestamp_col:
            self.sort_by_timestamp()


class CCFraudPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(CCFraudPreProcessor, self).__init__(**kw)

    @staticmethod
    def _add_minutes(x):
        dt_format = _TIMESTAMP_FORMAT
        init_time = datetime.strptime('2021-09-01T00:00:00Z', dt_format)  # chose randomly but in last 18 months
        final_time = init_time + timedelta(minutes=x)
        return final_time.strftime(_TIMESTAMP_FORMAT)

    def standardize_timestamp_col(self):
        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].astype(float).apply(lambda x: CCFraudPreProcessor._add_minutes(x))
        self.df.drop(self.timestamp_col, axis=1, inplace=True)

        if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP  # most recent date


class FraudecomPreProcessor(BasePreProcessor):
    def __init__(self, ip_address_col, signup_time_col, **kw):
        self.ip_address_col = ip_address_col
        self.signup_time_col = signup_time_col
        super(FraudecomPreProcessor, self).__init__(**kw)

    @staticmethod
    def _add_years(init_time):
        dt_format = '%Y-%m-%d %H:%M:%S'
        init_time = datetime.strptime(init_time, dt_format)
        final_time = init_time + relativedelta(years=6)  # move to more recent time range
        return final_time.strftime(_TIMESTAMP_FORMAT)

    def standardize_timestamp_col(self):
        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: FraudecomPreProcessor._add_years(x))
        self.df.drop(self.timestamp_col, axis=1, inplace=True)

        # Also add _LABEL_TIMESTAMP to allow training of this dataset with TFI
        if self._add_random_values_if_real_na[_LABEL_TIMESTAMP]:
            self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP  # most recent date

    def process_ip(self):
        """
        This dataset has ip address as a feature, but needs to be converted into standard IPV4.
        """
        self.df[self.ip_address_col] = self.df[self.ip_address_col].astype(float).astype(int).\
            apply(lambda x: socket.inet_ntoa(struct.pack('!L', x)))

    def create_time_since_signup(self):
        # NOTE: .dt.seconds is the seconds component of the timedelta (0-86399),
        # not the total elapsed seconds; whole days between signup and purchase are ignored
        self.df['time_since_signup'] = (
            pd.to_datetime(self.df[self.timestamp_col]) -\
            pd.to_datetime(self.df[self.signup_time_col])).dt.seconds

    def preprocess(self):
        self.lower_case_col_names()
        self.standardize_label_col()
        self.standardize_event_id_col()
        self.standardize_entity_id_col()
        self.create_time_since_signup()  # One manually engineered feature
        self.standardize_timestamp_col()
        self.add_meta_data()
        self.process_ip()  # This extra step added
        self.rename_features()
        self.drop_features()  # Replace select with drop
        if self.timestamp_col:
            self.sort_by_timestamp()


class SparknovPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(SparknovPreProcessor, self).__init__(**kw)

    def load_data(self):
        """
        Hard coded file names for this dataset as it contains multiple files to be combined
        """
        df_train = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTrain.csv'))
        df_train['seg'] = 'train'

        df_test = pd.read_csv(os.path.join(_DOWNLOAD_LOCATION,'fraudTest.csv'))
        df_test['seg'] = 'test'

        self.df = pd.concat([df_train, df_test], ignore_index=True)

        # delete downloaded data after loading in memory
        if self.delete_downloaded:
            shutil.rmtree(_DOWNLOAD_LOCATION)

    @staticmethod
    def _add_months(x):
        _TIMESTAMP_FORMAT_SPARKNOV = '%Y-%m-%d %H:%M:%S'
        x = datetime.strptime(x, _TIMESTAMP_FORMAT_SPARKNOV)
        final_time = x + relativedelta(months=20)  # chosen to move dates close to now()
        return final_time.strftime(_TIMESTAMP_FORMAT)

    def standardize_timestamp_col(self):
        self.df[_EVENT_TIMESTAMP] = self.df[self.timestamp_col].apply(lambda x: SparknovPreProcessor._add_months(x))
        self.df.drop(self.timestamp_col, axis=1, inplace=True)
        self.df[_LABEL_TIMESTAMP] = _DEFAULT_LABEL_TIMESTAMP  # most recent date

    def standardize_entity_id_col(self):
        self.df.rename({self.entity_id_col: _ENTITY_ID}, axis=1, inplace=True)
        self.df[_ENTITY_ID] = self.df[_ENTITY_ID].\
            str.lower().\
            apply(lambda x: re.sub(r'[^A-Za-z0-9]+', '_', x))

    def train_test_split(self):
        self.train = self.df.copy()[self.df['seg'] == 'train']
        self.train.reset_index(drop=True, inplace=True)
        self.train.drop(['seg'], axis=1, inplace=True)

        self.test = self.df.copy()[self.df['seg'] == 'test']
        self.test.reset_index(drop=True, inplace=True)
        self.test.drop(['seg'], axis=1, inplace=True)
        self.test = self.test.sample(n=20000, random_state=1)

        self.test_labels = self.test[[_EVENT_LABEL]]
        if self.event_id_col is None and self._add_random_values_if_real_na[_EVENT_ID]:
            self.test_labels[_EVENT_ID] = self.test[_EVENT_ID]
        self.test.drop([_EVENT_LABEL, _LABEL_TIMESTAMP], axis=1, inplace=True, errors="ignore")


class TwitterbotPreProcessor(BasePreProcessor):
    def __init__(self, **kw):
        super(TwitterbotPreProcessor, self).__init__(**kw)

    def standardize_label_col(self):
        self.df.rename({self.label_col: _EVENT_LABEL}, axis=1, inplace=True)
        binary_mapper = {
            'bot': 1,
            'human': 0
        }
        self.df[_EVENT_LABEL] = self.df[_EVENT_LABEL].map(binary_mapper)


class IPBlocklistPreProcessor(BasePreProcessor):
    """
    The dataset source is http://cinsscore.com/list/ci-badguys.txt.
    A sign-in/sign-up is not required to download/access the latest version of this dataset.

    Since this dataset is not version controlled at the source, we added the version of the dataset we used for
    the experiments discussed in the paper. The versioned dataset is as of 2022-06-07.

    The code is set to pick the fixed version. To use the latest version instead,
    the 'version' argument needs to be turned off (i.e. set to None)
    """
    def __init__(self, version, **kw):
        self.version = version  # string or None.
        # If string, picks one from versioned_datasets, else creates one from source
        super(IPBlocklistPreProcessor, self).__init__(**kw)

    def load_data(self):
        if self.version is None:
            # load malicious IPs from the source
            _URL = 'http://cinsscore.com/list/ci-badguys.txt'  # contains confirmed malicious IPs
            _N_BENIGN = 200000

            res = requests.get(_URL)
            ip_mal = pd.read_csv(StringIO(res.text), sep='\n', names=['ip'], header=None)
            ip_mal['is_ip_malign'] = 1

            # add fake IPs as benign
            ip_ben = pd.DataFrame({
                'ip': [fake.ipv4() for i in range(_N_BENIGN)],
                'is_ip_malign': 0
            })

            self.df = pd.concat([ip_mal, ip_ben], axis=0, ignore_index=True)
        else:
            _VERSIONED_DATA_PATH = f'versioned_datasets/{self.key}/{self.version}.zip'
            data = pkgutil.get_data(__name__, _VERSIONED_DATA_PATH)
            with zipfile.ZipFile(BytesIO(data)) as f:
                self.train = pd.read_csv(f.open('train.csv'))
                self.test = pd.read_csv(f.open('test.csv'))
                self.test_labels = pd.read_csv(f.open('test_labels.csv'))

    def add_dummy_col(self):
        self.df['dummy_cat'] = self.df[_EVENT_LABEL].apply(lambda x: fake.uuid4())

    def train_test_split(self):
        if self.version is None:
            super(IPBlocklistPreProcessor, self).train_test_split()

    def preprocess(self):
        if self.version is None:
            super(IPBlocklistPreProcessor, self).preprocess()
            self.add_dummy_col()


================================================
FILE: src/fdb/preprocessing_objects.py
================================================
from fdb.preprocessing import *


def load_data(key, load_pre_downloaded, delete_downloaded, add_random_values_if_real_na):
    common_kw = {
        "key": key,
        "load_pre_downloaded": load_pre_downloaded,
        "delete_downloaded": delete_downloaded,
        "add_random_values_if_real_na": add_random_values_if_real_na
    }

    if key == 'fakejob':
        obj = FakejobPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col = 'fraudulent',
            event_id_col = 'job_id',
            **common_kw
        )
    elif key == 'vehicleloan':
        obj = VehicleloanPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col =
            'loan_default',
            event_id_col = 'uniqueid',
            features_to_drop = ['disbursal_date'],
            **common_kw
        )
    elif key == 'malurl':
        obj = MalurlPreProcessor(
            train_percentage = 0.9,
            timestamp_col = None,
            label_col = 'type',
            event_id_col = None,
            **common_kw
        )
    elif key == 'ieeecis':
        obj = IEEEPreProcessor(
            train_percentage = 0.95,
            timestamp_col = 'transactiondt',
            label_col = 'isfraud',
            event_id_col = None,
            entity_id_col = None,  # manually created in code
            **common_kw
        )
    elif key == 'ccfraud':
        obj = CCFraudPreProcessor(
            train_percentage = 0.8,
            timestamp_col = 'time',
            label_col = 'class',
            event_id_col = None,
            **common_kw
        )
    elif key == 'fraudecom':
        obj = FraudecomPreProcessor(
            train_percentage = 0.8,
            timestamp_col = 'purchase_time',
            signup_time_col = 'signup_time',
            label_col = 'class',
            event_id_col = 'user_id',
            entity_id_col = 'device_id',
            ip_address_col = 'ip_address',
            features_to_drop = ['signup_time', 'sex'],
            **common_kw
        )
    elif key == 'sparknov':
        obj = SparknovPreProcessor(
            timestamp_col = 'trans_date_trans_time',
            label_col = 'is_fraud',
            event_id_col = 'trans_num',
            entity_id_col = 'merchant',
            features_to_drop = ['unix_time', 'unnamed: 0'],
            **common_kw
        )
    elif key == 'twitterbot':
        obj = TwitterbotPreProcessor(
            train_percentage = 0.8,
            timestamp_col = None,
            label_col = 'account_type',
            event_id_col = 'id',
            **common_kw
        )
    elif key == 'ipblock':
        obj = IPBlocklistPreProcessor(
            label_col = 'is_ip_malign',
            version = '20220607',
            **common_kw
        )
    else:
        raise ValueError('Invalid key')

    return obj


================================================
FILE: src/fdb/versioned_datasets/__init__.py
================================================


================================================
FILE: src/fdb/versioned_datasets/ipblock/__init__.py
================================================