Repository: sophos/SOREL-20M Branch: master Commit: 3664addd05c4 Files: 14 Total size: 21.8 MB Directory structure: gitextract__adqnldz/ ├── LICENSE.md ├── README.md ├── build_numpy_arrays_for_lightgbm.py ├── config.py ├── dataset.py ├── environment.yml ├── evaluate.py ├── generators.py ├── lightgbm_config.json ├── nets.py ├── pe_full_metadata_example/ │ └── 32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json ├── plot.py ├── shas_missing_ember_features.json └── train.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: LICENSE.md ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. 
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. 
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) 
The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright 2020 Sophos PLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: README.md
================================================

[SoReL-20M](#SoReL-20M)
[Terms of use](#terms-of-use)
[Requirements](#Requirements)
[Downloading the data](#downloading-the-data)
[A note on dataset size](#a-note-on-dataset-size)
[Quickstart](#Quickstart)
[Neural network training](#neural-network-training)
[LightGBM training](#lightgbm-training)
[Frequently Asked Questions](#frequently-asked-questions)
[Copyright and License](#copyright-and-license)

# SoReL-20M

Sophos-ReversingLabs 20 Million dataset

The code included in this repository produced the baseline models available at `s3://sorel-20m/09-DEC-2020/baselines`

This code depends on the SOREL dataset available via Amazon S3 at `s3://sorel-20m/09-DEC-2020/processed-data/`; to train the lightGBM models you can use the npz files available at `s3://sorel-20m/09-DEC-2020/lightGBM-features/` or use the scripts included here to extract the required files from the processed data.

If you use this code or this data in your own research, please cite our paper "SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection", found at https://arxiv.org/abs/2012.07634, using the following citation:

```
@misc{harang2020sorel20m,
      title={SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection},
      author={Richard Harang and Ethan M. Rudd},
      year={2020},
      eprint={2012.07634},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```

# Terms of use

Please read the [Terms of Use](https://github.com/sophos-ai/SOREL-20M/blob/master/Terms%20and%20Conditions%20of%20Use.pdf) before using this code or accessing the data.

# Requirements

Python 3.6+. See `environment.yml` for additional package requirements.

# Downloading the data

Individual files are available directly via https; e.g. you can download one of the baseline checkpoints via web at the url `http://sorel-20m.s3.amazonaws.com/09-DEC-2020/baselines/checkpoints/FFNN/seed0/epoch_1.pt`

For a large number of files, we recommend using the [AWS command line interface](https://aws.amazon.com/cli/). The SOREL-20M S3 bucket is public, so no credentials are required. For example, to download all feedforward neural network checkpoints for all seeds, use the command `aws s3 cp s3://sorel-20m/09-DEC-2020/baselines/checkpoints/FFNN/ . --recursive`

It is possible to download the entire dataset this way; however, we strongly recommend reading about the [dataset size](#a-note-on-dataset-size) before doing so and ensuring that you will not incur bandwidth fees or exhaust your available disk space in so doing.

# A note on dataset size

The full size of this dataset is approximately 8TB. It is highly recommended that you only obtain the specific elements you need. Files larger than 1GB are noted below.

```
s3://sorel-20m/09-DEC-2020/
|   Terms and Conditions of Use.pdf -- the terms you agree to by using this data and code
|
+---baselines
|   +---checkpoints
|   |   +---FFNN - per-epoch checkpoints for 5 seeds of the feed-forward neural network
|   |   +---lightGBM - final trained lightGBM model for 5 seeds
|   |
|   +---results
|   |   |   ffnn_results.json - index file of results, required for plotting
|   |   |   lgbm_results.json - index file of results, required for plotting
|   |   |
|   |   +---FFNN
|   |   |   +---seed0-seed4 - individual seed results, ~1GB each
|   |   |
|   |   +---lightgbm
|   |       +---seed0-seed4 - individual seed results, ~1GB each
|
+---binaries
|       approximately 8TB of zlib compressed malware binaries
|
+---lightGBM-features
|       test-features.npz - array of test data for lightGBM; 37GB
|       train-features.npz - array of training data for lightGBM; 113GB
|       validation-features.npz - array of validation data for lightGBM; 22GB
|
+---processed-data
    |   meta.db - contains index, labels, tags, and counts for the data; 3.5GB
    |
    +---ember_features - LMDB directory with baseline features, ~72GB
    +---pe_metadata - LMDB directory with full metadata dumps, ~480GB
```

Note: values in the LMDB files are serialized via msgpack and compressed via zlib; the code below handles this extraction automatically, however you will need to decompress and deserialize by hand if you use your own code to handle the data (see the sketch below). Please see the file `./pe_full_metadata_example/32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json` for an example of the metadata contained in the pe_metadata lmdb database.
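As a minimal sketch of that by-hand extraction (mirroring what `dataset.LMDBReader` does; the database path and `sha` value are placeholders for your own):

```python
import lmdb
import msgpack
import zlib
import numpy as np

sha = '...'  # placeholder: any sha256 key present in meta.db

# Open the ember_features LMDB read-only and fetch the entry for that key.
env = lmdb.open('/dataset/SoReL20M/ember_features', readonly=True, max_readers=1024)
with env.begin() as txn:
    raw = txn.get(sha.encode('ascii'))  # None if this sample has no features

# Each value is a zlib-compressed msgpack dict of the form {0: [list of floats]}.
value = msgpack.loads(zlib.decompress(raw), strict_map_key=False)
features = np.asarray(value[0], dtype=np.float32)  # 2381-dim Ember v2 vector
```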
# Quickstart

The main scripts of interest are:

1. `train.py` for training deep learning or (on a machine with sufficient RAM) LightGBM models
2. `evaluate.py` for taking a pretrained model and producing a results csv
3. `plot.py` for plotting the results

All scripts have multiple commands, documented via `--help`

Once you have cloned the repository, enter the repository directory and create a conda environment:

```
cd SoReL-20M
conda env create -f environment.yml
conda activate sorel
```

Ensure that you have the SOREL processed data in a local directory. Edit `config.py` to indicate the device to use (CPU or CUDA) as well as the dataset location and desired checkpoint directory. The dataset location should point to the folder that contains the `meta.db` file.

*Please note*: the complete contents of processed-data require approximately 552 GB of disk space, the bulk of which is the PE metadata and not used in training the baseline models. If you only wish to retrain the baseline models, then you will need only the following files (approximately 78GB in total):

```
/meta.db
/ember_features/data.mdb
/ember_features/lock.mdb
```

The file `shas_missing_ember_features.json` within this repository contains a list of sha256 values that indicate samples for which no Ember v2 feature values could be extracted; it is _highly recommended_ that the location of this file be passed to the `--remove_missing_features` parameter in `train.train_network`, `evaluate.evaluate_network`, and `evaluate.evaluate_lgb` to significantly speed up the data loading time. If it is not provided, you should specify `--remove_missing_features='scan'`, which will scan all keys and remove any with missing features before building the dataloader; otherwise, the dataloader will raise an error when it reaches a sample with missing features.
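Before launching a long job, you can sanity-check that the data is in place by counting rows in `meta.db` directly; a minimal sketch (the path is a placeholder, and the table/column names and timestamp come from `dataset.py` and `config.py`):

```python
import sqlite3

# Count training-split samples: rl_fs_t at or before train_validation_split.
conn = sqlite3.connect('/dataset/SoReL20M/meta.db')  # placeholder path
n_train = conn.execute(
    'select count(*) from meta where rl_fs_t <= 1543542570.0').fetchone()[0]
print(f'{n_train} training samples')
conn.close()
```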
You can train a neural network model with the following (note that config.py values can be overridden via command line switches):

```
python train.py train_network --remove_missing_features=shas_missing_ember_features.json
```

Assuming that the checkpoint has been written to /home/ubuntu/checkpoints/ and you wish to place the results.csv file in /home/ubuntu/results/0, you may produce a test set evaluation as follows:

```
python evaluate.py evaluate_network /home/ubuntu/results/0 /home/ubuntu/checkpoints/epoch_9.pt
```

To enable plotting of multiple series, the `plot.plot_roc_distribution_for_tag` function requires a json file that maps the name for a particular run to the results.csv file for that run.

```
# Re-plot baselines -- note that the below command assumes
# that the baseline models at s3://sorel-20m/09-DEC-2020/baselines
# have been downloaded to the /baselines directory
python plot.py plot_roc_distribution_for_tag /baselines/results/ffnn_results.json ./ffnn_results.png
```

# Neural network training

While a GPU allows for faster training (10 epochs can be completed in approximately 90 minutes), this model can also be trained via CPU; the provided results were obtained via GPU on an Amazon g3.4xlarge EC2 instance starting with a "Deep Learning AMI (Ubuntu 16.04) Version 26.0 (ami-025ed45832b817a35)" and updating it as above. In practice, disk I/O loading features from the feature database seems to be the rate-limiting step assuming a GPU is used, so running on a machine with multiple cores and using a drive with high IOPS is recommended. Training the network requires approximately 12GB of RAM when trained via CPU, though it varies slightly with the number of cores. It is also highly recommended to use the `--remove_missing_features=shas_missing_ember_features.json` option, as this significantly improves loading time of the data.

Note: if you get an error message `RuntimeError: received 0 items of ancdata`, this is typically caused by the limit on the maximum number of open files being too low; this may be increased via the `ulimit` command. In some cases -- if you use a large number of parallel workers -- it may also be necessary to increase shared memory.

The commands to train and evaluate a neural network model are

```
python train.py train_network
python evaluate.py evaluate_network
```

Use `--help` for either script to see details and options. The model itself is given in `nets.PENetwork`

# LightGBM training

Due to the size of the dataset, training a boosted model is difficult. We use lightGBM, which has relatively memory-efficient data handlers, allowing it to fit a model in-memory using approximately 175GB of RAM. The lightGBM model provided in this repository was trained on an Amazon m5.24xlarge instance.

The script `build_numpy_arrays_for_lightgbm.py` will take the training/validation/testing datasets and split them into three .npz files in the specified data location that can then be used for training a LightGBM model. Please note that these files will be extremely large (113GB, 23GB, and 38GB, respectively) using the provided Ember features. Alternatively, you may use the pre-extracted npz files available at `s3://sorel-20m/09-DEC-2020/lightGBM-features/` which contain Ember features using the default time splits for training, validation, and testing.
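If you build the archives yourself with the script above, a minimal loading sketch follows. This assumes the layout written by `dump_data_to_numpy`: `np.savez` is called with positional arguments, so the per-batch arrays land under numpy's default keys `arr_0` (features) and `arr_1` (labels); the pre-extracted S3 archives may be laid out differently.

```python
import numpy as np

# Stack the per-batch feature and label arrays into flat arrays.
with np.load('validation-features.npz', allow_pickle=True) as data:
    X = np.concatenate(list(data['arr_0']))
    y = np.concatenate(list(data['arr_1']))
print(X.shape, y.shape)
```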
The lightGBM model can be trained in much the same manner as the neural network:

```
python train.py train_lightGBM --train_npz_file=/dataset/train-features.npz --validation_npz_file=/dataset/validation-features.npz --model_configuration_file=./lightgbm_config.json --checkpoint_dir=/dataset/baselines/checkpoints/lightGBM/run0/
```

Assuming that you've placed the S3 dataset in /dataset as suggested above, this command will perform a single evaluation run.

```
python evaluate.py evaluate_lgb /dataset/baselines/checkpoints/lightGBM/seed0/lightgbm.model /home/ubuntu/lightgbm_eval --remove_missing_features=./shas_missing_ember_features.json
```

The function used to generate the numpy array files from the database is `build_numpy_arrays_for_lightgbm.dump_data_to_numpy`. Note that this script requires approximately as much memory as training the model; an m5.24xlarge or equivalent EC2 instance type is recommended.

# Frequently Asked Questions

**Are there any benign samples available?**

Unfortunately, due to the risk of intellectual property violations, we are not able to make the benign samples freely available. The samples are available via ReversingLabs, and anecdotally a large number of them also appear to be available via VirusTotal. We are not able to provide any further assistance in this respect.

**I computed the SHA256 for a malware sample and it's different from the SHA256 value suggested by the file name; why?**

All malware samples have been disarmed as described below; the SHA256 value in the file name is for the original, unmodified file.

**How were the files disarmed?**

The OptionalHeader.Subsystem flag and the FileHeader.Machine header value were both set to 0 to prevent accidental execution of the files.
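For illustration, this is easy to confirm with the third-party `pefile` package (not part of this repository); a sketch, where the file path is a placeholder and the download is zlib-compressed as noted in the directory listing above:

```python
import pefile
import zlib

# Samples under binaries/ are stored zlib-compressed; decompress first,
# then check the two header fields that were zeroed when disarming.
with open('path/to/downloaded_sample', 'rb') as f:  # placeholder path
    pe = pefile.PE(data=zlib.decompress(f.read()))
print(pe.FILE_HEADER.Machine)        # 0 for a disarmed sample
print(pe.OPTIONAL_HEADER.Subsystem)  # 0 for a disarmed sample
```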
**Can you provide a tool to re-arm the files, or the original non-disarmed file?**

Unfortunately, we cannot assist anyone in re-arming the files or in obtaining the original, non-disarmed samples. As with the benign files, they are available via ReversingLabs, and a large number of them also appear to be available via VirusTotal.

**How are the malware/benign labels determined?**

We use a combination of non-public, internal information as well as a number of static rules and analyses to obtain the ground truth labels.

**Isn't releasing this data dangerous?**

As we describe in our [blog post](https://ai.sophos.com/2020/12/14/sophos-reversinglabs-sorel-20-million-sample-malware-dataset/):

> The malware we’re releasing is “disarmed” so that it will not execute. This means it would take knowledge, skill, and time to reconstitute the samples and get them to actually run. That said, we recognize that there is at least some possibility that a skilled attacker could learn techniques from these samples or use samples from the dataset to assemble attack tools to use as part of their malicious activities. However, in reality, there are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use. In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers.

**Is the feature extraction code available for me to apply to my own samples?**

The feature extraction function is available from the [EMBER repository](https://github.com/elastic/ember/) -- specifically, we used the `PEFeatureExtractor.feature_vector()` method in [features.py](https://github.com/elastic/ember/blob/master/ember/features.py). We parallelized this code and constructed the dataset using Sophos AI internal tools, and are unable to provide this code; please see below for some notes on feature extraction and extending the dataset.

**How can I add additional files/features to the dataset?**

We are not accepting additional data for the main dataset. To add new features, files, or both to your own personal copy of it, we have the following recommendations:

1. The `meta.db` sqlite file serves as the index for the LMDB database, and contains metadata and labels. At a minimum, for each file, the sqlite database should contain columns for: the file sha256, the malware label, and a first-seen timestamp.
2. To serialize a feature vector to a LMDB database, each individual sample's feature vector needs to be encoded into a dictionary with a key of zero and a value that is a 1-d list of floats, then serialized via msgpack and compressed via zlib, then inserted into an LMDB database with a key as the hash of the original file (see the sketch below). If you are extracting new features for the existing files, it's important to note that the filenames of the samples are the sha256 values of the original, non-disarmed files, and so you should just re-use that filename rather than compute the hash of the file yourself.
3. We obtained best performance for feature extraction using RAM disks wherever possible -- for the files that features are being extracted from at a minimum, and if memory permits, for the LMDB databases as well.
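A minimal sketch of recommendation 2, with the database path, `sha`, and `feature_vector` as placeholders:

```python
import lmdb
import msgpack
import zlib

sha = '...'              # placeholder: sha256 taken from the original file's name
feature_vector = [...]   # placeholder: 1-d list of floats for this sample

env = lmdb.open('/path/to/your_features_lmdb', map_size=int(1e13))
with env.begin(write=True) as txn:
    blob = zlib.compress(msgpack.dumps({0: list(feature_vector)}))
    txn.put(sha.encode('ascii'), blob)
```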
**What are the .npz files and how do they differ from the LMDB data?**

The .npz files in the lightGBM-features directory contain features that are identical to the features in the LMDB database (with training, validation, and test splits given as per the timestamps in `config.py`) but converted to flat numpy arrays for convenience in training the lightGBM models. They contain only binary labels, no tag information.

**The values for the "tag" columns are counts, not binary values; why?**

As described in our [paper](https://arxiv.org/abs/1905.06262) on the tag generation, we parse vendor threat feed information for tokens indicative of the behavioral category of the malware; the values in these columns denote the number of tokens we identified for that tag for that sample. They may be taken as correlated with the degree of certainty in the tag, but are not calibrated to a standard scale. For most applications we suggest binarizing this value by zero/non-zero.

# Copyright and License

Copyright 2020, Sophos Limited. All rights reserved.

'Sophos' and 'Sophos Anti-Virus' are registered trademarks of Sophos Limited and Sophos Group. All other product and company names mentioned are trademarks or registered trademarks of their respective owners.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: build_numpy_arrays_for_lightgbm.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

import baker
from copy import deepcopy
import sys
import numpy as np
from config import validation_test_split, train_validation_split, db_path
from generators import get_generator


@baker.command
def dump_data_to_numpy(mode, output_file, workers=1, batchsize=1000, remove_missing_features='scan'):
    """
    Produce numpy files required for training lightgbm model from SQLite + LMDB database.

    :param mode: One of 'train', 'validation', or 'test' representing which set of the data to
        process to file. Splits are obtained based on timestamps in config.py
    :param output_file: The name of the output file to produce for the indicated split.
    :param workers: How many worker processes to use (default 1)
    :param batchsize: The batch size to use in collecting samples (default 1000)
    :param remove_missing_features: How to check for and remove missing features; see README.md
        for recommendations (default 'scan')
    """
    _generator = get_generator(path=db_path,
                               mode=mode,
                               batch_size=batchsize,
                               use_malicious_labels=True,
                               use_count_labels=False,
                               use_tag_labels=False,
                               num_workers=workers,
                               remove_missing_features=remove_missing_features,
                               shuffle=False)
    feature_array = []
    label_array = []
    for i, (features, labels) in enumerate(_generator):
        feature_array.append(deepcopy(features.numpy()))
        label_array.append(deepcopy(labels['malware'].numpy()))
        sys.stdout.write(f"\r{i} / {len(_generator)}")
        sys.stdout.flush()
    np.savez(output_file, feature_array, label_array)
    print(f"\nWrote output to {output_file}")


if __name__ == '__main__':
    baker.run()

================================================
FILE: config.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

# set this to the desired device, e.g. 'cuda:0' if a GPU is available
device = 'cuda:0'
# device = 'cpu'

# NOTE -- if you change the below values, your results will not be comparable with those from
# other users of this data set.
# This is the timestamp that divides the validation data (used to check convergence/overfitting)
# from test data (used to assess final performance)
validation_test_split = 1547279640.0
# This is the timestamp that splits training data from validation data
train_validation_split = 1543542570.0

# modify these paths as needed to point to the directory that contains the meta.db file
# and to indicate where the checkpoints should be placed during model training
db_path = '/dataset/SoReL20M'
checkpoint_dir = '/dataset/checkpoints'

# adjust the batch size as needed given memory/bus constraints
batch_size = 8192
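# For reference, the two split timestamps above correspond roughly to
# 2018-11-30 UTC (train/validation) and 2019-01-12 UTC (validation/test);
# e.g. datetime.datetime.utcfromtimestamp(train_validation_split) will confirm.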
================================================
FILE: dataset.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

from torch.utils import data
import lmdb
import sqlite3
import baker
import msgpack
import zlib
import numpy as np
import os
import tqdm
from logzero import logger
import config
import json


class LMDBReader(object):
    def __init__(self, path, postproc_func=None):
        self.env = lmdb.open(path, readonly=True, map_size=1e13, max_readers=1024)
        self.postproc_func = postproc_func

    def __call__(self, key):
        with self.env.begin() as txn:
            x = txn.get(key.encode('ascii'))
        if x is None:
            return None
        x = msgpack.loads(zlib.decompress(x), strict_map_key=False)
        if self.postproc_func is not None:
            x = self.postproc_func(x)
        return x


def features_postproc_func(x):
    x = np.asarray(x[0], dtype=np.float32)
    lz = x < 0
    gz = x > 0
    x[lz] = -np.log(1 - x[lz])
    x[gz] = np.log(1 + x[gz])
    return x


def tags_postproc_func(x):
    x = list(x[b'labels'].values())
    x = np.asarray(x)
    return x
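# Note: features_postproc_func above applies a signed log transform,
# x -> sign(x) * log(1 + |x|), compressing the dynamic range of the raw
# Ember feature values before they reach the model.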
logger.info(f"Trying to load shas to ignore from {remove_missing_features}...") with open(remove_missing_features, 'r') as f: shas_to_remove = json.load(f) shas_to_remove = set(shas_to_remove) vals = [value for value in vals if value[retrieve_ind['sha256']] not in shas_to_remove] logger.info(f"Dataset now has {len(vals)} samples.") self.keylist = list(map(lambda x: x[retrieve_ind['sha256']], vals)) if self.return_malicious: self.labels = list(map(lambda x: x[retrieve_ind['is_malware']], vals)) if self.return_counts: self.count_labels = list(map(lambda x: x[retrieve_ind['rl_ls_const_positives']], vals)) if self.return_tags: self.tag_labels = np.asarray([list(map(lambda x: x[retrieve_ind[t]], vals)) for t in Dataset.tags]).T if binarize_tag_labels: self.tag_labels = (self.tag_labels != 0).astype(int) def __len__(self): return len(self.keylist) def __getitem__(self, index): labels = {} key = self.keylist[index] features = self.features_lmdb_reader(key) if self.return_malicious: labels['malware'] = self.labels[index] if self.return_counts: labels['count'] = self.count_labels[index] if self.return_tags: labels['tags'] = self.tag_labels[index] if self.return_shas: return key, features, labels else: return features, labels if __name__ == '__main__': baker.run() ================================================ FILE: environment.yml ================================================ name: sorel channels: - defaults - pytorch dependencies: - pip - python=3.6 - pytorch::pytorch - pytorch::torchvision - cudatoolkit=10.1 - tqdm - scikit-learn - lightgbm - matplotlib - pandas - pip: - baker==1.3 - lmdb==0.98 - logzero==1.5.0 - msgpack==0.6.2 prefix: /home/ubuntu/anaconda3/envs/sorel ================================================ FILE: evaluate.py ================================================ # Copyright 2020, Sophos Limited. All rights reserved. # # 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of # Sophos Limited and Sophos Group. All other product and company # names mentioned are trademarks or registered trademarks of their # respective owners. import torch import baker from nets import PENetwork from generators import get_generator import tqdm import os from config import device import config from dataset import Dataset import pickle from logzero import logger from copy import deepcopy import pandas as pd import numpy as np all_tags = Dataset.tags def detach_and_copy_array(array): if isinstance(array, torch.Tensor): return deepcopy(array.cpu().detach().numpy()).ravel() elif isinstance(array, np.ndarray): return deepcopy(array).ravel() else: raise ValueError("Got array of unknown type {}".format(type(array))) def normalize_results(labels_dict, results_dict, use_malware=True, use_count=True, use_tags=True): """ Take a set of results dicts and break them out into a single dict of 1d arrays with appropriate column names that pandas can convert to a DataFrame. 
""" # we do a lot of deepcopy stuff here to avoid a FD "leak" in the dataset generator # see here: https://github.com/pytorch/pytorch/issues/973#issuecomment-459398189 rv = {} if use_malware: rv['label_malware'] = detach_and_copy_array(labels_dict['malware']) rv['pred_malware'] = detach_and_copy_array(results_dict['malware']) if use_count: rv['label_count'] = detach_and_copy_array(labels_dict['count']) rv['pred_count'] = detach_and_copy_array(results_dict['count']) if use_tags: for column, tag in enumerate(all_tags): rv[f'label_{tag}_tag'] = detach_and_copy_array(labels_dict['tags'][:, column]) rv[f'pred_{tag}_tag']=detach_and_copy_array(results_dict['tags'][:, column]) return rv @baker.command def evaluate_network(results_dir, checkpoint_file, db_path=config.db_path, evaluate_malware=True, evaluate_count=True, evaluate_tags=True, remove_missing_features='scan'): """ Take a trained feedforward neural network model and output evaluation results to a csv in the specified location. :param results_dir: The directory to which to write the 'results.csv' file; WARNING -- this will overwrite any existing results in that location :param checkpoint_file: The checkpoint file containing the weights to evaluate :param db_path: the path to the directory containing the meta.db file; defaults to the value in config.py :param evaluate_malware: defaults to True; whether or not to record malware labels and predictions :param evaluate_count: defaults to True; whether or not to record count labels and predictions :param evaluate_tags: defaults to True; whether or not to record individual tag labels and predictions :param remove_missing_features: See help for remove_missing_features in train.py / train_network """ os.system('mkdir -p {}'.format(results_dir)) model = PENetwork(use_malware=True, use_counts=True, use_tags=True, n_tags=len(Dataset.tags), feature_dimension=2381) model.load_state_dict(torch.load(checkpoint_file)) model.to(device) generator = get_generator(mode='test', path=db_path, use_malicious_labels=evaluate_malware, use_count_labels=evaluate_count, use_tag_labels=evaluate_tags, return_shas=True, remove_missing_features=remove_missing_features) logger.info('...running network evaluation') f = open(os.path.join(results_dir,'results.csv'),'w') first_batch = True for shas, features, labels in tqdm.tqdm(generator): features = features.to(device) predictions = model(features) results = normalize_results(labels, predictions) pd.DataFrame(results, index=shas).to_csv(f, header=first_batch) first_batch=False f.close() print('...done') import lightgbm as lgb @baker.command def evaluate_lgb(lightgbm_model_file, results_dir, db_path=config.db_path, remove_missing_features='scan' ): """ Take a trained lightGBM model and perform an evaluation on it. 
================================================
FILE: generators.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

from dataset import Dataset
import os
from torch.utils import data
import config
from multiprocessing import cpu_count

max_workers = cpu_count()


class GeneratorFactory(object):
    def __init__(self, ds_root, batch_size=None, mode='train', num_workers=max_workers,
                 use_malicious_labels=False, use_count_labels=False, use_tag_labels=False,
                 return_shas=False, features_lmdb='ember_features',
                 remove_missing_features='scan', shuffle=None):
        if mode not in {'train', 'validation', 'test'}:
            raise ValueError('invalid mode {}'.format(mode))
        ds = Dataset(metadb_path=os.path.join(ds_root, 'meta.db'),
                     features_lmdb_path=os.path.join(ds_root, features_lmdb),
                     return_malicious=use_malicious_labels,
                     return_counts=use_count_labels,
                     return_tags=use_tag_labels,
                     return_shas=return_shas,
                     mode=mode,
                     remove_missing_features=remove_missing_features)
        if batch_size is None:
            batch_size = 1024
        # check passed-in value for shuffle; pick a sensible default if it's None
        if shuffle is not None:
            if not ((shuffle is True) or (shuffle is False)):
                raise ValueError(f"'shuffle' should be either True or False, got {shuffle}")
        else:
            if mode == 'train':
                shuffle = True
            else:
                shuffle = False
        params = {'batch_size': batch_size,
                  'shuffle': shuffle,
                  'num_workers': num_workers}
        self.generator = data.DataLoader(ds, **params)

    def __call__(self):
        return self.generator


def get_generator(mode, path=config.db_path, use_malicious_labels=True, use_count_labels=True,
                  use_tag_labels=True, batch_size=config.batch_size, return_shas=False,
                  remove_missing_features='scan', num_workers=None, shuffle=None,
                  feature_lmdb='ember_features'):
    if num_workers is None:
        num_workers = max_workers
    return GeneratorFactory(path, batch_size=batch_size, mode=mode, num_workers=num_workers,
                            use_malicious_labels=use_malicious_labels,
                            use_count_labels=use_count_labels,
                            use_tag_labels=use_tag_labels,
                            return_shas=return_shas,
                            remove_missing_features=remove_missing_features,
                            shuffle=shuffle,
                            features_lmdb=feature_lmdb)()
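# Illustrative usage: build a small validation loader and pull one batch.
#   gen = get_generator(mode='validation', batch_size=64, num_workers=4,
#                       remove_missing_features='shas_missing_ember_features.json')
#   features, labels = next(iter(gen))  # features: torch.Size([64, 2381])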
================================================
FILE: lightgbm_config.json
================================================

{"objective": "binary", "task": "train", "boosting": "gbdt", "num_iterations": 500, "learning_rate": 0.1, "max_depth": -1, "num_leaves": 64, "tree_learner": "serial", "num_threads": 0, "device_type": "cpu", "seed": 0, "min_data_in_leaf": 100, "min_sum_hessian_in_leaf": 0.001, "bagging_fraction": 0.9, "bagging_freq": 1, "bagging_seed": 0, "feature_fraction": 0.9, "feature_fraction_bynode": 0.9, "feature_fraction_seed": 0, "early_stopping_rounds": 10, "first_metric_only": true, "max_delta_step": 0, "lambda_l1": 0, "lambda_l2": 1.0, "verbosity": 2, "is_unbalance": true, "sigmoid": 1.0, "boost_from_average": true, "metric": ["binary_logloss", "auc", "binary_error"]}

================================================
FILE: nets.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

import torch
from torch import nn
import torch.nn.functional as F


class PENetwork(nn.Module):
    """
    This is a simple network loosely based on the one used in ALOHA: Auxiliary Loss Optimization
    for Hypothesis Augmentation (https://arxiv.org/abs/1903.05700). Note that it uses fewer
    (and smaller) layers, as well as a single layer for all tag predictions; performance will
    suffer accordingly.
    """

    def __init__(self, use_malware=True, use_counts=True, use_tags=True, n_tags=None,
                 feature_dimension=1024, layer_sizes=None):
        self.use_malware = use_malware
        self.use_counts = use_counts
        self.use_tags = use_tags
        self.n_tags = n_tags
        if self.use_tags and self.n_tags is None:
            raise ValueError("n_tags was None but we're trying to predict tags. Please include n_tags")
Please include n_tags") super(PENetwork,self).__init__() p = 0.05 layers = [] if layer_sizes is None:layer_sizes=[512,512,128] for i,ls in enumerate(layer_sizes): if i == 0: layers.append(nn.Linear(feature_dimension,ls)) else: layers.append(nn.Linear(layer_sizes[i-1],ls)) layers.append(nn.LayerNorm(ls)) layers.append(nn.ELU()) layers.append(nn.Dropout(p)) self.model_base = nn.Sequential(*tuple(layers)) self.malware_head = nn.Sequential(nn.Linear(layer_sizes[-1], 1), nn.Sigmoid()) self.count_head = nn.Linear(layer_sizes[-1], 1) self.sigmoid = nn.Sigmoid() self.tag_head = nn.Sequential(nn.Linear(layer_sizes[-1],64), nn.ELU(), nn.Linear(64,64), nn.ELU(), nn.Linear(64,n_tags), nn.Sigmoid()) def forward(self,data): rv = {} base_result = self.model_base.forward(data) if self.use_malware: rv['malware'] = self.malware_head(base_result) if self.use_counts: rv['count'] = self.count_head(base_result) if self.use_tags: rv['tags'] = self.tag_head(base_result) return rv ================================================ FILE: pe_full_metadata_example/32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json ================================================ { "0": { "DOS_HEADER": { "Structure": "IMAGE_DOS_HEADER", "e_magic": { "FileOffset": 0, "Offset": 0, "Value": 23117 }, "e_cblp": { "FileOffset": 2, "Offset": 2, "Value": 144 }, "e_cp": { "FileOffset": 4, "Offset": 4, "Value": 3 }, "e_crlc": { "FileOffset": 6, "Offset": 6, "Value": 0 }, "e_cparhdr": { "FileOffset": 8, "Offset": 8, "Value": 4 }, "e_minalloc": { "FileOffset": 10, "Offset": 10, "Value": 0 }, "e_maxalloc": { "FileOffset": 12, "Offset": 12, "Value": 65535 }, "e_ss": { "FileOffset": 14, "Offset": 14, "Value": 0 }, "e_sp": { "FileOffset": 16, "Offset": 16, "Value": 184 }, "e_csum": { "FileOffset": 18, "Offset": 18, "Value": 0 }, "e_ip": { "FileOffset": 20, "Offset": 20, "Value": 0 }, "e_cs": { "FileOffset": 22, "Offset": 22, "Value": 0 }, "e_lfarlc": { "FileOffset": 24, "Offset": 24, "Value": 64 }, "e_ovno": { "FileOffset": 26, "Offset": 26, "Value": 0 }, "e_res": { "FileOffset": 28, "Offset": 28, "Value": "\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00" }, "e_oemid": { "FileOffset": 36, "Offset": 36, "Value": 0 }, "e_oeminfo": { "FileOffset": 38, "Offset": 38, "Value": 0 }, "e_res2": { "FileOffset": 40, "Offset": 40, "Value": "\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00" }, "e_lfanew": { "FileOffset": 60, "Offset": 60, "Value": 128 } }, "NT_HEADERS": { "Structure": "IMAGE_NT_HEADERS", "Signature": { "FileOffset": 128, "Offset": 0, "Value": 17744 } }, "FILE_HEADER": { "Structure": "IMAGE_FILE_HEADER", "Machine": { "FileOffset": 132, "Offset": 0, "Value": 332 }, "NumberOfSections": { "FileOffset": 134, "Offset": 2, "Value": 3 }, "TimeDateStamp": { "FileOffset": 136, "Offset": 4, "Value": "0x586FEBD9 [Fri Jan 6 19:11:21 2017 UTC]" }, "PointerToSymbolTable": { "FileOffset": 140, "Offset": 8, "Value": 0 }, "NumberOfSymbols": { "FileOffset": 144, "Offset": 12, "Value": 0 }, "SizeOfOptionalHeader": { "FileOffset": 148, "Offset": 16, "Value": 224 }, "Characteristics": { "FileOffset": 150, "Offset": 18, "Value": 8450 } }, "Flags": [ "IMAGE_FILE_EXECUTABLE_IMAGE", "IMAGE_FILE_32BIT_MACHINE", "IMAGE_FILE_DLL" ], "OPTIONAL_HEADER": { "Structure": "IMAGE_OPTIONAL_HEADER", "Magic": { "FileOffset": 152, "Offset": 0, "Value": 267 }, "MajorLinkerVersion": { "FileOffset": 154, "Offset": 2, "Value": 8 }, "MinorLinkerVersion": { "FileOffset": 155, "Offset": 3, "Value": 0 }, "SizeOfCode": { 
"FileOffset": 156, "Offset": 4, "Value": 7168 }, "SizeOfInitializedData": { "FileOffset": 160, "Offset": 8, "Value": 1536 }, "SizeOfUninitializedData": { "FileOffset": 164, "Offset": 12, "Value": 0 }, "AddressOfEntryPoint": { "FileOffset": 168, "Offset": 16, "Value": 15278 }, "BaseOfCode": { "FileOffset": 172, "Offset": 20, "Value": 8192 }, "BaseOfData": { "FileOffset": 176, "Offset": 24, "Value": 16384 }, "ImageBase": { "FileOffset": 180, "Offset": 28, "Value": 4194304 }, "SectionAlignment": { "FileOffset": 184, "Offset": 32, "Value": 8192 }, "FileAlignment": { "FileOffset": 188, "Offset": 36, "Value": 512 }, "MajorOperatingSystemVersion": { "FileOffset": 192, "Offset": 40, "Value": 4 }, "MinorOperatingSystemVersion": { "FileOffset": 194, "Offset": 42, "Value": 0 }, "MajorImageVersion": { "FileOffset": 196, "Offset": 44, "Value": 0 }, "MinorImageVersion": { "FileOffset": 198, "Offset": 46, "Value": 0 }, "MajorSubsystemVersion": { "FileOffset": 200, "Offset": 48, "Value": 4 }, "MinorSubsystemVersion": { "FileOffset": 202, "Offset": 50, "Value": 0 }, "Reserved1": { "FileOffset": 204, "Offset": 52, "Value": 0 }, "SizeOfImage": { "FileOffset": 208, "Offset": 56, "Value": 32768 }, "SizeOfHeaders": { "FileOffset": 212, "Offset": 60, "Value": 512 }, "CheckSum": { "FileOffset": 216, "Offset": 64, "Value": 0 }, "Subsystem": { "FileOffset": 220, "Offset": 68, "Value": 3 }, "DllCharacteristics": { "FileOffset": 222, "Offset": 70, "Value": 34112 }, "SizeOfStackReserve": { "FileOffset": 224, "Offset": 72, "Value": 1048576 }, "SizeOfStackCommit": { "FileOffset": 228, "Offset": 76, "Value": 4096 }, "SizeOfHeapReserve": { "FileOffset": 232, "Offset": 80, "Value": 1048576 }, "SizeOfHeapCommit": { "FileOffset": 236, "Offset": 84, "Value": 4096 }, "LoaderFlags": { "FileOffset": 240, "Offset": 88, "Value": 0 }, "NumberOfRvaAndSizes": { "FileOffset": 244, "Offset": 92, "Value": 16 } }, "DllCharacteristics": [ "IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE", "IMAGE_DLLCHARACTERISTICS_NX_COMPAT", "IMAGE_DLLCHARACTERISTICS_NO_SEH", "IMAGE_DLLCHARACTERISTICS_TERMINAL_SERVER_AWARE" ], "PE Sections": [ { "Structure": "IMAGE_SECTION_HEADER", "Name": { "FileOffset": 376, "Offset": 0, "Value": ".text\\x00\\x00\\x00" }, "Misc": { "FileOffset": 384, "Offset": 8, "Value": 7092 }, "Misc_PhysicalAddress": { "FileOffset": 384, "Offset": 8, "Value": 7092 }, "Misc_VirtualSize": { "FileOffset": 384, "Offset": 8, "Value": 7092 }, "VirtualAddress": { "FileOffset": 388, "Offset": 12, "Value": 8192 }, "SizeOfRawData": { "FileOffset": 392, "Offset": 16, "Value": 7168 }, "PointerToRawData": { "FileOffset": 396, "Offset": 20, "Value": 512 }, "PointerToRelocations": { "FileOffset": 400, "Offset": 24, "Value": 0 }, "PointerToLinenumbers": { "FileOffset": 404, "Offset": 28, "Value": 0 }, "NumberOfRelocations": { "FileOffset": 408, "Offset": 32, "Value": 0 }, "NumberOfLinenumbers": { "FileOffset": 410, "Offset": 34, "Value": 0 }, "Characteristics": { "FileOffset": 412, "Offset": 36, "Value": 1610612768 }, "Flags": [ "IMAGE_SCN_CNT_CODE", "IMAGE_SCN_MEM_EXECUTE", "IMAGE_SCN_MEM_READ" ], "Entropy": 5.312053634802128, "MD5": "3622aa030f4a1f45fae880db94dd6e58", "SHA1": "85d9dfca1ed27be856f1fccc6898940903582895", "SHA256": "3c5510d2f545515f943715fe6d70a3df6b8d6d1b3e71aa50ab73a84de61be224", "SHA512": "09e657ec9bd5bfa60b6449e63cdc61a8fe28c989789e14a8109db4e3cb242fdb258dac841fee24b1488b405e4dbd6e5caf78549764f2becb598394797a2cc141" }, { "Structure": "IMAGE_SECTION_HEADER", "Name": { "FileOffset": 416, "Offset": 0, "Value": ".rsrc\\x00\\x00\\x00" }, 
"Misc": { "FileOffset": 424, "Offset": 8, "Value": 688 }, "Misc_PhysicalAddress": { "FileOffset": 424, "Offset": 8, "Value": 688 }, "Misc_VirtualSize": { "FileOffset": 424, "Offset": 8, "Value": 688 }, "VirtualAddress": { "FileOffset": 428, "Offset": 12, "Value": 16384 }, "SizeOfRawData": { "FileOffset": 432, "Offset": 16, "Value": 1024 }, "PointerToRawData": { "FileOffset": 436, "Offset": 20, "Value": 7680 }, "PointerToRelocations": { "FileOffset": 440, "Offset": 24, "Value": 0 }, "PointerToLinenumbers": { "FileOffset": 444, "Offset": 28, "Value": 0 }, "NumberOfRelocations": { "FileOffset": 448, "Offset": 32, "Value": 0 }, "NumberOfLinenumbers": { "FileOffset": 450, "Offset": 34, "Value": 0 }, "Characteristics": { "FileOffset": 452, "Offset": 36, "Value": 1073741888 }, "Flags": [ "IMAGE_SCN_CNT_INITIALIZED_DATA", "IMAGE_SCN_MEM_READ" ], "Entropy": 2.2461341734636235, "MD5": "54371107ba38386c1a7c0d2f6e8cb71e", "SHA1": "074a53a6946f543d03ec1f4bd60ed635db99db8a", "SHA256": "54be4b9cb66d217fadb7e478d1c18994b044bf3b9fb07b967527f6bba3302bd5", "SHA512": "9593dd761eeab980b8ff74288180c623e45e67aa98d0cbc2110b82aafeb98a5c0245f06c3c7c8e2256f97471daf1118a38af8ccddea63154ba98e6ee73a78735" }, { "Structure": "IMAGE_SECTION_HEADER", "Name": { "FileOffset": 456, "Offset": 0, "Value": ".reloc\\x00\\x00" }, "Misc": { "FileOffset": 464, "Offset": 8, "Value": 12 }, "Misc_PhysicalAddress": { "FileOffset": 464, "Offset": 8, "Value": 12 }, "Misc_VirtualSize": { "FileOffset": 464, "Offset": 8, "Value": 12 }, "VirtualAddress": { "FileOffset": 468, "Offset": 12, "Value": 24576 }, "SizeOfRawData": { "FileOffset": 472, "Offset": 16, "Value": 512 }, "PointerToRawData": { "FileOffset": 476, "Offset": 20, "Value": 8704 }, "PointerToRelocations": { "FileOffset": 480, "Offset": 24, "Value": 0 }, "PointerToLinenumbers": { "FileOffset": 484, "Offset": 28, "Value": 0 }, "NumberOfRelocations": { "FileOffset": 488, "Offset": 32, "Value": 0 }, "NumberOfLinenumbers": { "FileOffset": 490, "Offset": 34, "Value": 0 }, "Characteristics": { "FileOffset": 492, "Offset": 36, "Value": 1107296320 }, "Flags": [ "IMAGE_SCN_CNT_INITIALIZED_DATA", "IMAGE_SCN_MEM_DISCARDABLE", "IMAGE_SCN_MEM_READ" ], "Entropy": 0.08153941234324169, "MD5": "d9e08422d3077fe0be94f8ec16840100", "SHA1": "c4f4de8e850f478b5f7aedc719098b7283936488", "SHA256": "54543210ea95b9f1c287825f3361e06499aa1f6f32967ddd70efe5086694efd5", "SHA512": "757186cbe73a86467b3cd9dc69b3363806fbe88332ff07948399404ff2c8f3d36e603a230992c09cdb0776bca5bcc58296540d913bed535ab1ddcab1a71ee276" } ], "Directories": [ { "Structure": "IMAGE_DIRECTORY_ENTRY_EXPORT", "VirtualAddress": { "FileOffset": 248, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 252, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_IMPORT", "VirtualAddress": { "FileOffset": 256, "Offset": 0, "Value": 15192 }, "Size": { "FileOffset": 260, "Offset": 4, "Value": 83 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_RESOURCE", "VirtualAddress": { "FileOffset": 264, "Offset": 0, "Value": 16384 }, "Size": { "FileOffset": 268, "Offset": 4, "Value": 688 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_EXCEPTION", "VirtualAddress": { "FileOffset": 272, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 276, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_SECURITY", "VirtualAddress": { "FileOffset": 280, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 284, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_BASERELOC", "VirtualAddress": { "FileOffset": 288, "Offset": 0, "Value": 24576 }, "Size": 
{ "FileOffset": 292, "Offset": 4, "Value": 12 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_DEBUG", "VirtualAddress": { "FileOffset": 296, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 300, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_COPYRIGHT", "VirtualAddress": { "FileOffset": 304, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 308, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_GLOBALPTR", "VirtualAddress": { "FileOffset": 312, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 316, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_TLS", "VirtualAddress": { "FileOffset": 320, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 324, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG", "VirtualAddress": { "FileOffset": 328, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 332, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT", "VirtualAddress": { "FileOffset": 336, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 340, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_IAT", "VirtualAddress": { "FileOffset": 344, "Offset": 0, "Value": 8192 }, "Size": { "FileOffset": 348, "Offset": 4, "Value": 8 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT", "VirtualAddress": { "FileOffset": 352, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 356, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR", "VirtualAddress": { "FileOffset": 360, "Offset": 0, "Value": 8200 }, "Size": { "FileOffset": 364, "Offset": 4, "Value": 72 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_RESERVED", "VirtualAddress": { "FileOffset": 368, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 372, "Offset": 4, "Value": 0 } } ], "Version Information": [ [ { "Structure": "VS_VERSIONINFO", "Length": { "FileOffset": 7768, "Offset": 0, "Value": 600 }, "ValueLength": { "FileOffset": 7770, "Offset": 2, "Value": 52 }, "Type": { "FileOffset": 7772, "Offset": 4, "Value": 0 } }, { "Structure": "VS_FIXEDFILEINFO", "Signature": { "FileOffset": 7808, "Offset": 0, "Value": 4277077181 }, "StrucVersion": { "FileOffset": 7812, "Offset": 4, "Value": 65536 }, "FileVersionMS": { "FileOffset": 7816, "Offset": 8, "Value": 720896 }, "FileVersionLS": { "FileOffset": 7820, "Offset": 12, "Value": 0 }, "ProductVersionMS": { "FileOffset": 7824, "Offset": 16, "Value": 720896 }, "ProductVersionLS": { "FileOffset": 7828, "Offset": 20, "Value": 0 }, "FileFlagsMask": { "FileOffset": 7832, "Offset": 24, "Value": 63 }, "FileFlags": { "FileOffset": 7836, "Offset": 28, "Value": 0 }, "FileOS": { "FileOffset": 7840, "Offset": 32, "Value": 4 }, "FileType": { "FileOffset": 7844, "Offset": 36, "Value": 2 }, "FileSubtype": { "FileOffset": 7848, "Offset": 40, "Value": 0 }, "FileDateMS": { "FileOffset": 7852, "Offset": 44, "Value": 0 }, "FileDateLS": { "FileOffset": 7856, "Offset": 48, "Value": 0 } } ] ], "Imported symbols": [ [ { "Structure": "IMAGE_IMPORT_DESCRIPTOR", "OriginalFirstThunk": { "FileOffset": 7512, "Offset": 0, "Value": 15232 }, "Characteristics": { "FileOffset": 7512, "Offset": 0, "Value": 15232 }, "TimeDateStamp": { "FileOffset": 7516, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "ForwarderChain": { "FileOffset": 7520, "Offset": 8, "Value": 0 }, "Name": { "FileOffset": 7524, "Offset": 12, "Value": 15262 }, "FirstThunk": { "FileOffset": 7528, "Offset": 16, "Value": 8192 } }, { "DLL": "mscoree.dll", "Name": "_CorDllMain", "Hint": 0 } ] ], "Resource directory": [ { 
"Structure": "IMAGE_RESOURCE_DIRECTORY", "Characteristics": { "FileOffset": 7680, "Offset": 0, "Value": 0 }, "TimeDateStamp": { "FileOffset": 7684, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "MajorVersion": { "FileOffset": 7688, "Offset": 8, "Value": 0 }, "MinorVersion": { "FileOffset": 7690, "Offset": 10, "Value": 0 }, "NumberOfNamedEntries": { "FileOffset": 7692, "Offset": 12, "Value": 0 }, "NumberOfIdEntries": { "FileOffset": 7694, "Offset": 14, "Value": 1 } }, { "Id": [ 16, "RT_VERSION" ], "Structure": "IMAGE_RESOURCE_DIRECTORY_ENTRY", "Name": { "FileOffset": 7696, "Offset": 0, "Value": 16 }, "OffsetToData": { "FileOffset": 7700, "Offset": 4, "Value": 2147483672 } }, [ { "Structure": "IMAGE_RESOURCE_DIRECTORY", "Characteristics": { "FileOffset": 7704, "Offset": 0, "Value": 0 }, "TimeDateStamp": { "FileOffset": 7708, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "MajorVersion": { "FileOffset": 7712, "Offset": 8, "Value": 0 }, "MinorVersion": { "FileOffset": 7714, "Offset": 10, "Value": 0 }, "NumberOfNamedEntries": { "FileOffset": 7716, "Offset": 12, "Value": 0 }, "NumberOfIdEntries": { "FileOffset": 7718, "Offset": 14, "Value": 1 } }, { "Id": 1, "Structure": "IMAGE_RESOURCE_DIRECTORY_ENTRY", "Name": { "FileOffset": 7720, "Offset": 0, "Value": 1 }, "OffsetToData": { "FileOffset": 7724, "Offset": 4, "Value": 2147483696 } }, [ { "Structure": "IMAGE_RESOURCE_DIRECTORY", "Characteristics": { "FileOffset": 7728, "Offset": 0, "Value": 0 }, "TimeDateStamp": { "FileOffset": 7732, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "MajorVersion": { "FileOffset": 7736, "Offset": 8, "Value": 0 }, "MinorVersion": { "FileOffset": 7738, "Offset": 10, "Value": 0 }, "NumberOfNamedEntries": { "FileOffset": 7740, "Offset": 12, "Value": 0 }, "NumberOfIdEntries": { "FileOffset": 7742, "Offset": 14, "Value": 1 } }, { "LANG": 0, "SUBLANG": 0, "LANG_NAME": "LANG_NEUTRAL", "SUBLANG_NAME": "SUBLANG_NEUTRAL", "Structure": "IMAGE_RESOURCE_DATA_ENTRY", "Name": { "FileOffset": 7744, "Offset": 0, "Value": 0 }, "OffsetToData": { "FileOffset": 7752, "Offset": 0, "Value": 16472 }, "Size": { "FileOffset": 7756, "Offset": 4, "Value": 600 }, "CodePage": { "FileOffset": 7760, "Offset": 8, "Value": 0 }, "Reserved": { "FileOffset": 7764, "Offset": 12, "Value": 0 } } ] ] ], "Base relocations": [ [ { "Structure": "IMAGE_BASE_RELOCATION", "VirtualAddress": { "FileOffset": 8704, "Offset": 0, "Value": 12288 }, "SizeOfBlock": { "FileOffset": 8708, "Offset": 4, "Value": 12 } }, { "RVA": 15280, "Type": "HIGHLOW" }, { "RVA": 12288, "Type": "ABSOLUTE" } ] ] } } ================================================ FILE: plot.py ================================================ # Copyright 2020, Sophos Limited. All rights reserved. # # 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of # Sophos Limited and Sophos Group. All other product and company # names mentioned are trademarks or registered trademarks of their # respective owners. 

================================================
FILE: plot.py
================================================
# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

import baker
import matplotlib

matplotlib.use('Agg')
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
import pandas as pd
import numpy as np
import json

default_tags = ['adware_tag', 'flooder_tag', 'ransomware_tag', 'dropper_tag', 'spyware_tag',
                'packed_tag', 'crypto_miner_tag', 'file_infector_tag', 'installer_tag',
                'worm_tag', 'downloader_tag']
default_tag_colors = ['r', 'r', 'r', 'g', 'g', 'b', 'b', 'm', 'm', 'c', 'c']
default_tag_linestyles = [':', '--', '-.', ':', '--', ':', '--', ':', '--', ':', '--']

style_dict = {tag: (color, linestyle) for tag, color, linestyle in
              zip(default_tags, default_tag_colors, default_tag_linestyles)}
style_dict['malware'] = ('k', '-')


def collect_dataframes(run_id_to_filename_dictionary):
    """Load a results dataframe for each run ID in the provided dictionary."""
    loaded_dataframes = {}
    for k, v in run_id_to_filename_dictionary.items():
        loaded_dataframes[k] = pd.read_csv(v)
    return loaded_dataframes


def get_tprs_at_fpr(result_dataframe, key, target_fprs=None):
    """
    Estimate the True Positive Rate for a dataframe/key combination at
    specific False Positive Rates of interest.

    :param result_dataframe: a pandas dataframe
    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is
        provided, the dataframe is expected to have columns named `pred_malware` and `label_malware`
    :param target_fprs: the FPRs at which to estimate the TPRs; either None (which uses the
        default np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])) or a 1-d numpy array
    :return: target_fprs and the corresponding TPRs
    """
    if target_fprs is None:
        target_fprs = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
    fpr, tpr, thresholds = get_roc_curve(result_dataframe, key)
    return target_fprs, np.interp(target_fprs, fpr, tpr)
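
# Illustrative only -- a minimal sketch of estimating TPR at fixed FPRs with the
# helper above (the results.csv path and its `label_malware`/`pred_malware`
# columns are hypothetical):
#
#   df = pd.read_csv('results.csv')
#   fprs, tprs = get_tprs_at_fpr(df, 'malware')
#   for f, t in zip(fprs, tprs):
#       print(f'TPR at FPR {f:g}: {t:.3f}')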

def get_roc_curve(result_dataframe, key):
    """
    Get the ROC curve for a single result in a dataframe.

    :param result_dataframe: a dataframe
    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is
        provided, the dataframe is expected to have columns named `pred_malware` and `label_malware`
    :return: false positive rates, true positive rates, and thresholds (all np.arrays)
    """
    labels = result_dataframe['label_{}'.format(key)]
    predictions = result_dataframe['pred_{}'.format(key)]
    return roc_curve(labels, predictions)


def get_auc_score(result_dataframe, key):
    """
    Get the Area Under the Curve for the indicated key in the dataframe.

    :param result_dataframe: a dataframe
    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is
        provided, the dataframe is expected to have columns named `pred_malware` and `label_malware`
    :return: the AUC for the ROC generated for the provided key
    """
    labels = result_dataframe['label_{}'.format(key)]
    predictions = result_dataframe['pred_{}'.format(key)]
    return roc_auc_score(labels, predictions)


def interpolate_rocs(id_to_roc_dictionary, eval_fpr_points=None):
    """
    Interpolate several sets of ROC results to a common set of evaluation (FPR) points, to allow
    computing e.g. a mean ROC, or the pointwise variance of the curve, across multiple model fittings.

    :param id_to_roc_dictionary: a dictionary mapping run IDs to results from get_roc_curve
        (or sklearn.metrics.roc_curve), i.e. tuples of the form (fpr, tpr, thresholds)
    :param eval_fpr_points: the set of FPR values at which to interpolate the results;
        defaults to np.logspace(-6, 0, 1000)
    :return: eval_fpr_points -- the common set of FPR points to which the TPRs have been interpolated
        interpolated_tprs -- a dictionary mapping each run ID to its TPRs, interpolated at
        the corresponding entries of eval_fpr_points
    """
    if eval_fpr_points is None:
        eval_fpr_points = np.logspace(-6, 0, 1000)

    interpolated_tprs = {}
    for k, (fpr, tpr, thresh) in id_to_roc_dictionary.items():
        interpolated_tprs[k] = np.interp(eval_fpr_points, fpr, tpr)

    return eval_fpr_points, interpolated_tprs


def plot_roc_with_confidence(id_to_dataframe_dictionary, key, filename, include_range=False, style=None,
                             std_alpha=.2, range_alpha=.1):
    """
    Compute the mean and standard deviation of the ROC curve from a sequence of results and
    plot it with shading.

    :param id_to_dataframe_dictionary: a dictionary mapping run IDs to results dataframes
    :param key: the name of the result to plot (see get_roc_curve)
    :param filename: the file in which to save the plot
    :param include_range: if True, also shade the min/max TPR range across runs
    :param style: an optional (color, linestyle) tuple; defaults to the entry in style_dict for key
    :param std_alpha: the alpha value for the one-standard-deviation shading
    :param range_alpha: the alpha value for the min/max range shading
    """
    if not len(id_to_dataframe_dictionary) > 1:
        raise ValueError("Need a minimum of 2 result sets to plot confidence region; found {}".format(
            len(id_to_dataframe_dictionary)
        ))
    if style is None:
        if key in style_dict:
            color, linestyle = style_dict[key]
        else:
            raise ValueError(
                "No default style information is available for key {}; please provide (color, linestyle)".format(key))
    else:
        color, linestyle = style

    id_to_roc_dictionary = {k: get_roc_curve(df, key) for k, df in id_to_dataframe_dictionary.items()}
    fpr_points, interpolated_tprs = interpolate_rocs(id_to_roc_dictionary)
    tpr_array = np.vstack([v for v in interpolated_tprs.values()])

    mean_tpr = tpr_array.mean(0)
    std_tpr = np.sqrt(tpr_array.var(0))

    aucs = np.array([get_auc_score(v, key) for v in id_to_dataframe_dictionary.values()])
    mean_auc = aucs.mean()
    min_auc = aucs.min()
    max_auc = aucs.max()
    std_auc = np.sqrt(aucs.var())

    plt.figure(figsize=(12, 12))
    plt.semilogx(fpr_points, mean_tpr, color + linestyle, linewidth=2.0,
                 label=rf"{key}: {mean_auc:5.3f}$\pm${std_auc:5.3f} [{min_auc:5.3f}-{max_auc:5.3f}]")
    plt.fill_between(fpr_points, mean_tpr - std_tpr, mean_tpr + std_tpr, color=color, alpha=std_alpha)
    if include_range:
        plt.fill_between(fpr_points, tpr_array.min(0), tpr_array.max(0), color=color, alpha=range_alpha)
    plt.legend()
    plt.xlim(1e-6, 1.0)
    plt.ylim([0., 1.])
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.savefig(filename)
    plt.clf()


def plot_tag_results(dataframe, filename):
    """Plot overlaid ROC curves, one per tag in default_tags, and save the figure to filename."""
    all_tag_rocs = {tag: get_roc_curve(dataframe, tag) for tag in default_tags}
    eval_fpr_pts, interpolated_rocs = interpolate_rocs(all_tag_rocs)

    plt.figure(figsize=(12, 12))
    for tag in default_tags:
        color, linestyle = style_dict[tag]
        auc = get_auc_score(dataframe, tag)
        plt.semilogx(eval_fpr_pts, interpolated_rocs[tag], color + linestyle, linewidth=2.0,
                     label=f"{tag}:{auc:5.3f}")

    plt.legend(loc='best')
    plt.xlim(1e-6, 1.0)
    plt.ylim([0., 1.])
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.savefig(filename)
    plt.clf()
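
# Illustrative only -- averaging interpolated ROCs across runs with the helpers
# above, given a hypothetical dict `dfs` mapping run IDs to results dataframes:
#
#   rocs = {run_id: get_roc_curve(df, 'malware') for run_id, df in dfs.items()}
#   fpr_pts, tprs = interpolate_rocs(rocs)
#   mean_tpr = np.vstack(list(tprs.values())).mean(0)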
""" id_to_resultfile_dict = {'run': results_file} id_to_dataframe_dict = collect_dataframes(id_to_resultfile_dict) plot_tag_results(id_to_dataframe_dict['run'], output_filename) @baker.command def plot_roc_distribution_for_tag(run_to_filename_json, output_filename, tag_to_plot='malware', linestyle=None, color=None, include_range=False, std_alpha=.2, range_alpha=.1): """ Compute the mean and standard deviation of the TPR at a range of FPRS (the ROC curve) over several sets of results for a given tag. The run_to_filename_json file must have the following format: {"run_id_0": "/full/path/to/results.csv/for/run/0/results.csv", "run_id_1": "/full/path/to/results.csv/for/run/1/results.csv", ... } :param run_to_filename_json: A json file that contains a key-value map that links run IDs to the full path to a results file (including the file name) :param output_filename: The filename to save the resulting figure to :param tag_to_plot: the tag from the results to plot; defaults to "malware" :param linestyle: the linestyle to use in the plot (defaults to the tag value in plot.style_dict) :param color: the color to use in the plot (defaults to the tag value in plot.style_dict) :param include_range: plot the min/max value as well (default False) :param std_alpha: the alpha value for the shading for standard deviation range (default 0.2) :param range_alpha: the alpha value for the shading for range, if plotted (default 0.1) """ id_to_resultfile_dict = json.load(open(run_to_filename_json, 'r')) id_to_dataframe_dict = collect_dataframes(id_to_resultfile_dict) if color is None or linestyle is None: if not (color is None and linestyle is None): raise ValueError("both color and linestyle should either be specified or None") style = None else: style = (color, linestyle) plot_roc_with_confidence(id_to_dataframe_dict, tag_to_plot, output_filename, include_range=include_range, style=style, std_alpha=std_alpha, range_alpha=range_alpha) if __name__ == '__main__': baker.run() ================================================ FILE: shas_missing_ember_features.json ================================================ [File too large to display: 21.7 MB] ================================================ FILE: train.py ================================================ # Copyright 2020, Sophos Limited. All rights reserved. # # 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of # Sophos Limited and Sophos Group. All other product and company # names mentioned are trademarks or registered trademarks of their # respective owners. from dataset import Dataset from nets import PENetwork import warnings import os import baker import torch import torch.nn.functional as F from torch.utils import data import sys from generators import get_generator from config import device import config from logzero import logger from copy import deepcopy import numpy as np from collections import defaultdict from sklearn.metrics import roc_auc_score import pickle import json import lightgbm as lgb def compute_loss(predictions, labels, loss_wts={'malware': 1.0, 'count': 0.1, 'tags': 0.1}): """ Compute losses for a malware feed-forward neural network (optionally with SMART tags and vendor detection count auxiliary losses). 

================================================
FILE: shas_missing_ember_features.json
================================================
[File too large to display: 21.7 MB]

================================================
FILE: train.py
================================================
# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

from dataset import Dataset
from nets import PENetwork
import warnings
import os
import baker
import torch
import torch.nn.functional as F
from torch.utils import data
import sys
from generators import get_generator
from config import device
import config
from logzero import logger
from copy import deepcopy
import numpy as np
from collections import defaultdict
from sklearn.metrics import roc_auc_score
import pickle
import json
import lightgbm as lgb


def compute_loss(predictions, labels, loss_wts={'malware': 1.0, 'count': 0.1, 'tags': 0.1}):
    """
    Compute losses for a malware feed-forward neural network (optionally with SMART tags and
    vendor detection count auxiliary losses).

    :param predictions: a dictionary of results from a PENetwork model
    :param labels: a dictionary of labels
    :param loss_wts: weights to assign to each head of the network (if the corresponding labels
        are present); defaults to the values used in the ALOHA paper (1.0 for malware,
        0.1 for the count and each tag)
    :return: a dictionary containing the detached per-head loss values and the weighted
        'total' loss tensor
    """
    loss_dict = {'total': 0.}
    if 'malware' in labels:
        malware_labels = labels['malware'].float().to(device)
        malware_loss = F.binary_cross_entropy(predictions['malware'].reshape(malware_labels.shape),
                                              malware_labels)
        weight = loss_wts['malware'] if 'malware' in loss_wts else 1.0
        loss_dict['malware'] = deepcopy(malware_loss.item())
        loss_dict['total'] += malware_loss * weight
    if 'count' in labels:
        count_labels = labels['count'].float().to(device)
        count_loss = torch.nn.PoissonNLLLoss()(predictions['count'].reshape(count_labels.shape),
                                               count_labels)
        weight = loss_wts['count'] if 'count' in loss_wts else 1.0
        loss_dict['count'] = deepcopy(count_loss.item())
        loss_dict['total'] += count_loss * weight
    if 'tags' in labels:
        tag_labels = labels['tags'].float().to(device)
        tags_loss = F.binary_cross_entropy(predictions['tags'], tag_labels)
        weight = loss_wts['tags'] if 'tags' in loss_wts else 1.0
        loss_dict['tags'] = deepcopy(tags_loss.item())
        loss_dict['total'] += tags_loss * weight
    return loss_dict
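
# Illustrative only -- a tiny smoke test of the loss aggregation above, using a
# hypothetical batch of four random malware-head predictions (runs on whatever
# `device` config.py selects):
#
#   preds = {'malware': torch.rand(4, 1).to(device)}
#   lbls = {'malware': torch.randint(0, 2, (4,))}
#   print(compute_loss(preds, lbls))  # e.g. {'total': tensor(...), 'malware': ...}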
""" workers = workers if workers is None else int(workers) os.system('mkdir -p {}'.format(checkpoint_dir)) if random_seed is not None: logger.info(f"Setting random seed to {int(random_seed)}.") torch.manual_seed(int(random_seed)) logger.info('...instantiating network') model = PENetwork(use_malware=True, use_counts=True, use_tags=True, n_tags=len(Dataset.tags), feature_dimension=feature_dimension).to(device) opt = torch.optim.Adam(model.parameters()) generator = get_generator(path=train_db_path, mode='train', use_malicious_labels=use_malicious_labels, use_count_labels=use_count_labels, use_tag_labels=use_tag_labels, num_workers = workers, remove_missing_features=remove_missing_features) val_generator = get_generator(path = train_db_path, mode='validation', use_malicious_labels=use_malicious_labels, use_count_labels=use_count_labels, use_tag_labels=use_tag_labels, num_workers=workers, remove_missing_features=remove_missing_features) steps_per_epoch = len(generator) val_steps_per_epoch = len(val_generator) for epoch in range(1, max_epochs + 1): loss_histories = defaultdict(list) model.train() for i, (features, labels) in enumerate(generator): opt.zero_grad() features = deepcopy(features).to(device) out = model(features) loss_dict = compute_loss(out, deepcopy(labels)) loss = loss_dict['total'] loss.backward() opt.step() for k in loss_dict.keys(): if k == 'total': loss_histories[k].append(deepcopy(loss_dict[k].detach().cpu().item())) else: loss_histories[k].append(loss_dict[k]) loss_str = " ".join([f"{key} loss:{value:7.3f}" for key, value in loss_dict.items()]) loss_str += " | " loss_str += " ".join([f"{key} mean:{np.mean(value):7.3f}" for key, value in loss_histories.items()]) sys.stdout.write('\r Epoch: {}/{} {}/{} '.format(epoch, max_epochs, i + 1, steps_per_epoch) + loss_str) sys.stdout.flush() del features, labels # do our best to avoid weird references that lead to generator errors torch.save(model.state_dict(), os.path.join(checkpoint_dir, "epoch_{}.pt".format(str(epoch)))) print() loss_histories = defaultdict(list) model.eval() for i, (features, labels) in enumerate(val_generator): features = deepcopy(features).to(device) with torch.no_grad(): out = model(features) loss_dict = compute_loss(out, deepcopy(labels)) loss = loss_dict['total'] for k in loss_dict.keys(): if k == 'total': loss_histories[k].append(deepcopy(loss_dict[k].detach().cpu().item())) else: loss_histories[k].append(loss_dict[k]) loss_str = " ".join([f"{key} loss:{value:7.3f}" for key, value in loss_dict.items()]) loss_str += " | " loss_str += " ".join([f"{key} mean:{np.mean(value):7.3f}" for key, value in loss_histories.items()]) sys.stdout.write('\r Val: {}/{} {}/{} '.format(epoch, max_epochs, i + 1, val_steps_per_epoch) + loss_str) sys.stdout.flush() del features, labels # do our best to avoid weird references that lead to generator errors print() print('...done') @baker.command def train_lightGBM(train_npz_file, validation_npz_file, model_configuration_file, checkpoint_dir, random_seed=None): """ Train a lightGBM model. Note that this is done entirely in-memory and requires a substantial amount of RAM (approximately 175GB). Baseline models were trained on an Amazon m5.24xlarge instance. 

@baker.command
def train_lightGBM(train_npz_file, validation_npz_file, model_configuration_file, checkpoint_dir, random_seed=None):
    """
    Train a LightGBM model. Note that this is done entirely in memory and requires a substantial
    amount of RAM (approximately 175 GB); baseline models were trained on an Amazon m5.24xlarge instance.

    :param train_npz_file: path to a .npz file containing features in 'arr_0' and labels in 'arr_1'
        for the training data
    :param validation_npz_file: path to a .npz file containing features in 'arr_0' and labels in 'arr_1'
        for the validation data
    :param model_configuration_file: path to a json file specifying LightGBM parameters
        (see lightgbm_config.json for an example)
    :param checkpoint_dir: location to write the trained model to
    :param random_seed: defaults to None (no seeding); otherwise, an integer providing a fixed
        random seed for the experiment
    """
    logger.info("Loading model config json file...")
    # Use a local name that does not shadow the module-level `config` import.
    with open(model_configuration_file, 'r') as f:
        params = json.load(f)
    if random_seed is not None:
        random_seed = int(random_seed)
        params['seed'] = random_seed
        params['bagging_seed'] = random_seed
        params['feature_fraction_seed'] = random_seed
    logger.info("Loading train data...")
    train_npz = np.load(train_npz_file)
    train_fts, train_lbls = train_npz['arr_0'], train_npz['arr_1']
    val_npz = np.load(validation_npz_file)
    val_fts, val_lbls = val_npz['arr_0'], val_npz['arr_1']
    logger.info("Converting data to lightgbm.Dataset")
    train_data = lgb.Dataset(train_fts, label=train_lbls)
    val_data = lgb.Dataset(val_fts, label=val_lbls)
    logger.info("Starting training")
    bst = lgb.train(params=params, train_set=train_data, valid_sets=[val_data])
    os.makedirs(checkpoint_dir, exist_ok=True)  # portable equivalent of `mkdir -p`
    modelfile = os.path.join(checkpoint_dir, 'lightgbm.model')
    logger.info(f"Saving model to {modelfile}")
    bst.save_model(modelfile)


if __name__ == '__main__':
    baker.run()
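
# Example invocation (hypothetical paths; the .npz inputs are assumed to have been
# produced with build_numpy_arrays_for_lightgbm.py):
#
#   python train.py train_lightGBM train.npz validation.npz lightgbm_config.json ./gbm_checkpoints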