[
  {
    "path": "LICENSE.md",
    "content": "\n                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright 2020 Sophos PLC\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "[SoReL-20M](#sorel-20m)\n\n[Terms of use](#terms-of-use)\n\n[Requirements](#requirements)\n\n[Downloading the data](#downloading-the-data)\n\n[A note on dataset size](#a-note-on-dataset-size)\n\n[Quickstart](#quickstart)\n\n[Neural network training](#neural-network-training)\n\n[LightGBM training](#lightgbm-training)\n\n[Frequently Asked Questions](#frequently-asked-questions)\n\n[Copyright and License](#copyright-and-license)\n\n\n\n\n\n# SoReL-20M\nSophos-ReversingLabs 20 Million dataset\n\nThe code included in this repository produced the baseline models available at `s3://sorel-20m/09-DEC-2020/baselines`.\n\nThis code depends on the SOREL dataset available via Amazon S3 at `s3://sorel-20m/09-DEC-2020/processed-data/`; to train the lightGBM models you can use the npz files available at `s3://sorel-20m/09-DEC-2020/lightGBM-features/` or use the scripts included here to extract the required files from the processed data.\n\nIf you use this code or this data in your own research, please cite our paper \"SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection\", found at https://arxiv.org/abs/2012.07634, using the following citation:\n```\n@misc{harang2020sorel20m,\n      title={SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection}, \n      author={Richard Harang and Ethan M. Rudd},\n      year={2020},\n      eprint={2012.07634},\n      archivePrefix={arXiv},\n      primaryClass={cs.CR}\n}\n```\n\n# Terms of use\n\nPlease read the [Terms of Use](https://github.com/sophos-ai/SOREL-20M/blob/master/Terms%20and%20Conditions%20of%20Use.pdf) before using this code or accessing the data.\n\n# Requirements\n\nPython 3.6+.  See `environment.yml` for additional package requirements.\n\n# Downloading the data\n\nIndividual files are available directly via HTTPS; e.g. 
you can download one of the baseline checkpoints from the URL `http://sorel-20m.s3.amazonaws.com/09-DEC-2020/baselines/checkpoints/FFNN/seed0/epoch_1.pt`\n\nFor a large number of files, we recommend using the [AWS command line interface](https://aws.amazon.com/cli/).  The SOREL-20M S3 bucket is public, so no credentials are required.  For example, to download all feedforward neural network checkpoints for all seeds, use the command `aws s3 cp s3://sorel-20m/09-DEC-2020/baselines/checkpoints/FFNN/ . --recursive`\n\nIt is possible to download the entire dataset this way; however, we strongly recommend reading about the [dataset size](#a-note-on-dataset-size) before doing so, to ensure that you will not incur bandwidth fees or exhaust your available disk space.\n\n# A note on dataset size\n\nThe full size of this dataset is approximately 8TB.  It is highly recommended that you only obtain the specific elements you need. Files larger than 1GB are noted below.\n\n```\ns3://sorel-20m/09-DEC-2020/\n|   Terms and Conditions of Use.pdf -- the terms you agree to by using this data and code\n|\n+---baselines\n|   +---checkpoints\n|   |   +---FFNN - per-epoch checkpoints for 5 seeds of the feed-forward neural network\n|   |   +---lightGBM - final trained lightGBM model for 5 seeds\n|   |\n|   +---results\n|       |  ffnn_results.json - index file of results, required for plotting\n|       |  lgbm_results.json - index file of results, required for plotting\n|       |\n|       +---FFNN\n|       |   +---seed0-seed4 - individual seed results, ~1GB each\n|       |\n|       +---lightgbm\n|           +---seed0-seed4 - individual seed results, ~1GB each\n|\n+---binaries\n|      approximately 8TB of zlib compressed malware binaries\n|\n+---lightGBM-features\n|      test-features.npz - array of test data for lightGBM; 37GB\n|      train-features.npz - array of training data for lightGBM; 113GB\n|      validation-features.npz - array of validation data for lightGBM; 
22GB\n|\n+---processed-data\n    |   meta.db - contains index, labels, tags, and counts for the data; 3.5GB\n    |\n    +---ember_features - LMDB directory with baseline features, ~72GB\n    +---pe_metadata - LMDB directory with full metadata dumps, ~480GB\n```\n\nNote: values in the LMDB files are serialized via msgpack and compressed via zlib; the code in this repository handles this extraction automatically, but you will need to decompress and deserialize by hand if you use your own code to handle the data.\n\nPlease see the file `./pe_full_metadata_example/32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json` for an example of the metadata contained in the pe_metadata lmdb database.\n\n# Quickstart\n\nThe main scripts of interest are:\n1. `train.py` for training deep learning or (on a machine with sufficient RAM) LightGBM models\n2. `evaluate.py` for taking a pretrained model and producing a results csv\n3. `plot.py` for plotting the results\n\nAll scripts have multiple commands, documented via `--help`.\n\nOnce you have cloned the repository, enter the repository directory and create a conda environment:\n\n```\ncd SoReL-20M\nconda env create -f environment.yml\nconda activate sorel\n```\n\nEnsure that you have the SOREL processed data in a local directory.  Edit `config.py` to indicate the device to use (CPU or CUDA) as well as the dataset location and desired checkpoint directory.  The dataset location should point to the folder that contains the `meta.db` file.\n\n\n*Please note*: the complete contents of processed-data require approximately 552 GB of disk space, the bulk of which is the PE metadata, which is not used in training the baseline models.  
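If you use your own code to read the LMDB data, each stored value must be zlib-decompressed and then msgpack-deserialized, as noted above. A minimal decoding sketch follows (the function name here is our own; the dict-with-integer-key-0 layout for feature vectors follows the serialization notes in the FAQ below):

```python
import zlib

import msgpack  # third-party dependency, included in the environment.yml packages


def decode_lmdb_value(raw_bytes):
    """Reverse the storage encoding of an LMDB value: zlib-decompress,
    then msgpack-deserialize. Feature vectors are stored as a dict mapping
    the integer key 0 to a 1-d list of floats."""
    obj = msgpack.loads(zlib.decompress(raw_bytes), strict_map_key=False)
    return obj[0]


# Round trip with a toy feature vector, showing the encoding direction as well:
features = [0.0, 1.5, -2.25]
stored = zlib.compress(msgpack.packb({0: features}))
assert decode_lmdb_value(stored) == features
```

The same two steps in reverse (msgpack-serialize, then zlib-compress) produce values suitable for insertion into your own copy of the LMDB database.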
If you only wish to retrain the baseline models, then you will need only the following files (approximately 78GB in total): \n\n```\n/meta.db\n/ember_features/data.mdb\n/ember_features/lock.mdb\n```\n\nThe file `shas_missing_ember_features.json` within this repository contains a list of sha256 values that indicate samples for which no Ember v2 feature values could be extracted; it is _highly recommended_ that the location of this file be passed to the `--remove_missing_features` parameter in `train.train_network`, `evaluate.evaluate_network`, and `evaluate.evaluate_lgb` to significantly speed up the data loading time. If it is not provided, you should specify `--remove_missing_features='scan'`, which will scan all keys to check for and remove ones with missing features prior to building the dataloader; if the dataloader encounters a missing feature, it will cause an error.\n\nYou can train a neural network model with the following (note that `config.py` values can be overridden via command line switches):\n\n```\npython train.py train_network --remove_missing_features=shas_missing_ember_features.json \n```\n\nAssuming that the checkpoint has been written to /home/ubuntu/checkpoints/ and you wish to place the results.csv file in /home/ubuntu/results/0, you may produce a test set evaluation as follows:\n\n```\npython evaluate.py evaluate_network /home/ubuntu/results/0 /home/ubuntu/checkpoints/epoch_9.pt \n```\n\nTo enable plotting of multiple series, the `plot.plot_roc_distribution_for_tag` function requires a json file that maps the name for a particular run to the results.csv file for that run.  
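As a hypothetical illustration (the run names and paths below are invented, not part of the distributed data), such a mapping file might look like:

```json
{
    "run_seed0": "/home/ubuntu/results/0/results.csv",
    "run_seed1": "/home/ubuntu/results/1/results.csv"
}
```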
\n\n```\n# Re-plot baselines -- note that the below command assumes \n# that the baseline models at s3://sorel-20m/09-DEC-2020/baselines\n# have been downloaded to the /baselines directory\npython plot.py plot_roc_distribution_for_tag /baselines/results/ffnn_results.json ./ffnn_results.png\n```\n\n# Neural network training\n\nWhile a GPU allows for faster training (10 epochs can be completed in approximately 90 minutes), this model can also be trained via CPU; the provided results were obtained via GPU on an Amazon g3.4xlarge EC2 instance starting with a \"Deep Learning AMI (Ubuntu 16.04) Version 26.0 (ami-025ed45832b817a35)\" and updating it as above.  In practice, disk I/O loading features from the feature database seems to be a rate-limiting step assuming a GPU is used, so running on a machine with multiple cores and using a drive with high IOPS is recommended.  Training the network requires approximately 12GB of RAM when trained via CPU, though it varies slightly with the number of cores.  It is also highly recommended to use the `--remove_missing_features=shas_missing_ember_features.json` option, as this significantly improves the loading time of the data.\n\nNote: if you get an error message `RuntimeError: received 0 items of ancdata`, this is typically caused by the limit on the maximum number of open files being too low; this may be increased via the `ulimit` command.  In some cases -- if you use a large number of parallel workers -- it may also be necessary to increase shared memory.\n\nThe commands to train and evaluate a neural network model are:\n\n```\npython train.py train_network\npython evaluate.py evaluate_network\n```\n\nUse `--help` for either script to see details and options.  The model itself is given in `nets.PENetwork`.\n\n# LightGBM training\n\nDue to the size of the dataset, training a boosted model is difficult.  
We use lightGBM, which has relatively memory-efficient data handlers, allowing it to fit a model in-memory using approximately 175GB of RAM.  The lightGBM model provided in this repository was trained on an Amazon m5.24xlarge instance.  \n\nThe script `build_numpy_arrays_for_lightgbm.py` will dump the training, validation, and testing datasets (one per invocation, selected via its `mode` argument) to .npz files in the specified data location that can then be used for training a LightGBM model.  Please note that these files will be extremely large (approximately 113GB, 22GB, and 37GB, respectively) using the provided Ember features.\n\nAlternatively, you may use the pre-extracted npz files available at `s3://sorel-20m/09-DEC-2020/lightGBM-features/`, which contain Ember features using the default time splits for training, validation, and testing.\n\nThe lightGBM model can be trained in much the same manner as the neural network:\n\n```\npython train.py train_lightGBM --train_npz_file=/dataset/train-features.npz --validation_npz_file=/dataset/validation-features.npz --model_configuration_file=./lightgbm_config.json --checkpoint_dir=/dataset/baselines/checkpoints/lightGBM/run0/\n```\n\nAssuming that you've placed the S3 dataset in /dataset as suggested above, this command will perform a single evaluation run:\n\n```\npython evaluate.py evaluate_lgb /dataset/baselines/checkpoints/lightGBM/seed0/lightgbm.model /home/ubuntu/lightgbm_eval --remove_missing_features=./shas_missing_ember_features.json \n```\n\nThe function used to generate the numpy array files from the database is `build_numpy_arrays_for_lightgbm.dump_data_to_numpy`.  Note that this script requires approximately as much memory as training the model; an m5.24xlarge or equivalent EC2 instance type is recommended.\n\n# Frequently Asked Questions\n\n**Are there any benign samples available?**\n\nUnfortunately, due to the risk of intellectual property violations, we are not able to make the benign samples freely available. 
The samples are available via ReversingLabs, and anecdotally a large number of them also appear to be available via VirusTotal. We are not able to provide any further assistance in this respect.\n\n**I computed the SHA256 for a malware sample and it's different from the SHA256 value suggested by the file name; why?**\n\nAll malware samples have been disarmed as described below; the SHA256 value in the file name is for the original, unmodified file.\n\n**How were the files disarmed?**\n\nThe OptionalHeader.Subsystem flag and the FileHeader.Machine header value were both set to 0 to prevent accidental execution of the files.  \n\n**Can you provide a tool to re-arm the files, or the original non-disarmed file?**\n\nUnfortunately, we cannot assist anyone in re-arming the file or in obtaining the original, non-disarmed sample.  As with the benign files, they are available via ReversingLabs, and also a large number of them appear to be available via VirusTotal. \n\n**How are the malware/benign labels determined?**\n\nWe use a combination of non-public, internal information as well as a number of static rules and analyses to obtain the ground truth labels.\n\n**Isn't releasing this data dangerous?**\n\nAs we describe in our [blog post](https://ai.sophos.com/2020/12/14/sophos-reversinglabs-sorel-20-million-sample-malware-dataset/):\n\n> The malware we’re releasing is “disarmed” so that it will not execute.  This means it would take knowledge, skill, and time to reconstitute the samples and get them to actually run.  That said, we recognize that there is at least some possibility that a skilled attacker could learn techniques from these samples or use samples from the dataset to assemble attack tools to use as part of their malicious activities.  However, in reality, there are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use. 
In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers. \n\n**Is the feature extraction code available for me to apply to my own samples?**\n\nThe feature extraction function is available from the [EMBER repository](https://github.com/elastic/ember/) -- specifically we used the `PEFeatureExtractor.feature_vector()` method in [features.py](https://github.com/elastic/ember/blob/master/ember/features.py).\n\nWe parallelized this code and constructed the dataset using Sophos AI internal tools, and are unable to provide this code; please see below for some notes on feature extraction and extending the dataset.\n\n**How can I add additional files/features to the dataset?**\n\nWe are not accepting additional data for the main dataset. To add new features, files, or both to your own personal copy of it, we have the following recommendations:\n\n1. The `meta.db` sqlite file serves as the index for the LMDB database, and contains metadata and labels.  At a minimum, for each file, the sqlite database should contain columns for: the file sha256, the malware label, and a first-seen timestamp.\n2. To serialize a feature vector to a LMDB database, each individual sample's feature vector needs to be encoded into a dictionary with a key of zero and a value that is a 1-d list of floats, then serialized via msgpack and compressed via zlib, then inserted into an LMDB database with a key as the hash of the original file.  If you are extracting new features for the existing files, it's important to note that the filenames of the samples are the sha256 values of the original, non-disarmed files, and so you should just re-use that filename rather than compute the hash of the file yourself. \n3. 
We obtained best performance for feature extraction using RAM disks wherever possible -- at a minimum for the files from which features are being extracted, and, if memory permits, for the LMDB databases as well.\n\n**What are the .npz files and how do they differ from the LMDB data?**\n\nThe .npz files in the lightGBM-features directory contain features that are identical to the features in the LMDB database (with training, validation, and test splits given as per the timestamps in `config.py`) but converted to flat numpy arrays for convenience in training the lightGBM models. They contain only binary labels, no tag information.\n\n**The values for the \"tag\" columns are counts, not binary values; why?**\n\nAs described in our [paper](https://arxiv.org/abs/1905.06262) on the tag generation, we parse vendor threat feed information for tokens indicative of the behavioral category of the malware; the values in these columns denote the number of tokens we identified for that tag for that sample. They may be taken as correlated with the degree of certainty in the tag, but they are not calibrated to a standard scale. For most applications we suggest binarizing these values by zero/non-zero.\n\n\n# Copyright and License\n\nCopyright 2020, Sophos Limited. All rights reserved.\n\n'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\nSophos Limited and Sophos Group. 
All other product and company\nnames mentioned are trademarks or registered trademarks of their\nrespective owners.\n\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use these files except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n"
  },
  {
    "path": "build_numpy_arrays_for_lightgbm.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\n# \n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\n# Sophos Limited and Sophos Group. All other product and company\n# names mentioned are trademarks or registered trademarks of their\n# respective owners.\n\nimport baker\nfrom copy import deepcopy\nimport sys\nimport numpy as np\nfrom config import db_path\nfrom generators import get_generator\n\n\n@baker.command\ndef dump_data_to_numpy(mode, output_file, workers=1, batchsize=1000, remove_missing_features='scan'):\n    \"\"\"\n    Produce numpy files required for training a lightgbm model from the SQLite + LMDB database.\n\n    :param mode: One of 'train', 'validation', or 'test' representing which set of the\n        data to process to file. Splits are obtained based on timestamps in config.py\n    :param output_file: The name of the output file to produce for the indicated split.\n    :param workers: How many worker processes to use (default 1)\n    :param batchsize: The batch size to use in collecting samples (default 1000)\n    :param remove_missing_features: How to check for and remove missing features; see\n        README.md for recommendations (default 'scan')\n    \"\"\"\n    _generator = get_generator(path=db_path,\n                               mode=mode,\n                               batch_size=batchsize,\n                               use_malicious_labels=True,\n                               use_count_labels=False,\n                               use_tag_labels=False,\n                               num_workers=workers,\n                               remove_missing_features=remove_missing_features,\n                               shuffle=False)\n    feature_array = []\n    label_array = []\n    for i, (features, labels) in enumerate(_generator):\n        feature_array.append(deepcopy(features.numpy()))\n        
label_array.append(deepcopy(labels['malware'].numpy()))\n        sys.stdout.write(f\"\\r{i} / {len(_generator)}\")\n        sys.stdout.flush()\n    # concatenate the per-batch arrays so the saved arrays are flat, even when the\n    # final batch is smaller than batchsize\n    np.savez(output_file, np.concatenate(feature_array), np.concatenate(label_array))\n    print(f\"\\nWrote output to {output_file}\")\n\n\nif __name__ == '__main__':\n    baker.run()\n"
  },
  {
    "path": "config.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\r\n# \r\n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\r\n# Sophos Limited and Sophos Group. All other product and company\r\n# names mentioned are trademarks or registered trademarks of their\r\n# respective owners.\r\n\r\n\r\n# set this to the desired device, e.g. 'cuda:0' if a GPU is available\r\ndevice = 'cuda:0'\r\n#device = 'cpu'\r\n\r\n# NOTE -- if you change the below values, your results will not be comparable with those from\r\n# \t\t  other users of this data set.\r\n# This is the timestamp that divides the validation data (used to check convergence/overfitting)\r\n# from test data (used to assess final performance)\r\nvalidation_test_split = 1547279640.0\r\n# This is the timestamp that splits training data from validation data\r\ntrain_validation_split = 1543542570.0\r\n\r\n# modify these paths as needed to point to the directory that contains the meta.db file\r\n# and to indicate where the checkpoints should be placed during model training\r\ndb_path = '/dataset/SoReL20M'\r\ncheckpoint_dir = '/dataset/checkpoints'\r\n\r\n# adjust the batch size as needed given memory/bus constraints\r\nbatch_size = 8192\r\n"
  },
  {
    "path": "dataset.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\r\n# \r\n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\r\n# Sophos Limited and Sophos Group. All other product and company\r\n# names mentioned are trademarks or registered trademarks of their\r\n# respective owners.\r\n\r\n\r\nfrom torch.utils import data\r\nimport lmdb\r\nimport sqlite3\r\nimport baker\r\nimport msgpack\r\nimport zlib\r\nimport numpy as np\r\nimport os\r\nimport tqdm\r\nfrom logzero import logger\r\n\r\nimport config\r\nimport json\r\n\r\nclass LMDBReader(object):\r\n\r\n    def __init__(self, path, postproc_func=None):\r\n        self.env = lmdb.open(path, readonly=True, map_size=1e13, max_readers=1024)\r\n        self.postproc_func = postproc_func\r\n\r\n    def __call__(self, key):\r\n        with self.env.begin() as txn:\r\n            x = txn.get(key.encode('ascii'))\r\n        if x is None:return None\r\n        x = msgpack.loads(zlib.decompress(x),strict_map_key=False)\r\n        if self.postproc_func is not None:\r\n            x = self.postproc_func(x)\r\n        return x\r\n\r\n\r\ndef features_postproc_func(x):\r\n    x = np.asarray(x[0], dtype=np.float32)\r\n    lz = x < 0\r\n    gz = x > 0\r\n    x[lz] = - np.log(1 - x[lz])\r\n    x[gz] = np.log(1 + x[gz])\r\n    return x\r\n\r\n\r\ndef tags_postproc_func(x):\r\n    x = list(x[b'labels'].values())\r\n    x = np.asarray(x)\r\n    return x\r\n\r\n\r\nclass Dataset(data.Dataset):\r\n    tags = [\"adware\", \"flooder\", \"ransomware\", \"dropper\", \"spyware\", \"packed\",\r\n            \"crypto_miner\", \"file_infector\", \"installer\", \"worm\", \"downloader\"]\r\n\r\n    def __init__(self, metadb_path, features_lmdb_path,\r\n                 return_malicious=True, return_counts=True, return_tags=True, return_shas=False,\r\n                 mode='train', binarize_tag_labels=True, n_samples=None, remove_missing_features=True,\r\n                 postprocess_function=features_postproc_func):\r\n\r\n 
       self.return_counts = return_counts\r\n        self.return_tags = return_tags\r\n        self.return_malicious = return_malicious\r\n        self.return_shas = return_shas\r\n\r\n        self.features_lmdb_reader = LMDBReader(features_lmdb_path, postproc_func=postprocess_function)\r\n\r\n        retrieve = [\"sha256\"]\r\n        if return_malicious:\r\n            retrieve += [\"is_malware\"]\r\n        if return_counts:\r\n            retrieve += [\"rl_ls_const_positives\"]\r\n        if return_tags:\r\n            retrieve.extend(Dataset.tags)\r\n\r\n        conn = sqlite3.connect(metadb_path)\r\n        cur = conn.cursor()\r\n        query = 'select ' + ','.join(retrieve)\r\n        query += \" from meta\"\r\n\r\n        if mode == 'train':\r\n            query += ' where(rl_fs_t <= {})'.format(config.train_validation_split)\r\n        elif mode == 'validation':\r\n            query += ' where((rl_fs_t >= {}) and (rl_fs_t < {}))'.format(config.train_validation_split,\r\n                                                                         config.validation_test_split)\r\n        elif mode == 'test':\r\n            query += ' where(rl_fs_t >= {})'.format(config.validation_test_split)\r\n        else:\r\n            raise ValueError('invalid mode: {}'.format(mode))\r\n\r\n        logger.info('Opening Dataset at {} in {} mode.'.format(metadb_path, mode))\r\n\r\n        if n_samples is not None:\r\n            query += ' limit {}'.format(n_samples)\r\n        vals = cur.execute(query).fetchall()\r\n        conn.close()\r\n        logger.info(f\"{len(vals)} samples loaded.\")\r\n        # map the items we're retrieving to an index\r\n        retrieve_ind = dict(zip(retrieve, list(range(len(retrieve)))))\r\n\r\n        if remove_missing_features == 'scan':\r\n            indexes_to_remove = []\r\n            logger.info(\"Checking dataset for keys with missing 
features.\")\r\n            temp_env = lmdb.open(features_lmdb_path, readonly=True, map_size=1e13, max_readers=256)\r\n            with temp_env.begin() as txn:\r\n                for index, item in tqdm.tqdm(enumerate(vals), total=len(vals), mininterval=.5, smoothing=0.):\r\n                    if txn.get(item[retrieve_ind['sha256']].encode('ascii')) is None:\r\n                        indexes_to_remove.append(index)\r\n            indexes_to_remove = set(indexes_to_remove)\r\n            vals = [value for index, value in enumerate(vals) if index not in indexes_to_remove]\r\n            logger.info(f\"{len(indexes_to_remove)} samples had no associated feature and were removed.\")\r\n            logger.info(f\"Dataset now has {len(vals)} samples.\")\r\n        elif (remove_missing_features is False) or (remove_missing_features is None):\r\n            pass\r\n        else:\r\n            # assume filepath\r\n            logger.info(f\"Trying to load shas to ignore from {remove_missing_features}...\")\r\n            with open(remove_missing_features, 'r') as f:\r\n                shas_to_remove = json.load(f)\r\n            shas_to_remove = set(shas_to_remove)\r\n            vals = [value for value in vals if value[retrieve_ind['sha256']] not in shas_to_remove]\r\n            logger.info(f\"Dataset now has {len(vals)} samples.\")\r\n        self.keylist = list(map(lambda x: x[retrieve_ind['sha256']], vals))\r\n        if self.return_malicious:\r\n            self.labels = list(map(lambda x: x[retrieve_ind['is_malware']], vals))\r\n        if self.return_counts:\r\n            self.count_labels = list(map(lambda x: x[retrieve_ind['rl_ls_const_positives']], vals))\r\n        if self.return_tags:\r\n            self.tag_labels = np.asarray([list(map(lambda x: x[retrieve_ind[t]], vals)) for t in Dataset.tags]).T\r\n            if binarize_tag_labels:\r\n                self.tag_labels = (self.tag_labels != 0).astype(int)\r\n\r\n\r\n\r\n    def __len__(self):\r\n        
return len(self.keylist)\r\n\r\n    def __getitem__(self, index):\r\n        labels = {}\r\n        key = self.keylist[index]\r\n        features = self.features_lmdb_reader(key)\r\n        if self.return_malicious:\r\n            labels['malware'] = self.labels[index]\r\n        if self.return_counts:\r\n            labels['count'] = self.count_labels[index]\r\n        if self.return_tags:\r\n            labels['tags'] = self.tag_labels[index]\r\n        if self.return_shas:\r\n            return key, features, labels\r\n        else:\r\n            return features, labels\r\n\r\n\r\n\r\nif __name__ == '__main__':\r\n    baker.run()\r\n"
  },
  {
    "path": "environment.yml",
    "content": "name: sorel\nchannels:\n  - defaults\n  - pytorch\ndependencies:\n  - pip\n  - python=3.6\n  - pytorch::pytorch\n  - pytorch::torchvision\n  - cudatoolkit=10.1\n  - tqdm\n  - scikit-learn\n  - lightgbm\n  - matplotlib\n  - pandas\n  - pip:\n    - baker==1.3\n    - lmdb==0.98\n    - logzero==1.5.0\n    - msgpack==0.6.2\nprefix: /home/ubuntu/anaconda3/envs/sorel\n\n"
  },
  {
    "path": "evaluate.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\r\n# \r\n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\r\n# Sophos Limited and Sophos Group. All other product and company\r\n# names mentioned are trademarks or registered trademarks of their\r\n# respective owners.\r\n\r\n\r\nimport torch\r\nimport baker\r\nfrom nets import PENetwork\r\nfrom generators import get_generator\r\nimport tqdm\r\nimport os\r\nfrom config import device\r\nimport config\r\nfrom dataset import Dataset\r\nimport pickle\r\nfrom logzero import logger\r\nfrom copy import deepcopy\r\nimport pandas as pd\r\nimport numpy as np\r\n\r\nall_tags = Dataset.tags\r\n\r\ndef detach_and_copy_array(array):\r\n    if isinstance(array, torch.Tensor):\r\n        return deepcopy(array.cpu().detach().numpy()).ravel()\r\n    elif isinstance(array, np.ndarray):\r\n        return deepcopy(array).ravel()\r\n    else:\r\n        raise ValueError(\"Got array of unknown type {}\".format(type(array)))\r\n\r\ndef normalize_results(labels_dict, results_dict, use_malware=True, use_count=True, use_tags=True):\r\n    \"\"\"\r\n    Take a set of results dicts and break them out into\r\n    a single dict of 1d arrays with appropriate column names\r\n    that pandas can convert to a DataFrame.\r\n    \"\"\"\r\n    # we do a lot of deepcopy stuff here to avoid a FD \"leak\" in the dataset generator\r\n    # see here: https://github.com/pytorch/pytorch/issues/973#issuecomment-459398189\r\n    rv = {}\r\n    if use_malware:\r\n        rv['label_malware'] = detach_and_copy_array(labels_dict['malware'])\r\n        rv['pred_malware'] = detach_and_copy_array(results_dict['malware'])\r\n    if use_count:\r\n        rv['label_count'] = detach_and_copy_array(labels_dict['count'])\r\n        rv['pred_count'] = detach_and_copy_array(results_dict['count'])\r\n    if use_tags:\r\n        for column, tag in enumerate(all_tags):\r\n            rv[f'label_{tag}_tag'] = 
detach_and_copy_array(labels_dict['tags'][:, column])\r\n            rv[f'pred_{tag}_tag'] = detach_and_copy_array(results_dict['tags'][:, column])\r\n    return rv\r\n\r\n@baker.command\r\ndef evaluate_network(results_dir, checkpoint_file,\r\n                     db_path=config.db_path,\r\n                     evaluate_malware=True,\r\n                     evaluate_count=True,\r\n                     evaluate_tags=True,\r\n                     remove_missing_features='scan'):\r\n    \"\"\"\r\n    Take a trained feedforward neural network model and output evaluation results to a csv in the specified location.\r\n\r\n    :param results_dir: The directory to which to write the 'results.csv' file; WARNING -- this will overwrite any\r\n        existing results in that location\r\n    :param checkpoint_file: The checkpoint file containing the weights to evaluate\r\n    :param db_path: the path to the directory containing the meta.db file; defaults to the value in config.py\r\n    :param evaluate_malware: defaults to True; whether or not to record malware labels and predictions\r\n    :param evaluate_count: defaults to True; whether or not to record count labels and predictions\r\n    :param evaluate_tags: defaults to True; whether or not to record individual tag labels and predictions\r\n    :param remove_missing_features: See help for remove_missing_features in train.py / train_network\r\n    \"\"\"\r\n    os.makedirs(results_dir, exist_ok=True)\r\n    model = PENetwork(use_malware=True, use_counts=True, use_tags=True, n_tags=len(Dataset.tags),\r\n                      feature_dimension=2381)\r\n    model.load_state_dict(torch.load(checkpoint_file))\r\n    model.to(device)\r\n    # put the model in evaluation mode so dropout is disabled during inference\r\n    model.eval()\r\n    generator = get_generator(mode='test', path=db_path, use_malicious_labels=evaluate_malware,\r\n                              use_count_labels=evaluate_count,\r\n                              use_tag_labels=evaluate_tags, return_shas=True,\r\n                              remove_missing_features=remove_missing_features)\r\n    logger.info('...running network evaluation')\r\n    f = open(os.path.join(results_dir, 'results.csv'), 'w')\r\n    first_batch = True\r\n    # gradients aren't needed at evaluation time; no_grad avoids accumulating graph state\r\n    with torch.no_grad():\r\n        for shas, features, labels in tqdm.tqdm(generator):\r\n            features = features.to(device)\r\n            predictions = model(features)\r\n            results = normalize_results(labels, predictions)\r\n            pd.DataFrame(results, index=shas).to_csv(f, header=first_batch)\r\n            first_batch = False\r\n    f.close()\r\n    print('...done')\r\n\r\n\r\nimport lightgbm as lgb\r\n\r\n@baker.command\r\ndef evaluate_lgb(lightgbm_model_file,\r\n                 results_dir,\r\n                 db_path=config.db_path,\r\n                 remove_missing_features='scan'\r\n                 ):\r\n    \"\"\"\r\n    Take a trained lightGBM model and perform an evaluation on it. Results will be saved to\r\n    results.csv in the path specified in results_dir\r\n\r\n    :param lightgbm_model_file: Full path to the trained lightGBM model\r\n    :param results_dir: The directory to which to write the 'results.csv' file; WARNING -- this will overwrite any\r\n        existing results in that location\r\n    :param db_path: the path to the directory containing the meta.db file; defaults to the value in config.py\r\n    :param remove_missing_features: See help for remove_missing_features in train.py / train_network\r\n    \"\"\"\r\n    os.makedirs(results_dir, exist_ok=True)\r\n\r\n    logger.info(f'Loading lgb model from {lightgbm_model_file}')\r\n    model = lgb.Booster(model_file=lightgbm_model_file)\r\n    generator = get_generator(mode='test', path=db_path, use_malicious_labels=True,\r\n                              use_count_labels=False,\r\n                              use_tag_labels=False, return_shas=True,\r\n                              remove_missing_features=remove_missing_features)\r\n    logger.info('running lgb evaluation')\r\n    f = open(os.path.join(results_dir, 
'results.csv'), 'w')\r\n    first_batch = True\r\n    for shas, features, labels in tqdm.tqdm(generator):\r\n        predictions = {'malware':model.predict(features)}\r\n        results = normalize_results(labels, predictions, use_malware=True, use_count=False, use_tags=False)\r\n        pd.DataFrame(results, index=shas).to_csv(f, header=first_batch)\r\n        first_batch = False\r\n    f.close()\r\n    print('...done')\r\n\r\nif __name__ == '__main__':\r\n    baker.run()\r\n"
  },
  {
    "path": "generators.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\r\n# \r\n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\r\n# Sophos Limited and Sophos Group. All other product and company\r\n# names mentioned are trademarks or registered trademarks of their\r\n# respective owners.\r\n\r\n\r\nfrom dataset import Dataset\r\nimport os\r\nfrom torch.utils import data\r\nimport config\r\nfrom multiprocessing import cpu_count\r\n\r\nmax_workers = cpu_count()\r\n\r\n\r\nclass GeneratorFactory(object):\r\n    def __init__(self, ds_root, batch_size=None, mode='train', num_workers=max_workers, use_malicious_labels=False,\r\n                 use_count_labels=False, use_tag_labels=False, return_shas=False, features_lmdb='ember_features',\r\n                 remove_missing_features='scan', shuffle=None):\r\n        if mode not in {'train', 'validation', 'test'}:\r\n            raise ValueError('invalid mode {}'.format(mode))\r\n        ds = Dataset(metadb_path=os.path.join(ds_root, 'meta.db'),\r\n                     features_lmdb_path=os.path.join(ds_root, features_lmdb),\r\n                     return_malicious=use_malicious_labels,\r\n                     return_counts=use_count_labels,\r\n                     return_tags=use_tag_labels,\r\n                     return_shas=return_shas, mode=mode,\r\n                     remove_missing_features=remove_missing_features)\r\n        if batch_size is None:\r\n            batch_size = 1024\r\n        \r\n        # check passed in value for shuffle; pick a good one if it's None\r\n        if shuffle is not None:\r\n            if not ( (shuffle is True) or (shuffle is False)):\r\n                raise ValueError(f\"'shuffle' should be either True or False, got {shuffle}\")\r\n        else:\r\n            if mode=='train':shuffle=True\r\n            else:shuffle=False\r\n\r\n        params = {'batch_size': batch_size,\r\n                  'shuffle': shuffle,\r\n                  'num_workers': 
num_workers}\r\n\r\n        self.generator = data.DataLoader(ds, **params)\r\n\r\n    def __call__(self):\r\n        return self.generator\r\n\r\n\r\ndef get_generator(mode, path=config.db_path, use_malicious_labels=True, use_count_labels=True,\r\n                  use_tag_labels=True,\r\n                  batch_size=config.batch_size, return_shas=False,\r\n                  remove_missing_features='scan', num_workers=None, shuffle=None, \r\n                  feature_lmdb = 'ember_features'):\r\n    if num_workers is None:\r\n        num_workers = max_workers\r\n    return GeneratorFactory(path, batch_size=batch_size, mode=mode, num_workers=num_workers,\r\n                            use_malicious_labels=use_malicious_labels,\r\n                            use_count_labels=use_count_labels, use_tag_labels=use_tag_labels,\r\n                            return_shas=return_shas, remove_missing_features=remove_missing_features,\r\n                            shuffle=shuffle, features_lmdb=feature_lmdb)()\r\n\r\n"
  },
  {
    "path": "lightgbm_config.json",
    "content": "{\"objective\": \"binary\", \"task\": \"train\", \"boosting\": \"gbdt\", \"num_iterations\": 500, \"learning_rate\": 0.1, \"max_depth\": -1, \"num_leaves\": 64, \"tree_learner\": \"serial\", \"num_threads\": 0, \"device_type\": \"cpu\", \"seed\": 0, \"min_data_in_leaf\": 100, \"min_sum_hessian_in_leaf\": 0.001, \"bagging_fraction\": 0.9, \"bagging_freq\": 1, \"bagging_seed\": 0, \"feature_fraction\": 0.9, \"feature_fraction_bynode\": 0.9, \"feature_fraction_seed\": 0, \"early_stopping_rounds\": 10, \"first_metric_only\": true, \"max_delta_step\": 0, \"lambda_l1\": 0, \"lambda_l2\": 1.0, \"verbosity\": 2, \"is_unbalance\": true, \"sigmoid\": 1.0, \"boost_from_average\": true, \"metric\": [\"binary_logloss\", \"auc\", \"binary_error\"]}"
  },
  {
    "path": "nets.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\r\n# \r\n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\r\n# Sophos Limited and Sophos Group. All other product and company\r\n# names mentioned are trademarks or registered trademarks of their\r\n# respective owners.\r\n\r\n\r\nimport torch\r\nfrom torch import nn\r\nimport torch.nn.functional as F\r\n\r\nclass PENetwork(nn.Module):\r\n    \"\"\"\r\n    This is a simple network loosely based on the one used in ALOHA: Auxiliary Loss Optimization for Hypothesis Augmentation (https://arxiv.org/abs/1903.05700)\r\n\r\n    Note that it uses fewer (and smaller) layers, as well as a single layer for all tag predictions, performance will suffer accordingly.\r\n    \"\"\"\r\n    def __init__(self,use_malware=True,use_counts=True,use_tags=True,n_tags=None,feature_dimension=1024, layer_sizes = None):\r\n        self.use_malware=use_malware\r\n        self.use_counts=use_counts\r\n        self.use_tags=use_tags\r\n        self.n_tags = n_tags\r\n        if self.use_tags and self.n_tags == None:\r\n            raise ValueError(\"n_tags was None but we're trying to predict tags. 
Please include n_tags\")\r\n        super(PENetwork,self).__init__()\r\n        p = 0.05\r\n        layers = []\r\n        if layer_sizes is None:layer_sizes=[512,512,128]\r\n        for i,ls in enumerate(layer_sizes):\r\n            if i == 0:\r\n                layers.append(nn.Linear(feature_dimension,ls))\r\n            else:\r\n                layers.append(nn.Linear(layer_sizes[i-1],ls))\r\n            layers.append(nn.LayerNorm(ls))\r\n            layers.append(nn.ELU())\r\n            layers.append(nn.Dropout(p))\r\n        self.model_base = nn.Sequential(*tuple(layers))\r\n        self.malware_head = nn.Sequential(nn.Linear(layer_sizes[-1], 1),\r\n                                          nn.Sigmoid())\r\n        self.count_head = nn.Linear(layer_sizes[-1], 1)\r\n        self.sigmoid = nn.Sigmoid()\r\n        self.tag_head = nn.Sequential(nn.Linear(layer_sizes[-1],64),\r\n                                        nn.ELU(), \r\n                                        nn.Linear(64,64),\r\n                                        nn.ELU(),\r\n                                        nn.Linear(64,n_tags),\r\n                                        nn.Sigmoid())\r\n\r\n    def forward(self,data):\r\n        rv = {}\r\n        base_result = self.model_base.forward(data)\r\n        if self.use_malware:\r\n            rv['malware'] = self.malware_head(base_result)\r\n        if self.use_counts:\r\n            rv['count'] = self.count_head(base_result)\r\n        if self.use_tags:\r\n            rv['tags'] = self.tag_head(base_result)\r\n        return rv\r\n"
  },
  {
    "path": "pe_full_metadata_example/32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json",
    "content": "{\n    \"0\": {\n        \"DOS_HEADER\": {\n            \"Structure\": \"IMAGE_DOS_HEADER\",\n            \"e_magic\": {\n                \"FileOffset\": 0,\n                \"Offset\": 0,\n                \"Value\": 23117\n            },\n            \"e_cblp\": {\n                \"FileOffset\": 2,\n                \"Offset\": 2,\n                \"Value\": 144\n            },\n            \"e_cp\": {\n                \"FileOffset\": 4,\n                \"Offset\": 4,\n                \"Value\": 3\n            },\n            \"e_crlc\": {\n                \"FileOffset\": 6,\n                \"Offset\": 6,\n                \"Value\": 0\n            },\n            \"e_cparhdr\": {\n                \"FileOffset\": 8,\n                \"Offset\": 8,\n                \"Value\": 4\n            },\n            \"e_minalloc\": {\n                \"FileOffset\": 10,\n                \"Offset\": 10,\n                \"Value\": 0\n            },\n            \"e_maxalloc\": {\n                \"FileOffset\": 12,\n                \"Offset\": 12,\n                \"Value\": 65535\n            },\n            \"e_ss\": {\n                \"FileOffset\": 14,\n                \"Offset\": 14,\n                \"Value\": 0\n            },\n            \"e_sp\": {\n                \"FileOffset\": 16,\n                \"Offset\": 16,\n                \"Value\": 184\n            },\n            \"e_csum\": {\n                \"FileOffset\": 18,\n                \"Offset\": 18,\n                \"Value\": 0\n            },\n            \"e_ip\": {\n                \"FileOffset\": 20,\n                \"Offset\": 20,\n                \"Value\": 0\n            },\n            \"e_cs\": {\n                \"FileOffset\": 22,\n                \"Offset\": 22,\n                \"Value\": 0\n            },\n            \"e_lfarlc\": {\n                \"FileOffset\": 24,\n                \"Offset\": 24,\n                \"Value\": 64\n            },\n            \"e_ovno\": 
{\n                \"FileOffset\": 26,\n                \"Offset\": 26,\n                \"Value\": 0\n            },\n            \"e_res\": {\n                \"FileOffset\": 28,\n                \"Offset\": 28,\n                \"Value\": \"\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\"\n            },\n            \"e_oemid\": {\n                \"FileOffset\": 36,\n                \"Offset\": 36,\n                \"Value\": 0\n            },\n            \"e_oeminfo\": {\n                \"FileOffset\": 38,\n                \"Offset\": 38,\n                \"Value\": 0\n            },\n            \"e_res2\": {\n                \"FileOffset\": 40,\n                \"Offset\": 40,\n                \"Value\": \"\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\\\\x00\"\n            },\n            \"e_lfanew\": {\n                \"FileOffset\": 60,\n                \"Offset\": 60,\n                \"Value\": 128\n            }\n        },\n        \"NT_HEADERS\": {\n            \"Structure\": \"IMAGE_NT_HEADERS\",\n            \"Signature\": {\n                \"FileOffset\": 128,\n                \"Offset\": 0,\n                \"Value\": 17744\n            }\n        },\n        \"FILE_HEADER\": {\n            \"Structure\": \"IMAGE_FILE_HEADER\",\n            \"Machine\": {\n                \"FileOffset\": 132,\n                \"Offset\": 0,\n                \"Value\": 332\n            },\n            \"NumberOfSections\": {\n                \"FileOffset\": 134,\n                \"Offset\": 2,\n                \"Value\": 3\n            },\n            \"TimeDateStamp\": {\n                \"FileOffset\": 136,\n                \"Offset\": 4,\n                \"Value\": \"0x586FEBD9 [Fri Jan  6 19:11:21 2017 UTC]\"\n            },\n            \"PointerToSymbolTable\": {\n                \"FileOffset\": 140,\n                \"Offset\": 8,\n                
\"Value\": 0\n            },\n            \"NumberOfSymbols\": {\n                \"FileOffset\": 144,\n                \"Offset\": 12,\n                \"Value\": 0\n            },\n            \"SizeOfOptionalHeader\": {\n                \"FileOffset\": 148,\n                \"Offset\": 16,\n                \"Value\": 224\n            },\n            \"Characteristics\": {\n                \"FileOffset\": 150,\n                \"Offset\": 18,\n                \"Value\": 8450\n            }\n        },\n        \"Flags\": [\n            \"IMAGE_FILE_EXECUTABLE_IMAGE\",\n            \"IMAGE_FILE_32BIT_MACHINE\",\n            \"IMAGE_FILE_DLL\"\n        ],\n        \"OPTIONAL_HEADER\": {\n            \"Structure\": \"IMAGE_OPTIONAL_HEADER\",\n            \"Magic\": {\n                \"FileOffset\": 152,\n                \"Offset\": 0,\n                \"Value\": 267\n            },\n            \"MajorLinkerVersion\": {\n                \"FileOffset\": 154,\n                \"Offset\": 2,\n                \"Value\": 8\n            },\n            \"MinorLinkerVersion\": {\n                \"FileOffset\": 155,\n                \"Offset\": 3,\n                \"Value\": 0\n            },\n            \"SizeOfCode\": {\n                \"FileOffset\": 156,\n                \"Offset\": 4,\n                \"Value\": 7168\n            },\n            \"SizeOfInitializedData\": {\n                \"FileOffset\": 160,\n                \"Offset\": 8,\n                \"Value\": 1536\n            },\n            \"SizeOfUninitializedData\": {\n                \"FileOffset\": 164,\n                \"Offset\": 12,\n                \"Value\": 0\n            },\n            \"AddressOfEntryPoint\": {\n                \"FileOffset\": 168,\n                \"Offset\": 16,\n                \"Value\": 15278\n            },\n            \"BaseOfCode\": {\n                \"FileOffset\": 172,\n                \"Offset\": 20,\n                \"Value\": 8192\n            },\n          
  \"BaseOfData\": {\n                \"FileOffset\": 176,\n                \"Offset\": 24,\n                \"Value\": 16384\n            },\n            \"ImageBase\": {\n                \"FileOffset\": 180,\n                \"Offset\": 28,\n                \"Value\": 4194304\n            },\n            \"SectionAlignment\": {\n                \"FileOffset\": 184,\n                \"Offset\": 32,\n                \"Value\": 8192\n            },\n            \"FileAlignment\": {\n                \"FileOffset\": 188,\n                \"Offset\": 36,\n                \"Value\": 512\n            },\n            \"MajorOperatingSystemVersion\": {\n                \"FileOffset\": 192,\n                \"Offset\": 40,\n                \"Value\": 4\n            },\n            \"MinorOperatingSystemVersion\": {\n                \"FileOffset\": 194,\n                \"Offset\": 42,\n                \"Value\": 0\n            },\n            \"MajorImageVersion\": {\n                \"FileOffset\": 196,\n                \"Offset\": 44,\n                \"Value\": 0\n            },\n            \"MinorImageVersion\": {\n                \"FileOffset\": 198,\n                \"Offset\": 46,\n                \"Value\": 0\n            },\n            \"MajorSubsystemVersion\": {\n                \"FileOffset\": 200,\n                \"Offset\": 48,\n                \"Value\": 4\n            },\n            \"MinorSubsystemVersion\": {\n                \"FileOffset\": 202,\n                \"Offset\": 50,\n                \"Value\": 0\n            },\n            \"Reserved1\": {\n                \"FileOffset\": 204,\n                \"Offset\": 52,\n                \"Value\": 0\n            },\n            \"SizeOfImage\": {\n                \"FileOffset\": 208,\n                \"Offset\": 56,\n                \"Value\": 32768\n            },\n            \"SizeOfHeaders\": {\n                \"FileOffset\": 212,\n                \"Offset\": 60,\n                \"Value\": 
512\n            },\n            \"CheckSum\": {\n                \"FileOffset\": 216,\n                \"Offset\": 64,\n                \"Value\": 0\n            },\n            \"Subsystem\": {\n                \"FileOffset\": 220,\n                \"Offset\": 68,\n                \"Value\": 3\n            },\n            \"DllCharacteristics\": {\n                \"FileOffset\": 222,\n                \"Offset\": 70,\n                \"Value\": 34112\n            },\n            \"SizeOfStackReserve\": {\n                \"FileOffset\": 224,\n                \"Offset\": 72,\n                \"Value\": 1048576\n            },\n            \"SizeOfStackCommit\": {\n                \"FileOffset\": 228,\n                \"Offset\": 76,\n                \"Value\": 4096\n            },\n            \"SizeOfHeapReserve\": {\n                \"FileOffset\": 232,\n                \"Offset\": 80,\n                \"Value\": 1048576\n            },\n            \"SizeOfHeapCommit\": {\n                \"FileOffset\": 236,\n                \"Offset\": 84,\n                \"Value\": 4096\n            },\n            \"LoaderFlags\": {\n                \"FileOffset\": 240,\n                \"Offset\": 88,\n                \"Value\": 0\n            },\n            \"NumberOfRvaAndSizes\": {\n                \"FileOffset\": 244,\n                \"Offset\": 92,\n                \"Value\": 16\n            }\n        },\n        \"DllCharacteristics\": [\n            \"IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE\",\n            \"IMAGE_DLLCHARACTERISTICS_NX_COMPAT\",\n            \"IMAGE_DLLCHARACTERISTICS_NO_SEH\",\n            \"IMAGE_DLLCHARACTERISTICS_TERMINAL_SERVER_AWARE\"\n        ],\n        \"PE Sections\": [\n            {\n                \"Structure\": \"IMAGE_SECTION_HEADER\",\n                \"Name\": {\n                    \"FileOffset\": 376,\n                    \"Offset\": 0,\n                    \"Value\": \".text\\\\x00\\\\x00\\\\x00\"\n                },\n         
       \"Misc\": {\n                    \"FileOffset\": 384,\n                    \"Offset\": 8,\n                    \"Value\": 7092\n                },\n                \"Misc_PhysicalAddress\": {\n                    \"FileOffset\": 384,\n                    \"Offset\": 8,\n                    \"Value\": 7092\n                },\n                \"Misc_VirtualSize\": {\n                    \"FileOffset\": 384,\n                    \"Offset\": 8,\n                    \"Value\": 7092\n                },\n                \"VirtualAddress\": {\n                    \"FileOffset\": 388,\n                    \"Offset\": 12,\n                    \"Value\": 8192\n                },\n                \"SizeOfRawData\": {\n                    \"FileOffset\": 392,\n                    \"Offset\": 16,\n                    \"Value\": 7168\n                },\n                \"PointerToRawData\": {\n                    \"FileOffset\": 396,\n                    \"Offset\": 20,\n                    \"Value\": 512\n                },\n                \"PointerToRelocations\": {\n                    \"FileOffset\": 400,\n                    \"Offset\": 24,\n                    \"Value\": 0\n                },\n                \"PointerToLinenumbers\": {\n                    \"FileOffset\": 404,\n                    \"Offset\": 28,\n                    \"Value\": 0\n                },\n                \"NumberOfRelocations\": {\n                    \"FileOffset\": 408,\n                    \"Offset\": 32,\n                    \"Value\": 0\n                },\n                \"NumberOfLinenumbers\": {\n                    \"FileOffset\": 410,\n                    \"Offset\": 34,\n                    \"Value\": 0\n                },\n                \"Characteristics\": {\n                    \"FileOffset\": 412,\n                    \"Offset\": 36,\n                    \"Value\": 1610612768\n                },\n                \"Flags\": [\n                    
\"IMAGE_SCN_CNT_CODE\",\n                    \"IMAGE_SCN_MEM_EXECUTE\",\n                    \"IMAGE_SCN_MEM_READ\"\n                ],\n                \"Entropy\": 5.312053634802128,\n                \"MD5\": \"3622aa030f4a1f45fae880db94dd6e58\",\n                \"SHA1\": \"85d9dfca1ed27be856f1fccc6898940903582895\",\n                \"SHA256\": \"3c5510d2f545515f943715fe6d70a3df6b8d6d1b3e71aa50ab73a84de61be224\",\n                \"SHA512\": \"09e657ec9bd5bfa60b6449e63cdc61a8fe28c989789e14a8109db4e3cb242fdb258dac841fee24b1488b405e4dbd6e5caf78549764f2becb598394797a2cc141\"\n            },\n            {\n                \"Structure\": \"IMAGE_SECTION_HEADER\",\n                \"Name\": {\n                    \"FileOffset\": 416,\n                    \"Offset\": 0,\n                    \"Value\": \".rsrc\\\\x00\\\\x00\\\\x00\"\n                },\n                \"Misc\": {\n                    \"FileOffset\": 424,\n                    \"Offset\": 8,\n                    \"Value\": 688\n                },\n                \"Misc_PhysicalAddress\": {\n                    \"FileOffset\": 424,\n                    \"Offset\": 8,\n                    \"Value\": 688\n                },\n                \"Misc_VirtualSize\": {\n                    \"FileOffset\": 424,\n                    \"Offset\": 8,\n                    \"Value\": 688\n                },\n                \"VirtualAddress\": {\n                    \"FileOffset\": 428,\n                    \"Offset\": 12,\n                    \"Value\": 16384\n                },\n                \"SizeOfRawData\": {\n                    \"FileOffset\": 432,\n                    \"Offset\": 16,\n                    \"Value\": 1024\n                },\n                \"PointerToRawData\": {\n                    \"FileOffset\": 436,\n                    \"Offset\": 20,\n                    \"Value\": 7680\n                },\n                \"PointerToRelocations\": {\n                    \"FileOffset\": 440,\n      
              \"Offset\": 24,\n                    \"Value\": 0\n                },\n                \"PointerToLinenumbers\": {\n                    \"FileOffset\": 444,\n                    \"Offset\": 28,\n                    \"Value\": 0\n                },\n                \"NumberOfRelocations\": {\n                    \"FileOffset\": 448,\n                    \"Offset\": 32,\n                    \"Value\": 0\n                },\n                \"NumberOfLinenumbers\": {\n                    \"FileOffset\": 450,\n                    \"Offset\": 34,\n                    \"Value\": 0\n                },\n                \"Characteristics\": {\n                    \"FileOffset\": 452,\n                    \"Offset\": 36,\n                    \"Value\": 1073741888\n                },\n                \"Flags\": [\n                    \"IMAGE_SCN_CNT_INITIALIZED_DATA\",\n                    \"IMAGE_SCN_MEM_READ\"\n                ],\n                \"Entropy\": 2.2461341734636235,\n                \"MD5\": \"54371107ba38386c1a7c0d2f6e8cb71e\",\n                \"SHA1\": \"074a53a6946f543d03ec1f4bd60ed635db99db8a\",\n                \"SHA256\": \"54be4b9cb66d217fadb7e478d1c18994b044bf3b9fb07b967527f6bba3302bd5\",\n                \"SHA512\": \"9593dd761eeab980b8ff74288180c623e45e67aa98d0cbc2110b82aafeb98a5c0245f06c3c7c8e2256f97471daf1118a38af8ccddea63154ba98e6ee73a78735\"\n            },\n            {\n                \"Structure\": \"IMAGE_SECTION_HEADER\",\n                \"Name\": {\n                    \"FileOffset\": 456,\n                    \"Offset\": 0,\n                    \"Value\": \".reloc\\\\x00\\\\x00\"\n                },\n                \"Misc\": {\n                    \"FileOffset\": 464,\n                    \"Offset\": 8,\n                    \"Value\": 12\n                },\n                \"Misc_PhysicalAddress\": {\n                    \"FileOffset\": 464,\n                    \"Offset\": 8,\n                    \"Value\": 12\n         
       },\n                \"Misc_VirtualSize\": {\n                    \"FileOffset\": 464,\n                    \"Offset\": 8,\n                    \"Value\": 12\n                },\n                \"VirtualAddress\": {\n                    \"FileOffset\": 468,\n                    \"Offset\": 12,\n                    \"Value\": 24576\n                },\n                \"SizeOfRawData\": {\n                    \"FileOffset\": 472,\n                    \"Offset\": 16,\n                    \"Value\": 512\n                },\n                \"PointerToRawData\": {\n                    \"FileOffset\": 476,\n                    \"Offset\": 20,\n                    \"Value\": 8704\n                },\n                \"PointerToRelocations\": {\n                    \"FileOffset\": 480,\n                    \"Offset\": 24,\n                    \"Value\": 0\n                },\n                \"PointerToLinenumbers\": {\n                    \"FileOffset\": 484,\n                    \"Offset\": 28,\n                    \"Value\": 0\n                },\n                \"NumberOfRelocations\": {\n                    \"FileOffset\": 488,\n                    \"Offset\": 32,\n                    \"Value\": 0\n                },\n                \"NumberOfLinenumbers\": {\n                    \"FileOffset\": 490,\n                    \"Offset\": 34,\n                    \"Value\": 0\n                },\n                \"Characteristics\": {\n                    \"FileOffset\": 492,\n                    \"Offset\": 36,\n                    \"Value\": 1107296320\n                },\n                \"Flags\": [\n                    \"IMAGE_SCN_CNT_INITIALIZED_DATA\",\n                    \"IMAGE_SCN_MEM_DISCARDABLE\",\n                    \"IMAGE_SCN_MEM_READ\"\n                ],\n                \"Entropy\": 0.08153941234324169,\n                \"MD5\": \"d9e08422d3077fe0be94f8ec16840100\",\n                \"SHA1\": \"c4f4de8e850f478b5f7aedc719098b7283936488\",\n      
          \"SHA256\": \"54543210ea95b9f1c287825f3361e06499aa1f6f32967ddd70efe5086694efd5\",\n                \"SHA512\": \"757186cbe73a86467b3cd9dc69b3363806fbe88332ff07948399404ff2c8f3d36e603a230992c09cdb0776bca5bcc58296540d913bed535ab1ddcab1a71ee276\"\n            }\n        ],\n        \"Directories\": [\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_EXPORT\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 248,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 252,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_IMPORT\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 256,\n                    \"Offset\": 0,\n                    \"Value\": 15192\n                },\n                \"Size\": {\n                    \"FileOffset\": 260,\n                    \"Offset\": 4,\n                    \"Value\": 83\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_RESOURCE\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 264,\n                    \"Offset\": 0,\n                    \"Value\": 16384\n                },\n                \"Size\": {\n                    \"FileOffset\": 268,\n                    \"Offset\": 4,\n                    \"Value\": 688\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_EXCEPTION\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 272,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 276,\n                    \"Offset\": 4,\n                    \"Value\": 
0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_SECURITY\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 280,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 284,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_BASERELOC\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 288,\n                    \"Offset\": 0,\n                    \"Value\": 24576\n                },\n                \"Size\": {\n                    \"FileOffset\": 292,\n                    \"Offset\": 4,\n                    \"Value\": 12\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_DEBUG\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 296,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 300,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_COPYRIGHT\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 304,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 308,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_GLOBALPTR\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 312,\n                    \"Offset\": 0,\n                    \"Value\": 0\n   
             },\n                \"Size\": {\n                    \"FileOffset\": 316,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_TLS\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 320,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 324,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 328,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 332,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 336,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 340,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_IAT\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 344,\n                    \"Offset\": 0,\n                    \"Value\": 8192\n                },\n                \"Size\": {\n                    \"FileOffset\": 348,\n                    \"Offset\": 4,\n                    \"Value\": 8\n                }\n            },\n            {\n                \"Structure\": 
\"IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 352,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 356,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 360,\n                    \"Offset\": 0,\n                    \"Value\": 8200\n                },\n                \"Size\": {\n                    \"FileOffset\": 364,\n                    \"Offset\": 4,\n                    \"Value\": 72\n                }\n            },\n            {\n                \"Structure\": \"IMAGE_DIRECTORY_ENTRY_RESERVED\",\n                \"VirtualAddress\": {\n                    \"FileOffset\": 368,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"Size\": {\n                    \"FileOffset\": 372,\n                    \"Offset\": 4,\n                    \"Value\": 0\n                }\n            }\n        ],\n        \"Version Information\": [\n            [\n                {\n                    \"Structure\": \"VS_VERSIONINFO\",\n                    \"Length\": {\n                        \"FileOffset\": 7768,\n                        \"Offset\": 0,\n                        \"Value\": 600\n                    },\n                    \"ValueLength\": {\n                        \"FileOffset\": 7770,\n                        \"Offset\": 2,\n                        \"Value\": 52\n                    },\n                    \"Type\": {\n                        \"FileOffset\": 7772,\n                        \"Offset\": 4,\n                        \"Value\": 0\n                    }\n                },\n                {\n         
           \"Structure\": \"VS_FIXEDFILEINFO\",\n                    \"Signature\": {\n                        \"FileOffset\": 7808,\n                        \"Offset\": 0,\n                        \"Value\": 4277077181\n                    },\n                    \"StrucVersion\": {\n                        \"FileOffset\": 7812,\n                        \"Offset\": 4,\n                        \"Value\": 65536\n                    },\n                    \"FileVersionMS\": {\n                        \"FileOffset\": 7816,\n                        \"Offset\": 8,\n                        \"Value\": 720896\n                    },\n                    \"FileVersionLS\": {\n                        \"FileOffset\": 7820,\n                        \"Offset\": 12,\n                        \"Value\": 0\n                    },\n                    \"ProductVersionMS\": {\n                        \"FileOffset\": 7824,\n                        \"Offset\": 16,\n                        \"Value\": 720896\n                    },\n                    \"ProductVersionLS\": {\n                        \"FileOffset\": 7828,\n                        \"Offset\": 20,\n                        \"Value\": 0\n                    },\n                    \"FileFlagsMask\": {\n                        \"FileOffset\": 7832,\n                        \"Offset\": 24,\n                        \"Value\": 63\n                    },\n                    \"FileFlags\": {\n                        \"FileOffset\": 7836,\n                        \"Offset\": 28,\n                        \"Value\": 0\n                    },\n                    \"FileOS\": {\n                        \"FileOffset\": 7840,\n                        \"Offset\": 32,\n                        \"Value\": 4\n                    },\n                    \"FileType\": {\n                        \"FileOffset\": 7844,\n                        \"Offset\": 36,\n                        \"Value\": 2\n                    },\n                    
\"FileSubtype\": {\n                        \"FileOffset\": 7848,\n                        \"Offset\": 40,\n                        \"Value\": 0\n                    },\n                    \"FileDateMS\": {\n                        \"FileOffset\": 7852,\n                        \"Offset\": 44,\n                        \"Value\": 0\n                    },\n                    \"FileDateLS\": {\n                        \"FileOffset\": 7856,\n                        \"Offset\": 48,\n                        \"Value\": 0\n                    }\n                }\n            ]\n        ],\n        \"Imported symbols\": [\n            [\n                {\n                    \"Structure\": \"IMAGE_IMPORT_DESCRIPTOR\",\n                    \"OriginalFirstThunk\": {\n                        \"FileOffset\": 7512,\n                        \"Offset\": 0,\n                        \"Value\": 15232\n                    },\n                    \"Characteristics\": {\n                        \"FileOffset\": 7512,\n                        \"Offset\": 0,\n                        \"Value\": 15232\n                    },\n                    \"TimeDateStamp\": {\n                        \"FileOffset\": 7516,\n                        \"Offset\": 4,\n                        \"Value\": \"0x0        [Thu Jan  1 00:00:00 1970 UTC]\"\n                    },\n                    \"ForwarderChain\": {\n                        \"FileOffset\": 7520,\n                        \"Offset\": 8,\n                        \"Value\": 0\n                    },\n                    \"Name\": {\n                        \"FileOffset\": 7524,\n                        \"Offset\": 12,\n                        \"Value\": 15262\n                    },\n                    \"FirstThunk\": {\n                        \"FileOffset\": 7528,\n                        \"Offset\": 16,\n                        \"Value\": 8192\n                    }\n                },\n                {\n                    \"DLL\": 
\"mscoree.dll\",\n                    \"Name\": \"_CorDllMain\",\n                    \"Hint\": 0\n                }\n            ]\n        ],\n        \"Resource directory\": [\n            {\n                \"Structure\": \"IMAGE_RESOURCE_DIRECTORY\",\n                \"Characteristics\": {\n                    \"FileOffset\": 7680,\n                    \"Offset\": 0,\n                    \"Value\": 0\n                },\n                \"TimeDateStamp\": {\n                    \"FileOffset\": 7684,\n                    \"Offset\": 4,\n                    \"Value\": \"0x0        [Thu Jan  1 00:00:00 1970 UTC]\"\n                },\n                \"MajorVersion\": {\n                    \"FileOffset\": 7688,\n                    \"Offset\": 8,\n                    \"Value\": 0\n                },\n                \"MinorVersion\": {\n                    \"FileOffset\": 7690,\n                    \"Offset\": 10,\n                    \"Value\": 0\n                },\n                \"NumberOfNamedEntries\": {\n                    \"FileOffset\": 7692,\n                    \"Offset\": 12,\n                    \"Value\": 0\n                },\n                \"NumberOfIdEntries\": {\n                    \"FileOffset\": 7694,\n                    \"Offset\": 14,\n                    \"Value\": 1\n                }\n            },\n            {\n                \"Id\": [\n                    16,\n                    \"RT_VERSION\"\n                ],\n                \"Structure\": \"IMAGE_RESOURCE_DIRECTORY_ENTRY\",\n                \"Name\": {\n                    \"FileOffset\": 7696,\n                    \"Offset\": 0,\n                    \"Value\": 16\n                },\n                \"OffsetToData\": {\n                    \"FileOffset\": 7700,\n                    \"Offset\": 4,\n                    \"Value\": 2147483672\n                }\n            },\n            [\n                {\n                    \"Structure\": 
\"IMAGE_RESOURCE_DIRECTORY\",\n                    \"Characteristics\": {\n                        \"FileOffset\": 7704,\n                        \"Offset\": 0,\n                        \"Value\": 0\n                    },\n                    \"TimeDateStamp\": {\n                        \"FileOffset\": 7708,\n                        \"Offset\": 4,\n                        \"Value\": \"0x0        [Thu Jan  1 00:00:00 1970 UTC]\"\n                    },\n                    \"MajorVersion\": {\n                        \"FileOffset\": 7712,\n                        \"Offset\": 8,\n                        \"Value\": 0\n                    },\n                    \"MinorVersion\": {\n                        \"FileOffset\": 7714,\n                        \"Offset\": 10,\n                        \"Value\": 0\n                    },\n                    \"NumberOfNamedEntries\": {\n                        \"FileOffset\": 7716,\n                        \"Offset\": 12,\n                        \"Value\": 0\n                    },\n                    \"NumberOfIdEntries\": {\n                        \"FileOffset\": 7718,\n                        \"Offset\": 14,\n                        \"Value\": 1\n                    }\n                },\n                {\n                    \"Id\": 1,\n                    \"Structure\": \"IMAGE_RESOURCE_DIRECTORY_ENTRY\",\n                    \"Name\": {\n                        \"FileOffset\": 7720,\n                        \"Offset\": 0,\n                        \"Value\": 1\n                    },\n                    \"OffsetToData\": {\n                        \"FileOffset\": 7724,\n                        \"Offset\": 4,\n                        \"Value\": 2147483696\n                    }\n                },\n                [\n                    {\n                        \"Structure\": \"IMAGE_RESOURCE_DIRECTORY\",\n                        \"Characteristics\": {\n                            \"FileOffset\": 7728,\n            
                \"Offset\": 0,\n                            \"Value\": 0\n                        },\n                        \"TimeDateStamp\": {\n                            \"FileOffset\": 7732,\n                            \"Offset\": 4,\n                            \"Value\": \"0x0        [Thu Jan  1 00:00:00 1970 UTC]\"\n                        },\n                        \"MajorVersion\": {\n                            \"FileOffset\": 7736,\n                            \"Offset\": 8,\n                            \"Value\": 0\n                        },\n                        \"MinorVersion\": {\n                            \"FileOffset\": 7738,\n                            \"Offset\": 10,\n                            \"Value\": 0\n                        },\n                        \"NumberOfNamedEntries\": {\n                            \"FileOffset\": 7740,\n                            \"Offset\": 12,\n                            \"Value\": 0\n                        },\n                        \"NumberOfIdEntries\": {\n                            \"FileOffset\": 7742,\n                            \"Offset\": 14,\n                            \"Value\": 1\n                        }\n                    },\n                    {\n                        \"LANG\": 0,\n                        \"SUBLANG\": 0,\n                        \"LANG_NAME\": \"LANG_NEUTRAL\",\n                        \"SUBLANG_NAME\": \"SUBLANG_NEUTRAL\",\n                        \"Structure\": \"IMAGE_RESOURCE_DATA_ENTRY\",\n                        \"Name\": {\n                            \"FileOffset\": 7744,\n                            \"Offset\": 0,\n                            \"Value\": 0\n                        },\n                        \"OffsetToData\": {\n                            \"FileOffset\": 7752,\n                            \"Offset\": 0,\n                            \"Value\": 16472\n                        },\n                        \"Size\": {\n                
            \"FileOffset\": 7756,\n                            \"Offset\": 4,\n                            \"Value\": 600\n                        },\n                        \"CodePage\": {\n                            \"FileOffset\": 7760,\n                            \"Offset\": 8,\n                            \"Value\": 0\n                        },\n                        \"Reserved\": {\n                            \"FileOffset\": 7764,\n                            \"Offset\": 12,\n                            \"Value\": 0\n                        }\n                    }\n                ]\n            ]\n        ],\n        \"Base relocations\": [\n            [\n                {\n                    \"Structure\": \"IMAGE_BASE_RELOCATION\",\n                    \"VirtualAddress\": {\n                        \"FileOffset\": 8704,\n                        \"Offset\": 0,\n                        \"Value\": 12288\n                    },\n                    \"SizeOfBlock\": {\n                        \"FileOffset\": 8708,\n                        \"Offset\": 4,\n                        \"Value\": 12\n                    }\n                },\n                {\n                    \"RVA\": 15280,\n                    \"Type\": \"HIGHLOW\"\n                },\n                {\n                    \"RVA\": 12288,\n                    \"Type\": \"ABSOLUTE\"\n                }\n            ]\n        ]\n    }\n}\n"
  },
  {
    "path": "plot.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\n# \n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\n# Sophos Limited and Sophos Group. All other product and company\n# names mentioned are trademarks or registered trademarks of their\n# respective owners.\n\n\nimport baker\nimport matplotlib\n\nmatplotlib.use('Agg')\nfrom matplotlib import pyplot as plt\nfrom sklearn.metrics import roc_auc_score, roc_curve\nimport pandas as pd\nimport numpy as np\nimport json\n\ndefault_tags = ['adware_tag', 'flooder_tag', 'ransomware_tag',\n                'dropper_tag', 'spyware_tag',\n                'packed_tag', 'crypto_miner_tag',\n                'file_infector_tag', 'installer_tag',\n                'worm_tag', 'downloader_tag']\ndefault_tag_colors = ['r', 'r', 'r', 'g', 'g', 'b', 'b', 'm', 'm', 'c', 'c']\ndefault_tag_linestyles = [':', '--', '-.', ':', '--', ':', '--', ':', '--', ':', '--']\n\nstyle_dict = {tag: (color, linestyle) for tag, color, linestyle in zip(default_tags,\n                                                                       default_tag_colors,\n                                                                       default_tag_linestyles)}\n\nstyle_dict['malware'] = ('k', '-')\n\n\ndef collect_dataframes(run_id_to_filename_dictionary):\n    loaded_dataframes = {}\n    for k, v in run_id_to_filename_dictionary.items():\n        loaded_dataframes[k] = pd.read_csv(v)\n    return loaded_dataframes\n\n\ndef get_tprs_at_fpr(result_dataframe, key, target_fprs=None):\n    \"\"\"\n    Estimate the True Positive Rate for a dataframe/key combination\n    at specific False Positive Rates of interest.\n    :param result_dataframe: a pandas dataframe\n    :param key: the name of the result to get the curve for; if (e.g.) 
the key 'malware' is provided\n    the dataframe is expected to have columns named `pred_malware` and `label_malware`\n    :param target_fprs: the FPRs at which to estimate the TPRs; either None (which uses the default\n    np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])) or a 1-d numpy array\n    :return: target_fprs, the corresponding TPRs\n    \"\"\"\n    if target_fprs is None:\n        target_fprs = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])\n    fpr, tpr, thresholds = get_roc_curve(result_dataframe, key)\n    return target_fprs, np.interp(target_fprs, fpr, tpr)\n\n\ndef get_roc_curve(result_dataframe, key):\n    \"\"\"\n    Get the ROC curve for a single result in a dataframe\n    :param result_dataframe: a dataframe\n    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is provided\n    the dataframe is expected to have columns named `pred_malware` and `label_malware`\n    :return: false positive rates, true positive rates, and thresholds (all np.arrays)\n    \"\"\"\n    labels = result_dataframe['label_{}'.format(key)]\n    predictions = result_dataframe['pred_{}'.format(key)]\n    return roc_curve(labels, predictions)\n\n\ndef get_auc_score(result_dataframe, key):\n    \"\"\"\n    Get the Area Under the Curve for the indicated key in the dataframe\n    :param result_dataframe: a dataframe\n    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is provided\n    the dataframe is expected to have columns named `pred_malware` and `label_malware`\n    :return: the AUC for the ROC generated for the provided key\n    \"\"\"\n    labels = result_dataframe['label_{}'.format(key)]\n    predictions = result_dataframe['pred_{}'.format(key)]\n    return roc_auc_score(labels, predictions)\n\n\ndef interpolate_rocs(id_to_roc_dictionary, eval_fpr_points=None):\n    \"\"\"\n    This function takes several sets of ROC results and interpolates them to a common set of\n    evaluation (FPR) values to allow for computing e.g. 
a mean ROC or pointwise variance of the curve\n    across multiple model fittings.\n    :param id_to_roc_dictionary: a dictionary mapping run IDs to results from get_roc_curve (or\n    sklearn.metrics.roc_curve), i.e. of the form {run_id: (fpr, tpr, thresholds), ...}\n    :param eval_fpr_points: the set of FPR values at which to interpolate the results; defaults to\n    `np.logspace(-6, 0, 1000)`\n    :return:\n        eval_fpr_points  -- the set of common points to which TPRs have been interpolated\n        interpolated_tprs -- a dictionary mapping each run ID to the TPRs of that ROC interpolated at\n    the corresponding points in eval_fpr_points\n    \"\"\"\n    if eval_fpr_points is None:\n        eval_fpr_points = np.logspace(-6, 0, 1000)\n\n    interpolated_tprs = {}\n\n    for k, (fpr, tpr, thresh) in id_to_roc_dictionary.items():\n        interpolated_tprs[k] = np.interp(eval_fpr_points, fpr, tpr)\n\n    return eval_fpr_points, interpolated_tprs\n\n\ndef plot_roc_with_confidence(id_to_dataframe_dictionary, key, filename, include_range=False, style=None, std_alpha=.2,\n                             range_alpha=.1):\n    \"\"\"\n    Compute the mean and standard deviation of the ROC curve from a sequence of results\n    and plot it with shading.\n    \"\"\"\n    if not len(id_to_dataframe_dictionary) > 1:\n        raise ValueError(\"Need a minimum of 2 result sets to plot confidence region; found {}\".format(\n            len(id_to_dataframe_dictionary)\n        ))\n    if style is None:\n        if key in style_dict:\n            color, linestyle = style_dict[key]\n        else:\n            raise ValueError(\n                \"No default style information is available for key {}; please provide (linestyle, color)\".format(key))\n    else:\n        linestyle, color = style\n    id_to_roc_dictionary = {k: get_roc_curve(df, key) for k, df in id_to_dataframe_dictionary.items()}\n    fpr_points, interpolated_tprs = interpolate_rocs(id_to_roc_dictionary)\n    
tpr_array = np.vstack([v for v in interpolated_tprs.values()])\n    mean_tpr = tpr_array.mean(0)\n    std_tpr = np.sqrt(tpr_array.var(0))\n\n    aucs = np.array([get_auc_score(v, key) for v in id_to_dataframe_dictionary.values()])\n    mean_auc = aucs.mean()\n    min_auc = aucs.min()\n    max_auc = aucs.max()\n    std_auc = np.sqrt(aucs.var())\n\n    plt.figure(figsize=(12, 12))\n    plt.semilogx(fpr_points, mean_tpr, color + linestyle, linewidth=2.0,\n                 label=f\"{key}: {mean_auc:5.3f}$\\pm${std_auc:5.3f} [{min_auc:5.3f}-{max_auc:5.3f}]\")\n    plt.fill_between(fpr_points, mean_tpr - std_tpr, mean_tpr + std_tpr, color=color, alpha=std_alpha)\n    if include_range:\n        plt.fill_between(fpr_points, tpr_array.min(0), tpr_array.max(0), color=color, alpha=range_alpha)\n    plt.legend()\n    plt.xlim(1e-6, 1.0)\n    plt.ylim([0., 1.])\n    plt.xlabel('False Positive Rate (FPR)')\n    plt.ylabel('True Positive Rate (TPR)')\n    plt.savefig(filename)\n    plt.clf()\n\n\ndef plot_tag_results(dataframe, filename):\n    all_tag_rocs = {tag: get_roc_curve(dataframe, tag) for tag in default_tags}\n    eval_fpr_pts, interpolated_rocs = interpolate_rocs(all_tag_rocs)\n\n    plt.figure(figsize=(12, 12))\n    for tag in default_tags:\n        color, linestyle = style_dict[tag]\n        auc = get_auc_score(dataframe, tag)\n        plt.semilogx(eval_fpr_pts, interpolated_rocs[tag], color + linestyle, linewidth=2.0,\n                     label=f\"{tag}:{auc:5.3f}\")\n    plt.legend(loc='best')\n    plt.xlim(1e-6, 1.0)\n    plt.ylim([0., 1.])\n    plt.xlabel('False Positive Rate (FPR)')\n    plt.ylabel('True Positive Rate (TPR)')\n    plt.savefig(filename)\n    plt.clf()\n\n\n@baker.command\ndef plot_tag_result(results_file, output_filename):\n    \"\"\"\n    Takes a result file from a feedforward neural network model that includes all\n    tags, and produces multiple overlaid ROC plots for each tag individually.\n\n    :param results_file: complete path to a 
results.csv file that contains the output of \n        a model run.  Note that the model must have been trained with --use_tag_labels=True\n        and evaluated using --evaluate_tags=True.\n    :param output_filename: the name of the file in which to save the resulting plot.\n    \"\"\"\n    id_to_resultfile_dict = {'run': results_file}\n    id_to_dataframe_dict = collect_dataframes(id_to_resultfile_dict)\n    plot_tag_results(id_to_dataframe_dict['run'], output_filename)\n\n\n@baker.command\ndef plot_roc_distribution_for_tag(run_to_filename_json, output_filename, tag_to_plot='malware', linestyle=None, color=None,\n                                  include_range=False, std_alpha=.2, range_alpha=.1):\n    \"\"\"\n    Compute the mean and standard deviation of the TPR at a range of FPRs (the ROC curve)\n    over several sets of results for a given tag.  The run_to_filename_json file must have\n    the following format:\n    {\"run_id_0\": \"/full/path/to/results.csv/for/run/0/results.csv\",\n     \"run_id_1\": \"/full/path/to/results.csv/for/run/1/results.csv\",\n      ...\n    }\n    \n    :param run_to_filename_json: A json file that contains a key-value map that links run IDs to\n        the full path to a results file (including the file name)\n    :param output_filename: The filename to save the resulting figure to\n    :param tag_to_plot: the tag from the results to plot; defaults to \"malware\"\n    :param linestyle: the linestyle to use in the plot (defaults to the tag value in \n        plot.style_dict)\n    :param color: the color to use in the plot (defaults to the tag value in \n        plot.style_dict)\n    :param include_range: plot the min/max value as well (default False)\n    :param std_alpha: the alpha value for the shading for standard deviation range\n        (default 0.2)\n    :param range_alpha: the alpha value for the shading for range, if plotted\n        (default 0.1)\n    \"\"\"\n    id_to_resultfile_dict = 
json.load(open(run_to_filename_json, 'r'))\n    id_to_dataframe_dict = collect_dataframes(id_to_resultfile_dict)\n    if color is None or linestyle is None:\n        if not (color is None and linestyle is None):\n            raise ValueError(\"both color and linestyle should either be specified or None\")\n        style = None\n    else:\n        style = (linestyle, color)  # plot_roc_with_confidence unpacks style as (linestyle, color)\n    plot_roc_with_confidence(id_to_dataframe_dict, tag_to_plot, output_filename, include_range=include_range, style=style,\n                             std_alpha=std_alpha, range_alpha=range_alpha)\n\n\nif __name__ == '__main__':\n    baker.run()\n
  },
  {
    "path": "train.py",
    "content": "# Copyright 2020, Sophos Limited. All rights reserved.\r\n# \r\n# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of\r\n# Sophos Limited and Sophos Group. All other product and company\r\n# names mentioned are trademarks or registered trademarks of their\r\n# respective owners.\r\n\r\n\r\nfrom dataset import Dataset\r\nfrom nets import PENetwork\r\nimport warnings\r\nimport os\r\nimport baker\r\nimport torch\r\nimport torch.nn.functional as F\r\nfrom torch.utils import data\r\nimport sys\r\nfrom generators import get_generator\r\nfrom config import device\r\nimport config\r\nfrom logzero import logger\r\nfrom copy import deepcopy\r\nimport numpy as np\r\nfrom collections import defaultdict\r\nfrom sklearn.metrics import roc_auc_score\r\nimport pickle\r\nimport json\r\nimport lightgbm as lgb\r\n\r\n\r\ndef compute_loss(predictions, labels, loss_wts={'malware': 1.0, 'count': 0.1, 'tags': 0.1}):\r\n    \"\"\"\r\n    Compute losses for a malware feed-forward neural network (optionally with SMART tags \r\n    and vendor detection count auxiliary losses).\r\n\r\n    :param predictions: a dictionary of results from a PENetwork model\r\n    :param labels: a dictionary of labels \r\n    :param loss_wts: weights to assign to each head of the network (if it exists); defaults to \r\n        values used in the ALOHA paper (1.0 for malware, 0.1 for count and each tag)\r\n    \"\"\"\r\n    loss_dict = {'total':0.}\r\n    if 'malware' in labels:\r\n        malware_labels = labels['malware'].float().to(device)\r\n        malware_loss = F.binary_cross_entropy(predictions['malware'].reshape(malware_labels.shape), malware_labels)\r\n        weight = loss_wts['malware'] if 'malware' in loss_wts else 1.0\r\n        loss_dict['malware'] = deepcopy(malware_loss.item())\r\n        loss_dict['total'] += malware_loss * weight\r\n    if 'count' in labels:\r\n        count_labels = labels['count'].float().to(device)\r\n        count_loss = 
torch.nn.PoissonNLLLoss()(predictions['count'].reshape(count_labels.shape), count_labels)\r\n        weight = loss_wts['count'] if 'count' in loss_wts else 1.0\r\n        loss_dict['count'] = deepcopy(count_loss.item())\r\n        loss_dict['total'] += count_loss * weight\r\n    if 'tags' in labels:\r\n        tag_labels = labels['tags'].float().to(device)\r\n        tags_loss = F.binary_cross_entropy(predictions['tags'], tag_labels)\r\n        weight = loss_wts['tags'] if 'tags' in loss_wts else 1.0\r\n        loss_dict['tags'] = deepcopy(tags_loss.item())\r\n        loss_dict['total'] += tags_loss * weight\r\n    return loss_dict\r\n\r\n\r\n@baker.command\r\ndef train_network(train_db_path=config.db_path,\r\n                  checkpoint_dir=config.checkpoint_dir,\r\n                  max_epochs=10,\r\n                  use_malicious_labels=True,\r\n                  use_count_labels=True,\r\n                  use_tag_labels=True,\r\n                  feature_dimension=2381,\r\n                  random_seed=None, \r\n                  workers = None,\r\n                  remove_missing_features='scan'):\r\n    \"\"\"\r\n    Train a feed-forward neural network on EMBER 2.0 features, optionally with additional targets as\r\n    described in the ALOHA paper (https://arxiv.org/abs/1903.05700).  
SMART tags are based on\r\n    (https://arxiv.org/abs/1905.06262).\r\n    \r\n\r\n    :param train_db_path: Path in which the meta.db is stored; defaults to the value specified in `config.py`\r\n    :param checkpoint_dir: Directory in which to save model checkpoints; WARNING -- this will overwrite any existing checkpoints without warning.\r\n    :param max_epochs: How many epochs to train for; defaults to 10\r\n    :param use_malicious_labels: Whether or not to use malware/benignware labels as a target; defaults to True\r\n    :param use_count_labels: Whether or not to use the counts as an additional target; defaults to True\r\n    :param use_tag_labels: Whether or not to use SMART tags as additional targets; defaults to True\r\n    :param feature_dimension: The input dimension of the model; defaults to 2381 (EMBER 2.0 feature size)\r\n    :param random_seed: if provided, seed random number generation with this value (defaults to None, no seeding)\r\n    :param workers: How many worker processes the dataloader should use (default None, use multiprocessing.cpu_count())\r\n    :param remove_missing_features: Strategy for removing missing samples (entries in meta.db with no associated\r\n        features, e.g. feature extraction failures) from the data.  \r\n        Must be one of: 'scan', 'none', or a path to a missing keys file.  \r\n        Setting to 'scan' (default) will check all entries in the LMDB and remove any keys that are missing -- safe but slow. \r\n        Setting to 'none' will not perform a check, but may lead to a run failure if any features are missing.  
Setting to\r\n        a path will attempt to load a json-serialized list of SHA256 values from the specified file, indicating which\r\n        keys are missing and should be removed from the dataloader.\r\n    \"\"\"\r\n    workers = workers if workers is None else int(workers)\r\n    os.system('mkdir -p {}'.format(checkpoint_dir))\r\n    if random_seed is not None:\r\n        logger.info(f\"Setting random seed to {int(random_seed)}.\")\r\n        torch.manual_seed(int(random_seed))\r\n    logger.info('...instantiating network')\r\n    model = PENetwork(use_malware=True, use_counts=True, use_tags=True, n_tags=len(Dataset.tags),\r\n                      feature_dimension=feature_dimension).to(device)\r\n    opt = torch.optim.Adam(model.parameters())\r\n    generator = get_generator(path=train_db_path,\r\n                              mode='train',\r\n                              use_malicious_labels=use_malicious_labels,\r\n                              use_count_labels=use_count_labels,\r\n                              use_tag_labels=use_tag_labels,\r\n                              num_workers = workers,\r\n                              remove_missing_features=remove_missing_features)\r\n    val_generator = get_generator(path = train_db_path,\r\n                                  mode='validation', \r\n                                  use_malicious_labels=use_malicious_labels,\r\n                                  use_count_labels=use_count_labels,\r\n                                  use_tag_labels=use_tag_labels,\r\n                                  num_workers=workers,\r\n                                  remove_missing_features=remove_missing_features)\r\n    steps_per_epoch = len(generator)\r\n    val_steps_per_epoch = len(val_generator)\r\n    for epoch in range(1, max_epochs + 1):\r\n        loss_histories = defaultdict(list)\r\n        model.train()\r\n        for i, (features, labels) in enumerate(generator):\r\n            opt.zero_grad()\r\n            
features = deepcopy(features).to(device)\r\n            out = model(features)\r\n            loss_dict = compute_loss(out, deepcopy(labels))\r\n            loss = loss_dict['total']\r\n            loss.backward()\r\n            opt.step()\r\n            for k in loss_dict.keys():\r\n                if k == 'total': loss_histories[k].append(deepcopy(loss_dict[k].detach().cpu().item()))\r\n                else: loss_histories[k].append(loss_dict[k])\r\n            loss_str = \" \".join([f\"{key} loss:{value:7.3f}\" for key, value in loss_dict.items()])\r\n            loss_str += \" | \"\r\n            loss_str += \" \".join([f\"{key} mean:{np.mean(value):7.3f}\" for key, value in loss_histories.items()])\r\n            sys.stdout.write('\\r Epoch: {}/{} {}/{} '.format(epoch, max_epochs, i + 1, steps_per_epoch) + loss_str)\r\n            sys.stdout.flush()\r\n            del features, labels # do our best to avoid weird references that lead to generator errors\r\n        torch.save(model.state_dict(), os.path.join(checkpoint_dir, \"epoch_{}.pt\".format(str(epoch))))\r\n        print()\r\n        loss_histories = defaultdict(list)\r\n        model.eval()\r\n        for i, (features, labels) in enumerate(val_generator):\r\n            features = deepcopy(features).to(device)\r\n            with torch.no_grad():\r\n                out = model(features)\r\n            loss_dict = compute_loss(out, deepcopy(labels))\r\n            loss = loss_dict['total']\r\n            for k in loss_dict.keys():\r\n                if k == 'total': loss_histories[k].append(deepcopy(loss_dict[k].detach().cpu().item()))\r\n                else: loss_histories[k].append(loss_dict[k])\r\n            loss_str = \" \".join([f\"{key} loss:{value:7.3f}\" for key, value in loss_dict.items()])\r\n            loss_str += \" | \"\r\n            loss_str += \" \".join([f\"{key} mean:{np.mean(value):7.3f}\" for key, value in loss_histories.items()])\r\n            sys.stdout.write('\\r   Val: {}/{} 
{}/{} '.format(epoch, max_epochs, i + 1, val_steps_per_epoch) + loss_str)\r\n            sys.stdout.flush()\r\n            del features, labels # do our best to avoid weird references that lead to generator errors\r\n        print() \r\n    print('...done')\r\n\r\n\r\n\r\n@baker.command\r\ndef train_lightGBM(train_npz_file, validation_npz_file, model_configuration_file, checkpoint_dir,\r\n                   random_seed=None):\r\n    \"\"\"\r\n    Train a lightGBM model.  Note that this is done entirely in-memory and requires a substantial \r\n    amount of RAM (approximately 175GB).  Baseline models were trained on an Amazon m5.24xlarge instance.\r\n\r\n    :param train_npz_file: path to a .npz file containing features in 'arr_0' and labels in 'arr_1' for the training data\r\n    :param validation_npz_file: path to a .npz file containing features in 'arr_0' and labels in 'arr_1' for the validation data\r\n    :param model_configuration_file: path to a json file specifying lightGBM parameters (see lightgbm_config.json for an example)\r\n    :param checkpoint_dir: location to write the trained model to\r\n    :param random_seed: defaults to None (no seeding); otherwise an integer providing a fixed random seed for the experiment.\r\n    \"\"\"\r\n    logger.info(\"Loading model config json file...\")\r\n    config = json.load(open(model_configuration_file, 'r'))\r\n    if random_seed is not None:\r\n        random_seed = int(random_seed)\r\n        config['seed'] = random_seed\r\n        config['bagging_seed'] = random_seed\r\n        config['feature_fraction_seed'] = random_seed\r\n    logger.info(\"Loading train data...\")\r\n    train_npz = np.load(train_npz_file)\r\n    train_fts, train_lbls = train_npz['arr_0'], train_npz['arr_1']\r\n    val_npz = np.load(validation_npz_file)\r\n    val_fts, val_lbls = val_npz['arr_0'], val_npz['arr_1']\r\n    logger.info(\"Converting data to lgb.Dataset\")\r\n    train_data = lgb.Dataset(train_fts, label=train_lbls)\r\n    val_data 
= lgb.Dataset(val_fts, label=val_lbls)\r\n    logger.info(\"Starting training\")\r\n\r\n    bst = lgb.train(params=config, train_set=train_data, valid_sets=[val_data])\r\n\r\n    os.system('mkdir -p {}'.format(checkpoint_dir))\r\n    modelfile = os.path.join(checkpoint_dir, 'lightgbm.model')\r\n    logger.info(f\"Saving model to {modelfile}\")\r\n    bst.save_model(modelfile)\r\n\r\n\r\n\r\nif __name__ == '__main__': \r\n    baker.run()\r\n"
  }
]