Repository: sophos/SOREL-20M Branch: master Commit: 3664addd05c4 Files: 14 Total size: 21.8 MB Directory structure: gitextract__adqnldz/ ├── LICENSE.md ├── README.md ├── build_numpy_arrays_for_lightgbm.py ├── config.py ├── dataset.py ├── environment.yml ├── evaluate.py ├── generators.py ├── lightgbm_config.json ├── nets.py ├── pe_full_metadata_example/ │ └── 32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json ├── plot.py ├── shas_missing_ember_features.json └── train.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: LICENSE.md ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. 
For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. 
You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) 
The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright 2020 Sophos PLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: README.md
================================================

[SoReL-20M](#SoReL-20M)
[Terms of use](#terms-of-use)
[Requirements](#Requirements)
[Downloading the data](#downloading-the-data)
[A note on dataset size](#a-note-on-dataset-size)
[Quickstart](#Quickstart)
[Neural network training](#neural-network-training)
[LightGBM training](#lightgbm-training)
[Frequently Asked Questions](#frequently-asked-questions)
[Copyright and License](#copyright-and-license)

# SoReL-20M

Sophos-ReversingLabs 20 Million dataset

The code included in this repository produced the baseline models available at `s3://sorel-20m/09-DEC-2020/baselines`

This code depends on the SOREL dataset available via Amazon S3 at `s3://sorel-20m/09-DEC-2020/processed-data/`; to train the lightGBM models you can use the npz files available at `s3://sorel-20m/09-DEC-2020/lightGBM-features/` or use the scripts included here to extract the required files from the processed data.

If you use this code or this data in your own research, please cite our paper "SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection", found at https://arxiv.org/abs/2012.07634, using the following citation:

```
@misc{harang2020sorel20m,
      title={SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection},
      author={Richard Harang and Ethan M. Rudd},
      year={2020},
      eprint={2012.07634},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```

# Terms of use

Please read the [Terms of Use](https://github.com/sophos-ai/SOREL-20M/blob/master/Terms%20and%20Conditions%20of%20Use.pdf) before using this code or accessing the data.

# Requirements

Python 3.6+. See `environment.yml` for additional package requirements.

# Downloading the data

Individual files are available directly via https; e.g. you can download one of the baseline checkpoints via web at the url `http://sorel-20m.s3.amazonaws.com/09-DEC-2020/baselines/checkpoints/FFNN/seed0/epoch_1.pt`

For a large number of files, we recommend using the [AWS command line interface](https://aws.amazon.com/cli/). The SOREL-20M S3 bucket is public, so no credentials are required. For example, to download all feedforward neural network checkpoints for all seeds, use the command `aws s3 cp s3://sorel-20m/09-DEC-2020/baselines/checkpoints/FFNN/ . --recursive`

It is possible to download the entire dataset this way; however, we strongly recommend reading about the [dataset size](#a-note-on-dataset-size) before doing so and ensuring that you will not incur bandwidth fees or exhaust your available disk space in so doing.

# A note on dataset size

The full size of this dataset is approximately 8TB. It is highly recommended that you only obtain the specific elements you need. Files larger than 1GB are noted below.

```
s3://sorel-20m/09-DEC-2020/
|   Terms and Conditions of Use.pdf -- the terms you agree to by using this data and code
|
+---baselines
|   +---checkpoints
|   |   +---FFNN - per-epoch checkpoints for 5 seeds of the feed-forward neural network
|   |   +---lightGBM - final trained lightGBM model for 5 seeds
|   |
|   +---results
|   |   |   ffnn_results.json - index file of results, required for plotting
|   |   |   lgbm_results.json - index file of results, required for plotting
|   |   |
|   |   +---FFNN
|   |   |   +---seed0-seed4 - individual seed results, ~1GB each
|   |   |
|   |   +---lightgbm
|   |       +---seed0-seed4 - individual seed results, ~1GB each
|
+---binaries
|       approximately 8TB of zlib compressed malware binaries
|
+---lightGBM-features
|       test-features.npz - array of test data for lightGBM; 37GB
|       train-features.npz - array of training data for lightGBM; 113GB
|       validation-features.npz - array of validation data for lightGBM; 22GB
|
+---processed-data
    |   meta.db - contains index, labels, tags, and counts for the data; 3.5GB
    |
    +---ember_features - LMDB directory with baseline features, ~72GB
    +---pe_metadata - LMDB directory with full metadata dumps, ~480GB
```

Note: values in the LMDB files are serialized via msgpack and compressed via zlib; the code below handles this extraction automatically, however you will need to decompress and deserialize by hand if you use your own code to handle the data (see the sketch below). Please see the file `./pe_full_metadata_example/32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json` for an example of the metadata contained in the pe_metadata lmdb database.
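As a minimal sketch of that by-hand extraction (mirroring what `dataset.LMDBReader` does; the database path and `sha` value are placeholders for your own):

```python
import lmdb
import msgpack
import zlib
import numpy as np

sha = '...'  # placeholder: any sha256 key present in meta.db

# Open the ember_features LMDB read-only and fetch the entry for that key.
env = lmdb.open('/dataset/SoReL20M/ember_features', readonly=True, max_readers=1024)
with env.begin() as txn:
    raw = txn.get(sha.encode('ascii'))  # None if this sample has no features

# Each value is a zlib-compressed msgpack dict of the form {0: [list of floats]}.
value = msgpack.loads(zlib.decompress(raw), strict_map_key=False)
features = np.asarray(value[0], dtype=np.float32)  # 2381-dim Ember v2 vector
```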
# Quickstart

The main scripts of interest are:

1. `train.py` for training deep learning or (on a machine with sufficient RAM) LightGBM models
2. `evaluate.py` for taking a pretrained model and producing a results csv
3. `plot.py` for plotting the results

All scripts have multiple commands, documented via `--help`

Once you have cloned the repository, enter the repository directory and create a conda environment:

```
cd SoReL-20M
conda env create -f environment.yml
conda activate sorel
```

Ensure that you have the SOREL processed data in a local directory. Edit `config.py` to indicate the device to use (CPU or CUDA) as well as the dataset location and desired checkpoint directory. The dataset location should point to the folder that contains the `meta.db` file.

*Please note*: the complete contents of processed-data require approximately 552 GB of disk space, the bulk of which is the PE metadata and not used in training the baseline models. If you only wish to retrain the baseline models, then you will need only the following files (approximately 78GB in total):

```
/meta.db
/ember_features/data.mdb
/ember_features/lock.mdb
```

The file `shas_missing_ember_features.json` within this repository contains a list of sha256 values that indicate samples for which no Ember v2 feature values could be extracted; it is _highly recommended_ that the location of this file be passed to the `--remove_missing_features` parameter in `train.train_network`, `evaluate.evaluate_network`, and `evaluate.evaluate_lgb` to significantly speed up the data loading time. If it is not provided, you should specify `--remove_missing_features='scan'`, which will scan all keys and remove any with missing features before building the dataloader; otherwise, the dataloader will raise an error when it reaches a sample with missing features.
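Before launching a long job, you can sanity-check that the data is in place by counting rows in `meta.db` directly; a minimal sketch (the path is a placeholder, and the table/column names and timestamp come from `dataset.py` and `config.py`):

```python
import sqlite3

# Count training-split samples: rl_fs_t at or before train_validation_split.
conn = sqlite3.connect('/dataset/SoReL20M/meta.db')  # placeholder path
n_train = conn.execute(
    'select count(*) from meta where rl_fs_t <= 1543542570.0').fetchone()[0]
print(f'{n_train} training samples')
conn.close()
```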
You can train a neural network model with the following (note that config.py values can be overridden via command line switches):

```
python train.py train_network --remove_missing_features=shas_missing_ember_features.json
```

Assuming that the checkpoint has been written to /home/ubuntu/checkpoints/ and you wish to place the results.csv file in /home/ubuntu/results/0, you may produce a test set evaluation as follows:

```
python evaluate.py evaluate_network /home/ubuntu/results/0 /home/ubuntu/checkpoints/epoch_9.pt
```

To enable plotting of multiple series, the `plot.plot_roc_distribution_for_tag` function requires a json file that maps the name for a particular run to the results.csv file for that run.

```
# Re-plot baselines -- note that the below command assumes
# that the baseline models at s3://sorel-20m/09-DEC-2020/baselines
# have been downloaded to the /baselines directory
python plot.py plot_roc_distribution_for_tag /baselines/results/ffnn_results.json ./ffnn_results.png
```

# Neural network training

While a GPU allows for faster training (10 epochs can be completed in approximately 90 minutes), this model can also be trained via CPU; the provided results were obtained via GPU on an Amazon g3.4xlarge EC2 instance starting with a "Deep Learning AMI (Ubuntu 16.04) Version 26.0 (ami-025ed45832b817a35)" and updating it as above. In practice, disk I/O loading features from the feature database seems to be the rate-limiting step assuming a GPU is used, so running on a machine with multiple cores and using a drive with high IOPS is recommended. Training the network requires approximately 12GB of RAM when trained via CPU, though it varies slightly with the number of cores. It is also highly recommended to use the `--remove_missing_features=shas_missing_ember_features.json` option, as this significantly improves loading time of the data.

Note: if you get an error message `RuntimeError: received 0 items of ancdata`, this is typically caused by the limit on the maximum number of open files being too low; this may be increased via the `ulimit` command. In some cases -- if you use a large number of parallel workers -- it may also be necessary to increase shared memory.

The commands to train and evaluate a neural network model are

```
python train.py train_network
python evaluate.py evaluate_network
```

Use `--help` for either script to see details and options. The model itself is given in `nets.PENetwork`

# LightGBM training

Due to the size of the dataset, training a boosted model is difficult. We use lightGBM, which has relatively memory-efficient data handlers, allowing it to fit a model in-memory using approximately 175GB of RAM. The lightGBM model provided in this repository was trained on an Amazon m5.24xlarge instance.

The script `build_numpy_arrays_for_lightgbm.py` will take the training/validation/testing datasets and split them into three .npz files in the specified data location that can then be used for training a LightGBM model. Please note that these files will be extremely large (113GB, 23GB, and 38GB, respectively) using the provided Ember features. Alternatively, you may use the pre-extracted npz files available at `s3://sorel-20m/09-DEC-2020/lightGBM-features/` which contain Ember features using the default time splits for training, validation, and testing.
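If you build the archives yourself with the script above, a minimal loading sketch follows. This assumes the layout written by `dump_data_to_numpy`: `np.savez` is called with positional arguments, so the per-batch arrays land under numpy's default keys `arr_0` (features) and `arr_1` (labels); the pre-extracted S3 archives may be laid out differently.

```python
import numpy as np

# Stack the per-batch feature and label arrays into flat arrays.
with np.load('validation-features.npz', allow_pickle=True) as data:
    X = np.concatenate(list(data['arr_0']))
    y = np.concatenate(list(data['arr_1']))
print(X.shape, y.shape)
```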
The lightGBM model can be trained in much the same manner as the neural network:

```
python train.py train_lightGBM --train_npz_file=/dataset/train-features.npz --validation_npz_file=/dataset/validation-features.npz --model_configuration_file=./lightgbm_config.json --checkpoint_dir=/dataset/baselines/checkpoints/lightGBM/run0/
```

Assuming that you've placed the S3 dataset in /dataset as suggested above, this command will perform a single evaluation run.

```
python evaluate.py evaluate_lgb /dataset/baselines/checkpoints/lightGBM/seed0/lightgbm.model /home/ubuntu/lightgbm_eval --remove_missing_features=./shas_missing_ember_features.json
```

The function used to generate the numpy array files from the database is `build_numpy_arrays_for_lightgbm.dump_data_to_numpy`. Note that this script requires approximately as much memory as training the model; an m5.24xlarge or equivalent EC2 instance type is recommended.

# Frequently Asked Questions

**Are there any benign samples available?**

Unfortunately, due to the risk of intellectual property violations, we are not able to make the benign samples freely available. The samples are available via ReversingLabs, and anecdotally a large number of them also appear to be available via VirusTotal. We are not able to provide any further assistance in this respect.

**I computed the SHA256 for a malware sample and it's different from the SHA256 value suggested by the file name; why?**

All malware samples have been disarmed as described below; the SHA256 value in the file name is for the original, unmodified file.

**How were the files disarmed?**

The OptionalHeader.Subsystem flag and the FileHeader.Machine header value were both set to 0 to prevent accidental execution of the files.
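For illustration, this is easy to confirm with the third-party `pefile` package (not part of this repository); a sketch, where the file path is a placeholder and the download is zlib-compressed as noted in the directory listing above:

```python
import pefile
import zlib

# Samples under binaries/ are stored zlib-compressed; decompress first,
# then check the two header fields that were zeroed when disarming.
with open('path/to/downloaded_sample', 'rb') as f:  # placeholder path
    pe = pefile.PE(data=zlib.decompress(f.read()))
print(pe.FILE_HEADER.Machine)        # 0 for a disarmed sample
print(pe.OPTIONAL_HEADER.Subsystem)  # 0 for a disarmed sample
```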
**Can you provide a tool to re-arm the files, or the original non-disarmed file?**

Unfortunately, we cannot assist anyone in re-arming the files or in obtaining the original, non-disarmed samples. As with the benign files, they are available via ReversingLabs, and a large number of them also appear to be available via VirusTotal.

**How are the malware/benign labels determined?**

We use a combination of non-public, internal information as well as a number of static rules and analyses to obtain the ground truth labels.

**Isn't releasing this data dangerous?**

As we describe in our [blog post](https://ai.sophos.com/2020/12/14/sophos-reversinglabs-sorel-20-million-sample-malware-dataset/):

> The malware we’re releasing is “disarmed” so that it will not execute. This means it would take knowledge, skill, and time to reconstitute the samples and get them to actually run. That said, we recognize that there is at least some possibility that a skilled attacker could learn techniques from these samples or use samples from the dataset to assemble attack tools to use as part of their malicious activities. However, in reality, there are already many other sources attackers could leverage to gain access to malware information and samples that are easier, faster and more cost effective to use. In other words, this disarmed sample set will have much more value to researchers looking to improve and develop their independent defenses than it will have to attackers.

**Is the feature extraction code available for me to apply to my own samples?**

The feature extraction function is available from the [EMBER repository](https://github.com/elastic/ember/) -- specifically, we used the `PEFeatureExtractor.feature_vector()` method in [features.py](https://github.com/elastic/ember/blob/master/ember/features.py). We parallelized this code and constructed the dataset using Sophos AI internal tools, and are unable to provide this code; please see below for some notes on feature extraction and extending the dataset.

**How can I add additional files/features to the dataset?**

We are not accepting additional data for the main dataset. To add new features, files, or both to your own personal copy of it, we have the following recommendations:

1. The `meta.db` sqlite file serves as the index for the LMDB database, and contains metadata and labels. At a minimum, for each file, the sqlite database should contain columns for: the file sha256, the malware label, and a first-seen timestamp.
2. To serialize a feature vector to a LMDB database, each individual sample's feature vector needs to be encoded into a dictionary with a key of zero and a value that is a 1-d list of floats, then serialized via msgpack and compressed via zlib, then inserted into an LMDB database with a key as the hash of the original file (see the sketch below). If you are extracting new features for the existing files, it's important to note that the filenames of the samples are the sha256 values of the original, non-disarmed files, and so you should just re-use that filename rather than compute the hash of the file yourself.
3. We obtained best performance for feature extraction using RAM disks wherever possible -- for the files that features are being extracted from at a minimum, and if memory permits, for the LMDB databases as well.
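A minimal sketch of recommendation 2, with the database path, `sha`, and `feature_vector` as placeholders:

```python
import lmdb
import msgpack
import zlib

sha = '...'              # placeholder: sha256 taken from the original file's name
feature_vector = [...]   # placeholder: 1-d list of floats for this sample

env = lmdb.open('/path/to/your_features_lmdb', map_size=int(1e13))
with env.begin(write=True) as txn:
    blob = zlib.compress(msgpack.dumps({0: list(feature_vector)}))
    txn.put(sha.encode('ascii'), blob)
```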
**What are the .npz files and how do they differ from the LMDB data?**

The .npz files in the lightGBM-features directory contain features that are identical to the features in the LMDB database (with training, validation, and test splits given as per the timestamps in `config.py`) but converted to flat numpy arrays for convenience in training the lightGBM models. They contain only binary labels, no tag information.

**The values for the "tag" columns are counts, not binary values; why?**

As described in our [paper](https://arxiv.org/abs/1905.06262) on the tag generation, we parse vendor threat feed information for tokens indicative of the behavioral category of the malware; the values in these columns denote the number of tokens we identified for that tag for that sample. They may be taken as correlated with the degree of certainty in the tag, but are not calibrated to a standard scale. For most applications we suggest binarizing this value by zero/non-zero.

# Copyright and License

Copyright 2020, Sophos Limited. All rights reserved.

'Sophos' and 'Sophos Anti-Virus' are registered trademarks of Sophos Limited and Sophos Group. All other product and company names mentioned are trademarks or registered trademarks of their respective owners.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: build_numpy_arrays_for_lightgbm.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

import baker
from copy import deepcopy
import sys
import numpy as np
from config import validation_test_split, train_validation_split, db_path
from generators import get_generator


@baker.command
def dump_data_to_numpy(mode, output_file, workers=1, batchsize=1000, remove_missing_features='scan'):
    """
    Produce numpy files required for training lightgbm model from SQLite + LMDB database.

    :param mode: One of 'train', 'validation', or 'test' representing which set of the data to
        process to file. Splits are obtained based on timestamps in config.py
    :param output_file: The name of the output file to produce for the indicated split.
    :param workers: How many worker processes to use (default 1)
    :param batchsize: The batch size to use in collecting samples (default 1000)
    :param remove_missing_features: How to check for and remove missing features; see README.md
        for recommendations (default 'scan')
    """
    _generator = get_generator(path=db_path,
                               mode=mode,
                               batch_size=batchsize,
                               use_malicious_labels=True,
                               use_count_labels=False,
                               use_tag_labels=False,
                               num_workers=workers,
                               remove_missing_features=remove_missing_features,
                               shuffle=False)
    feature_array = []
    label_array = []
    for i, (features, labels) in enumerate(_generator):
        feature_array.append(deepcopy(features.numpy()))
        label_array.append(deepcopy(labels['malware'].numpy()))
        sys.stdout.write(f"\r{i} / {len(_generator)}")
        sys.stdout.flush()
    np.savez(output_file, feature_array, label_array)
    print(f"\nWrote output to {output_file}")


if __name__ == '__main__':
    baker.run()

================================================
FILE: config.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

# set this to the desired device, e.g. 'cuda:0' if a GPU is available
device = 'cuda:0'
# device = 'cpu'

# NOTE -- if you change the below values, your results will not be comparable with those from
# other users of this data set.
# This is the timestamp that divides the validation data (used to check convergence/overfitting)
# from test data (used to assess final performance)
validation_test_split = 1547279640.0
# This is the timestamp that splits training data from validation data
train_validation_split = 1543542570.0

# modify these paths as needed to point to the directory that contains the meta.db file
# and to indicate where the checkpoints should be placed during model training
db_path = '/dataset/SoReL20M'
checkpoint_dir = '/dataset/checkpoints'

# adjust the batch size as needed given memory/bus constraints
batch_size = 8192
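# For reference, the two split timestamps above correspond roughly to
# 2018-11-30 UTC (train/validation) and 2019-01-12 UTC (validation/test);
# e.g. datetime.datetime.utcfromtimestamp(train_validation_split) will confirm.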
================================================
FILE: dataset.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

from torch.utils import data
import lmdb
import sqlite3
import baker
import msgpack
import zlib
import numpy as np
import os
import tqdm
from logzero import logger
import config
import json


class LMDBReader(object):
    def __init__(self, path, postproc_func=None):
        self.env = lmdb.open(path, readonly=True, map_size=1e13, max_readers=1024)
        self.postproc_func = postproc_func

    def __call__(self, key):
        with self.env.begin() as txn:
            x = txn.get(key.encode('ascii'))
        if x is None:
            return None
        x = msgpack.loads(zlib.decompress(x), strict_map_key=False)
        if self.postproc_func is not None:
            x = self.postproc_func(x)
        return x


def features_postproc_func(x):
    x = np.asarray(x[0], dtype=np.float32)
    lz = x < 0
    gz = x > 0
    x[lz] = -np.log(1 - x[lz])
    x[gz] = np.log(1 + x[gz])
    return x


def tags_postproc_func(x):
    x = list(x[b'labels'].values())
    x = np.asarray(x)
    return x
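# Note: features_postproc_func above applies a signed log transform,
# x -> sign(x) * log(1 + |x|), compressing the dynamic range of the raw
# Ember feature values before they reach the model.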
logger.info(f"Trying to load shas to ignore from {remove_missing_features}...") with open(remove_missing_features, 'r') as f: shas_to_remove = json.load(f) shas_to_remove = set(shas_to_remove) vals = [value for value in vals if value[retrieve_ind['sha256']] not in shas_to_remove] logger.info(f"Dataset now has {len(vals)} samples.") self.keylist = list(map(lambda x: x[retrieve_ind['sha256']], vals)) if self.return_malicious: self.labels = list(map(lambda x: x[retrieve_ind['is_malware']], vals)) if self.return_counts: self.count_labels = list(map(lambda x: x[retrieve_ind['rl_ls_const_positives']], vals)) if self.return_tags: self.tag_labels = np.asarray([list(map(lambda x: x[retrieve_ind[t]], vals)) for t in Dataset.tags]).T if binarize_tag_labels: self.tag_labels = (self.tag_labels != 0).astype(int) def __len__(self): return len(self.keylist) def __getitem__(self, index): labels = {} key = self.keylist[index] features = self.features_lmdb_reader(key) if self.return_malicious: labels['malware'] = self.labels[index] if self.return_counts: labels['count'] = self.count_labels[index] if self.return_tags: labels['tags'] = self.tag_labels[index] if self.return_shas: return key, features, labels else: return features, labels if __name__ == '__main__': baker.run() ================================================ FILE: environment.yml ================================================ name: sorel channels: - defaults - pytorch dependencies: - pip - python=3.6 - pytorch::pytorch - pytorch::torchvision - cudatoolkit=10.1 - tqdm - scikit-learn - lightgbm - matplotlib - pandas - pip: - baker==1.3 - lmdb==0.98 - logzero==1.5.0 - msgpack==0.6.2 prefix: /home/ubuntu/anaconda3/envs/sorel ================================================ FILE: evaluate.py ================================================ # Copyright 2020, Sophos Limited. All rights reserved. # # 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of # Sophos Limited and Sophos Group. All other product and company # names mentioned are trademarks or registered trademarks of their # respective owners. import torch import baker from nets import PENetwork from generators import get_generator import tqdm import os from config import device import config from dataset import Dataset import pickle from logzero import logger from copy import deepcopy import pandas as pd import numpy as np all_tags = Dataset.tags def detach_and_copy_array(array): if isinstance(array, torch.Tensor): return deepcopy(array.cpu().detach().numpy()).ravel() elif isinstance(array, np.ndarray): return deepcopy(array).ravel() else: raise ValueError("Got array of unknown type {}".format(type(array))) def normalize_results(labels_dict, results_dict, use_malware=True, use_count=True, use_tags=True): """ Take a set of results dicts and break them out into a single dict of 1d arrays with appropriate column names that pandas can convert to a DataFrame. 
""" # we do a lot of deepcopy stuff here to avoid a FD "leak" in the dataset generator # see here: https://github.com/pytorch/pytorch/issues/973#issuecomment-459398189 rv = {} if use_malware: rv['label_malware'] = detach_and_copy_array(labels_dict['malware']) rv['pred_malware'] = detach_and_copy_array(results_dict['malware']) if use_count: rv['label_count'] = detach_and_copy_array(labels_dict['count']) rv['pred_count'] = detach_and_copy_array(results_dict['count']) if use_tags: for column, tag in enumerate(all_tags): rv[f'label_{tag}_tag'] = detach_and_copy_array(labels_dict['tags'][:, column]) rv[f'pred_{tag}_tag']=detach_and_copy_array(results_dict['tags'][:, column]) return rv @baker.command def evaluate_network(results_dir, checkpoint_file, db_path=config.db_path, evaluate_malware=True, evaluate_count=True, evaluate_tags=True, remove_missing_features='scan'): """ Take a trained feedforward neural network model and output evaluation results to a csv in the specified location. :param results_dir: The directory to which to write the 'results.csv' file; WARNING -- this will overwrite any existing results in that location :param checkpoint_file: The checkpoint file containing the weights to evaluate :param db_path: the path to the directory containing the meta.db file; defaults to the value in config.py :param evaluate_malware: defaults to True; whether or not to record malware labels and predictions :param evaluate_count: defaults to True; whether or not to record count labels and predictions :param evaluate_tags: defaults to True; whether or not to record individual tag labels and predictions :param remove_missing_features: See help for remove_missing_features in train.py / train_network """ os.system('mkdir -p {}'.format(results_dir)) model = PENetwork(use_malware=True, use_counts=True, use_tags=True, n_tags=len(Dataset.tags), feature_dimension=2381) model.load_state_dict(torch.load(checkpoint_file)) model.to(device) generator = get_generator(mode='test', path=db_path, use_malicious_labels=evaluate_malware, use_count_labels=evaluate_count, use_tag_labels=evaluate_tags, return_shas=True, remove_missing_features=remove_missing_features) logger.info('...running network evaluation') f = open(os.path.join(results_dir,'results.csv'),'w') first_batch = True for shas, features, labels in tqdm.tqdm(generator): features = features.to(device) predictions = model(features) results = normalize_results(labels, predictions) pd.DataFrame(results, index=shas).to_csv(f, header=first_batch) first_batch=False f.close() print('...done') import lightgbm as lgb @baker.command def evaluate_lgb(lightgbm_model_file, results_dir, db_path=config.db_path, remove_missing_features='scan' ): """ Take a trained lightGBM model and perform an evaluation on it. 
================================================
FILE: generators.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

from dataset import Dataset
import os
from torch.utils import data
import config
from multiprocessing import cpu_count

max_workers = cpu_count()


class GeneratorFactory(object):
    def __init__(self, ds_root, batch_size=None, mode='train', num_workers=max_workers,
                 use_malicious_labels=False, use_count_labels=False, use_tag_labels=False,
                 return_shas=False, features_lmdb='ember_features',
                 remove_missing_features='scan', shuffle=None):
        if mode not in {'train', 'validation', 'test'}:
            raise ValueError('invalid mode {}'.format(mode))
        ds = Dataset(metadb_path=os.path.join(ds_root, 'meta.db'),
                     features_lmdb_path=os.path.join(ds_root, features_lmdb),
                     return_malicious=use_malicious_labels,
                     return_counts=use_count_labels,
                     return_tags=use_tag_labels,
                     return_shas=return_shas,
                     mode=mode,
                     remove_missing_features=remove_missing_features)
        if batch_size is None:
            batch_size = 1024
        # check passed-in value for shuffle; pick a sensible default if it's None
        if shuffle is not None:
            if not ((shuffle is True) or (shuffle is False)):
                raise ValueError(f"'shuffle' should be either True or False, got {shuffle}")
        else:
            if mode == 'train':
                shuffle = True
            else:
                shuffle = False
        params = {'batch_size': batch_size,
                  'shuffle': shuffle,
                  'num_workers': num_workers}
        self.generator = data.DataLoader(ds, **params)

    def __call__(self):
        return self.generator


def get_generator(mode, path=config.db_path, use_malicious_labels=True, use_count_labels=True,
                  use_tag_labels=True, batch_size=config.batch_size, return_shas=False,
                  remove_missing_features='scan', num_workers=None, shuffle=None,
                  feature_lmdb='ember_features'):
    if num_workers is None:
        num_workers = max_workers
    return GeneratorFactory(path, batch_size=batch_size, mode=mode, num_workers=num_workers,
                            use_malicious_labels=use_malicious_labels,
                            use_count_labels=use_count_labels,
                            use_tag_labels=use_tag_labels,
                            return_shas=return_shas,
                            remove_missing_features=remove_missing_features,
                            shuffle=shuffle,
                            features_lmdb=feature_lmdb)()
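# Illustrative usage: build a small validation loader and pull one batch.
#   gen = get_generator(mode='validation', batch_size=64, num_workers=4,
#                       remove_missing_features='shas_missing_ember_features.json')
#   features, labels = next(iter(gen))  # features: torch.Size([64, 2381])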
================================================
FILE: lightgbm_config.json
================================================

{"objective": "binary", "task": "train", "boosting": "gbdt", "num_iterations": 500, "learning_rate": 0.1, "max_depth": -1, "num_leaves": 64, "tree_learner": "serial", "num_threads": 0, "device_type": "cpu", "seed": 0, "min_data_in_leaf": 100, "min_sum_hessian_in_leaf": 0.001, "bagging_fraction": 0.9, "bagging_freq": 1, "bagging_seed": 0, "feature_fraction": 0.9, "feature_fraction_bynode": 0.9, "feature_fraction_seed": 0, "early_stopping_rounds": 10, "first_metric_only": true, "max_delta_step": 0, "lambda_l1": 0, "lambda_l2": 1.0, "verbosity": 2, "is_unbalance": true, "sigmoid": 1.0, "boost_from_average": true, "metric": ["binary_logloss", "auc", "binary_error"]}

================================================
FILE: nets.py
================================================

# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

import torch
from torch import nn
import torch.nn.functional as F


class PENetwork(nn.Module):
    """
    This is a simple network loosely based on the one used in ALOHA: Auxiliary Loss Optimization
    for Hypothesis Augmentation (https://arxiv.org/abs/1903.05700). Note that it uses fewer
    (and smaller) layers, as well as a single layer for all tag predictions; performance will
    suffer accordingly.
    """

    def __init__(self, use_malware=True, use_counts=True, use_tags=True, n_tags=None,
                 feature_dimension=1024, layer_sizes=None):
        self.use_malware = use_malware
        self.use_counts = use_counts
        self.use_tags = use_tags
        self.n_tags = n_tags
        if self.use_tags and self.n_tags is None:
            raise ValueError("n_tags was None but we're trying to predict tags. Please include n_tags")
Please include n_tags") super(PENetwork,self).__init__() p = 0.05 layers = [] if layer_sizes is None:layer_sizes=[512,512,128] for i,ls in enumerate(layer_sizes): if i == 0: layers.append(nn.Linear(feature_dimension,ls)) else: layers.append(nn.Linear(layer_sizes[i-1],ls)) layers.append(nn.LayerNorm(ls)) layers.append(nn.ELU()) layers.append(nn.Dropout(p)) self.model_base = nn.Sequential(*tuple(layers)) self.malware_head = nn.Sequential(nn.Linear(layer_sizes[-1], 1), nn.Sigmoid()) self.count_head = nn.Linear(layer_sizes[-1], 1) self.sigmoid = nn.Sigmoid() self.tag_head = nn.Sequential(nn.Linear(layer_sizes[-1],64), nn.ELU(), nn.Linear(64,64), nn.ELU(), nn.Linear(64,n_tags), nn.Sigmoid()) def forward(self,data): rv = {} base_result = self.model_base.forward(data) if self.use_malware: rv['malware'] = self.malware_head(base_result) if self.use_counts: rv['count'] = self.count_head(base_result) if self.use_tags: rv['tags'] = self.tag_head(base_result) return rv ================================================ FILE: pe_full_metadata_example/32c37c352802fb20004fa14053ac13134f31aff747dc0a2962da2ea1ea894d74.json ================================================ { "0": { "DOS_HEADER": { "Structure": "IMAGE_DOS_HEADER", "e_magic": { "FileOffset": 0, "Offset": 0, "Value": 23117 }, "e_cblp": { "FileOffset": 2, "Offset": 2, "Value": 144 }, "e_cp": { "FileOffset": 4, "Offset": 4, "Value": 3 }, "e_crlc": { "FileOffset": 6, "Offset": 6, "Value": 0 }, "e_cparhdr": { "FileOffset": 8, "Offset": 8, "Value": 4 }, "e_minalloc": { "FileOffset": 10, "Offset": 10, "Value": 0 }, "e_maxalloc": { "FileOffset": 12, "Offset": 12, "Value": 65535 }, "e_ss": { "FileOffset": 14, "Offset": 14, "Value": 0 }, "e_sp": { "FileOffset": 16, "Offset": 16, "Value": 184 }, "e_csum": { "FileOffset": 18, "Offset": 18, "Value": 0 }, "e_ip": { "FileOffset": 20, "Offset": 20, "Value": 0 }, "e_cs": { "FileOffset": 22, "Offset": 22, "Value": 0 }, "e_lfarlc": { "FileOffset": 24, "Offset": 24, "Value": 64 }, "e_ovno": { "FileOffset": 26, "Offset": 26, "Value": 0 }, "e_res": { "FileOffset": 28, "Offset": 28, "Value": "\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00" }, "e_oemid": { "FileOffset": 36, "Offset": 36, "Value": 0 }, "e_oeminfo": { "FileOffset": 38, "Offset": 38, "Value": 0 }, "e_res2": { "FileOffset": 40, "Offset": 40, "Value": "\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00" }, "e_lfanew": { "FileOffset": 60, "Offset": 60, "Value": 128 } }, "NT_HEADERS": { "Structure": "IMAGE_NT_HEADERS", "Signature": { "FileOffset": 128, "Offset": 0, "Value": 17744 } }, "FILE_HEADER": { "Structure": "IMAGE_FILE_HEADER", "Machine": { "FileOffset": 132, "Offset": 0, "Value": 332 }, "NumberOfSections": { "FileOffset": 134, "Offset": 2, "Value": 3 }, "TimeDateStamp": { "FileOffset": 136, "Offset": 4, "Value": "0x586FEBD9 [Fri Jan 6 19:11:21 2017 UTC]" }, "PointerToSymbolTable": { "FileOffset": 140, "Offset": 8, "Value": 0 }, "NumberOfSymbols": { "FileOffset": 144, "Offset": 12, "Value": 0 }, "SizeOfOptionalHeader": { "FileOffset": 148, "Offset": 16, "Value": 224 }, "Characteristics": { "FileOffset": 150, "Offset": 18, "Value": 8450 } }, "Flags": [ "IMAGE_FILE_EXECUTABLE_IMAGE", "IMAGE_FILE_32BIT_MACHINE", "IMAGE_FILE_DLL" ], "OPTIONAL_HEADER": { "Structure": "IMAGE_OPTIONAL_HEADER", "Magic": { "FileOffset": 152, "Offset": 0, "Value": 267 }, "MajorLinkerVersion": { "FileOffset": 154, "Offset": 2, "Value": 8 }, "MinorLinkerVersion": { "FileOffset": 155, "Offset": 3, "Value": 0 }, "SizeOfCode": { 
"FileOffset": 156, "Offset": 4, "Value": 7168 }, "SizeOfInitializedData": { "FileOffset": 160, "Offset": 8, "Value": 1536 }, "SizeOfUninitializedData": { "FileOffset": 164, "Offset": 12, "Value": 0 }, "AddressOfEntryPoint": { "FileOffset": 168, "Offset": 16, "Value": 15278 }, "BaseOfCode": { "FileOffset": 172, "Offset": 20, "Value": 8192 }, "BaseOfData": { "FileOffset": 176, "Offset": 24, "Value": 16384 }, "ImageBase": { "FileOffset": 180, "Offset": 28, "Value": 4194304 }, "SectionAlignment": { "FileOffset": 184, "Offset": 32, "Value": 8192 }, "FileAlignment": { "FileOffset": 188, "Offset": 36, "Value": 512 }, "MajorOperatingSystemVersion": { "FileOffset": 192, "Offset": 40, "Value": 4 }, "MinorOperatingSystemVersion": { "FileOffset": 194, "Offset": 42, "Value": 0 }, "MajorImageVersion": { "FileOffset": 196, "Offset": 44, "Value": 0 }, "MinorImageVersion": { "FileOffset": 198, "Offset": 46, "Value": 0 }, "MajorSubsystemVersion": { "FileOffset": 200, "Offset": 48, "Value": 4 }, "MinorSubsystemVersion": { "FileOffset": 202, "Offset": 50, "Value": 0 }, "Reserved1": { "FileOffset": 204, "Offset": 52, "Value": 0 }, "SizeOfImage": { "FileOffset": 208, "Offset": 56, "Value": 32768 }, "SizeOfHeaders": { "FileOffset": 212, "Offset": 60, "Value": 512 }, "CheckSum": { "FileOffset": 216, "Offset": 64, "Value": 0 }, "Subsystem": { "FileOffset": 220, "Offset": 68, "Value": 3 }, "DllCharacteristics": { "FileOffset": 222, "Offset": 70, "Value": 34112 }, "SizeOfStackReserve": { "FileOffset": 224, "Offset": 72, "Value": 1048576 }, "SizeOfStackCommit": { "FileOffset": 228, "Offset": 76, "Value": 4096 }, "SizeOfHeapReserve": { "FileOffset": 232, "Offset": 80, "Value": 1048576 }, "SizeOfHeapCommit": { "FileOffset": 236, "Offset": 84, "Value": 4096 }, "LoaderFlags": { "FileOffset": 240, "Offset": 88, "Value": 0 }, "NumberOfRvaAndSizes": { "FileOffset": 244, "Offset": 92, "Value": 16 } }, "DllCharacteristics": [ "IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE", "IMAGE_DLLCHARACTERISTICS_NX_COMPAT", "IMAGE_DLLCHARACTERISTICS_NO_SEH", "IMAGE_DLLCHARACTERISTICS_TERMINAL_SERVER_AWARE" ], "PE Sections": [ { "Structure": "IMAGE_SECTION_HEADER", "Name": { "FileOffset": 376, "Offset": 0, "Value": ".text\\x00\\x00\\x00" }, "Misc": { "FileOffset": 384, "Offset": 8, "Value": 7092 }, "Misc_PhysicalAddress": { "FileOffset": 384, "Offset": 8, "Value": 7092 }, "Misc_VirtualSize": { "FileOffset": 384, "Offset": 8, "Value": 7092 }, "VirtualAddress": { "FileOffset": 388, "Offset": 12, "Value": 8192 }, "SizeOfRawData": { "FileOffset": 392, "Offset": 16, "Value": 7168 }, "PointerToRawData": { "FileOffset": 396, "Offset": 20, "Value": 512 }, "PointerToRelocations": { "FileOffset": 400, "Offset": 24, "Value": 0 }, "PointerToLinenumbers": { "FileOffset": 404, "Offset": 28, "Value": 0 }, "NumberOfRelocations": { "FileOffset": 408, "Offset": 32, "Value": 0 }, "NumberOfLinenumbers": { "FileOffset": 410, "Offset": 34, "Value": 0 }, "Characteristics": { "FileOffset": 412, "Offset": 36, "Value": 1610612768 }, "Flags": [ "IMAGE_SCN_CNT_CODE", "IMAGE_SCN_MEM_EXECUTE", "IMAGE_SCN_MEM_READ" ], "Entropy": 5.312053634802128, "MD5": "3622aa030f4a1f45fae880db94dd6e58", "SHA1": "85d9dfca1ed27be856f1fccc6898940903582895", "SHA256": "3c5510d2f545515f943715fe6d70a3df6b8d6d1b3e71aa50ab73a84de61be224", "SHA512": "09e657ec9bd5bfa60b6449e63cdc61a8fe28c989789e14a8109db4e3cb242fdb258dac841fee24b1488b405e4dbd6e5caf78549764f2becb598394797a2cc141" }, { "Structure": "IMAGE_SECTION_HEADER", "Name": { "FileOffset": 416, "Offset": 0, "Value": ".rsrc\\x00\\x00\\x00" }, 
"Misc": { "FileOffset": 424, "Offset": 8, "Value": 688 }, "Misc_PhysicalAddress": { "FileOffset": 424, "Offset": 8, "Value": 688 }, "Misc_VirtualSize": { "FileOffset": 424, "Offset": 8, "Value": 688 }, "VirtualAddress": { "FileOffset": 428, "Offset": 12, "Value": 16384 }, "SizeOfRawData": { "FileOffset": 432, "Offset": 16, "Value": 1024 }, "PointerToRawData": { "FileOffset": 436, "Offset": 20, "Value": 7680 }, "PointerToRelocations": { "FileOffset": 440, "Offset": 24, "Value": 0 }, "PointerToLinenumbers": { "FileOffset": 444, "Offset": 28, "Value": 0 }, "NumberOfRelocations": { "FileOffset": 448, "Offset": 32, "Value": 0 }, "NumberOfLinenumbers": { "FileOffset": 450, "Offset": 34, "Value": 0 }, "Characteristics": { "FileOffset": 452, "Offset": 36, "Value": 1073741888 }, "Flags": [ "IMAGE_SCN_CNT_INITIALIZED_DATA", "IMAGE_SCN_MEM_READ" ], "Entropy": 2.2461341734636235, "MD5": "54371107ba38386c1a7c0d2f6e8cb71e", "SHA1": "074a53a6946f543d03ec1f4bd60ed635db99db8a", "SHA256": "54be4b9cb66d217fadb7e478d1c18994b044bf3b9fb07b967527f6bba3302bd5", "SHA512": "9593dd761eeab980b8ff74288180c623e45e67aa98d0cbc2110b82aafeb98a5c0245f06c3c7c8e2256f97471daf1118a38af8ccddea63154ba98e6ee73a78735" }, { "Structure": "IMAGE_SECTION_HEADER", "Name": { "FileOffset": 456, "Offset": 0, "Value": ".reloc\\x00\\x00" }, "Misc": { "FileOffset": 464, "Offset": 8, "Value": 12 }, "Misc_PhysicalAddress": { "FileOffset": 464, "Offset": 8, "Value": 12 }, "Misc_VirtualSize": { "FileOffset": 464, "Offset": 8, "Value": 12 }, "VirtualAddress": { "FileOffset": 468, "Offset": 12, "Value": 24576 }, "SizeOfRawData": { "FileOffset": 472, "Offset": 16, "Value": 512 }, "PointerToRawData": { "FileOffset": 476, "Offset": 20, "Value": 8704 }, "PointerToRelocations": { "FileOffset": 480, "Offset": 24, "Value": 0 }, "PointerToLinenumbers": { "FileOffset": 484, "Offset": 28, "Value": 0 }, "NumberOfRelocations": { "FileOffset": 488, "Offset": 32, "Value": 0 }, "NumberOfLinenumbers": { "FileOffset": 490, "Offset": 34, "Value": 0 }, "Characteristics": { "FileOffset": 492, "Offset": 36, "Value": 1107296320 }, "Flags": [ "IMAGE_SCN_CNT_INITIALIZED_DATA", "IMAGE_SCN_MEM_DISCARDABLE", "IMAGE_SCN_MEM_READ" ], "Entropy": 0.08153941234324169, "MD5": "d9e08422d3077fe0be94f8ec16840100", "SHA1": "c4f4de8e850f478b5f7aedc719098b7283936488", "SHA256": "54543210ea95b9f1c287825f3361e06499aa1f6f32967ddd70efe5086694efd5", "SHA512": "757186cbe73a86467b3cd9dc69b3363806fbe88332ff07948399404ff2c8f3d36e603a230992c09cdb0776bca5bcc58296540d913bed535ab1ddcab1a71ee276" } ], "Directories": [ { "Structure": "IMAGE_DIRECTORY_ENTRY_EXPORT", "VirtualAddress": { "FileOffset": 248, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 252, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_IMPORT", "VirtualAddress": { "FileOffset": 256, "Offset": 0, "Value": 15192 }, "Size": { "FileOffset": 260, "Offset": 4, "Value": 83 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_RESOURCE", "VirtualAddress": { "FileOffset": 264, "Offset": 0, "Value": 16384 }, "Size": { "FileOffset": 268, "Offset": 4, "Value": 688 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_EXCEPTION", "VirtualAddress": { "FileOffset": 272, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 276, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_SECURITY", "VirtualAddress": { "FileOffset": 280, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 284, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_BASERELOC", "VirtualAddress": { "FileOffset": 288, "Offset": 0, "Value": 24576 }, "Size": 
{ "FileOffset": 292, "Offset": 4, "Value": 12 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_DEBUG", "VirtualAddress": { "FileOffset": 296, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 300, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_COPYRIGHT", "VirtualAddress": { "FileOffset": 304, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 308, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_GLOBALPTR", "VirtualAddress": { "FileOffset": 312, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 316, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_TLS", "VirtualAddress": { "FileOffset": 320, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 324, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_LOAD_CONFIG", "VirtualAddress": { "FileOffset": 328, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 332, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT", "VirtualAddress": { "FileOffset": 336, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 340, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_IAT", "VirtualAddress": { "FileOffset": 344, "Offset": 0, "Value": 8192 }, "Size": { "FileOffset": 348, "Offset": 4, "Value": 8 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT", "VirtualAddress": { "FileOffset": 352, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 356, "Offset": 4, "Value": 0 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR", "VirtualAddress": { "FileOffset": 360, "Offset": 0, "Value": 8200 }, "Size": { "FileOffset": 364, "Offset": 4, "Value": 72 } }, { "Structure": "IMAGE_DIRECTORY_ENTRY_RESERVED", "VirtualAddress": { "FileOffset": 368, "Offset": 0, "Value": 0 }, "Size": { "FileOffset": 372, "Offset": 4, "Value": 0 } } ], "Version Information": [ [ { "Structure": "VS_VERSIONINFO", "Length": { "FileOffset": 7768, "Offset": 0, "Value": 600 }, "ValueLength": { "FileOffset": 7770, "Offset": 2, "Value": 52 }, "Type": { "FileOffset": 7772, "Offset": 4, "Value": 0 } }, { "Structure": "VS_FIXEDFILEINFO", "Signature": { "FileOffset": 7808, "Offset": 0, "Value": 4277077181 }, "StrucVersion": { "FileOffset": 7812, "Offset": 4, "Value": 65536 }, "FileVersionMS": { "FileOffset": 7816, "Offset": 8, "Value": 720896 }, "FileVersionLS": { "FileOffset": 7820, "Offset": 12, "Value": 0 }, "ProductVersionMS": { "FileOffset": 7824, "Offset": 16, "Value": 720896 }, "ProductVersionLS": { "FileOffset": 7828, "Offset": 20, "Value": 0 }, "FileFlagsMask": { "FileOffset": 7832, "Offset": 24, "Value": 63 }, "FileFlags": { "FileOffset": 7836, "Offset": 28, "Value": 0 }, "FileOS": { "FileOffset": 7840, "Offset": 32, "Value": 4 }, "FileType": { "FileOffset": 7844, "Offset": 36, "Value": 2 }, "FileSubtype": { "FileOffset": 7848, "Offset": 40, "Value": 0 }, "FileDateMS": { "FileOffset": 7852, "Offset": 44, "Value": 0 }, "FileDateLS": { "FileOffset": 7856, "Offset": 48, "Value": 0 } } ] ], "Imported symbols": [ [ { "Structure": "IMAGE_IMPORT_DESCRIPTOR", "OriginalFirstThunk": { "FileOffset": 7512, "Offset": 0, "Value": 15232 }, "Characteristics": { "FileOffset": 7512, "Offset": 0, "Value": 15232 }, "TimeDateStamp": { "FileOffset": 7516, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "ForwarderChain": { "FileOffset": 7520, "Offset": 8, "Value": 0 }, "Name": { "FileOffset": 7524, "Offset": 12, "Value": 15262 }, "FirstThunk": { "FileOffset": 7528, "Offset": 16, "Value": 8192 } }, { "DLL": "mscoree.dll", "Name": "_CorDllMain", "Hint": 0 } ] ], "Resource directory": [ { 
"Structure": "IMAGE_RESOURCE_DIRECTORY", "Characteristics": { "FileOffset": 7680, "Offset": 0, "Value": 0 }, "TimeDateStamp": { "FileOffset": 7684, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "MajorVersion": { "FileOffset": 7688, "Offset": 8, "Value": 0 }, "MinorVersion": { "FileOffset": 7690, "Offset": 10, "Value": 0 }, "NumberOfNamedEntries": { "FileOffset": 7692, "Offset": 12, "Value": 0 }, "NumberOfIdEntries": { "FileOffset": 7694, "Offset": 14, "Value": 1 } }, { "Id": [ 16, "RT_VERSION" ], "Structure": "IMAGE_RESOURCE_DIRECTORY_ENTRY", "Name": { "FileOffset": 7696, "Offset": 0, "Value": 16 }, "OffsetToData": { "FileOffset": 7700, "Offset": 4, "Value": 2147483672 } }, [ { "Structure": "IMAGE_RESOURCE_DIRECTORY", "Characteristics": { "FileOffset": 7704, "Offset": 0, "Value": 0 }, "TimeDateStamp": { "FileOffset": 7708, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "MajorVersion": { "FileOffset": 7712, "Offset": 8, "Value": 0 }, "MinorVersion": { "FileOffset": 7714, "Offset": 10, "Value": 0 }, "NumberOfNamedEntries": { "FileOffset": 7716, "Offset": 12, "Value": 0 }, "NumberOfIdEntries": { "FileOffset": 7718, "Offset": 14, "Value": 1 } }, { "Id": 1, "Structure": "IMAGE_RESOURCE_DIRECTORY_ENTRY", "Name": { "FileOffset": 7720, "Offset": 0, "Value": 1 }, "OffsetToData": { "FileOffset": 7724, "Offset": 4, "Value": 2147483696 } }, [ { "Structure": "IMAGE_RESOURCE_DIRECTORY", "Characteristics": { "FileOffset": 7728, "Offset": 0, "Value": 0 }, "TimeDateStamp": { "FileOffset": 7732, "Offset": 4, "Value": "0x0 [Thu Jan 1 00:00:00 1970 UTC]" }, "MajorVersion": { "FileOffset": 7736, "Offset": 8, "Value": 0 }, "MinorVersion": { "FileOffset": 7738, "Offset": 10, "Value": 0 }, "NumberOfNamedEntries": { "FileOffset": 7740, "Offset": 12, "Value": 0 }, "NumberOfIdEntries": { "FileOffset": 7742, "Offset": 14, "Value": 1 } }, { "LANG": 0, "SUBLANG": 0, "LANG_NAME": "LANG_NEUTRAL", "SUBLANG_NAME": "SUBLANG_NEUTRAL", "Structure": "IMAGE_RESOURCE_DATA_ENTRY", "Name": { "FileOffset": 7744, "Offset": 0, "Value": 0 }, "OffsetToData": { "FileOffset": 7752, "Offset": 0, "Value": 16472 }, "Size": { "FileOffset": 7756, "Offset": 4, "Value": 600 }, "CodePage": { "FileOffset": 7760, "Offset": 8, "Value": 0 }, "Reserved": { "FileOffset": 7764, "Offset": 12, "Value": 0 } } ] ] ], "Base relocations": [ [ { "Structure": "IMAGE_BASE_RELOCATION", "VirtualAddress": { "FileOffset": 8704, "Offset": 0, "Value": 12288 }, "SizeOfBlock": { "FileOffset": 8708, "Offset": 4, "Value": 12 } }, { "RVA": 15280, "Type": "HIGHLOW" }, { "RVA": 12288, "Type": "ABSOLUTE" } ] ] } } ================================================ FILE: plot.py ================================================ # Copyright 2020, Sophos Limited. All rights reserved. # # 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of # Sophos Limited and Sophos Group. All other product and company # names mentioned are trademarks or registered trademarks of their # respective owners. 

================================================
FILE: plot.py
================================================
# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

import baker
import matplotlib

matplotlib.use('Agg')
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
import pandas as pd
import numpy as np
import json

default_tags = ['adware_tag', 'flooder_tag', 'ransomware_tag', 'dropper_tag', 'spyware_tag',
                'packed_tag', 'crypto_miner_tag', 'file_infector_tag', 'installer_tag',
                'worm_tag', 'downloader_tag']
default_tag_colors = ['r', 'r', 'r', 'g', 'g', 'b', 'b', 'm', 'm', 'c', 'c']
default_tag_linestyles = [':', '--', '-.', ':', '--', ':', '--', ':', '--', ':', '--']

style_dict = {tag: (color, linestyle) for tag, color, linestyle in
              zip(default_tags, default_tag_colors, default_tag_linestyles)}
style_dict['malware'] = ('k', '-')


def collect_dataframes(run_id_to_filename_dictionary):
    """Load a results dataframe for each run ID in the provided dictionary."""
    loaded_dataframes = {}
    for k, v in run_id_to_filename_dictionary.items():
        loaded_dataframes[k] = pd.read_csv(v)
    return loaded_dataframes


def get_tprs_at_fpr(result_dataframe, key, target_fprs=None):
    """
    Estimate the True Positive Rate for a dataframe/key combination at
    specific False Positive Rates of interest.

    :param result_dataframe: a pandas dataframe
    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is
        provided, the dataframe is expected to have columns named `pred_malware` and `label_malware`
    :param target_fprs: the FPRs at which to estimate the TPRs; either None (which uses the
        default np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])) or a 1-d numpy array
    :return: target_fprs and the corresponding TPRs
    """
    if target_fprs is None:
        target_fprs = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
    fpr, tpr, thresholds = get_roc_curve(result_dataframe, key)
    return target_fprs, np.interp(target_fprs, fpr, tpr)
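
# Illustrative only -- a minimal sketch of estimating TPR at fixed FPRs with the
# helper above (the results.csv path and its `label_malware`/`pred_malware`
# columns are hypothetical):
#
#   df = pd.read_csv('results.csv')
#   fprs, tprs = get_tprs_at_fpr(df, 'malware')
#   for f, t in zip(fprs, tprs):
#       print(f'TPR at FPR {f:g}: {t:.3f}')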

def get_roc_curve(result_dataframe, key):
    """
    Get the ROC curve for a single result in a dataframe.

    :param result_dataframe: a dataframe
    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is
        provided, the dataframe is expected to have columns named `pred_malware` and `label_malware`
    :return: false positive rates, true positive rates, and thresholds (all np.arrays)
    """
    labels = result_dataframe['label_{}'.format(key)]
    predictions = result_dataframe['pred_{}'.format(key)]
    return roc_curve(labels, predictions)


def get_auc_score(result_dataframe, key):
    """
    Get the Area Under the Curve for the indicated key in the dataframe.

    :param result_dataframe: a dataframe
    :param key: the name of the result to get the curve for; if (e.g.) the key 'malware' is
        provided, the dataframe is expected to have columns named `pred_malware` and `label_malware`
    :return: the AUC for the ROC generated for the provided key
    """
    labels = result_dataframe['label_{}'.format(key)]
    predictions = result_dataframe['pred_{}'.format(key)]
    return roc_auc_score(labels, predictions)


def interpolate_rocs(id_to_roc_dictionary, eval_fpr_points=None):
    """
    Interpolate several sets of ROC results to a common set of evaluation (FPR) points, to allow
    computing e.g. a mean ROC, or the pointwise variance of the curve, across multiple model fittings.

    :param id_to_roc_dictionary: a dictionary mapping run IDs to results from get_roc_curve
        (or sklearn.metrics.roc_curve), i.e. tuples of the form (fpr, tpr, thresholds)
    :param eval_fpr_points: the set of FPR values at which to interpolate the results;
        defaults to np.logspace(-6, 0, 1000)
    :return: eval_fpr_points -- the common set of FPR points to which the TPRs have been interpolated
        interpolated_tprs -- a dictionary mapping each run ID to its TPRs, interpolated at
        the corresponding entries of eval_fpr_points
    """
    if eval_fpr_points is None:
        eval_fpr_points = np.logspace(-6, 0, 1000)

    interpolated_tprs = {}
    for k, (fpr, tpr, thresh) in id_to_roc_dictionary.items():
        interpolated_tprs[k] = np.interp(eval_fpr_points, fpr, tpr)

    return eval_fpr_points, interpolated_tprs


def plot_roc_with_confidence(id_to_dataframe_dictionary, key, filename, include_range=False, style=None,
                             std_alpha=.2, range_alpha=.1):
    """
    Compute the mean and standard deviation of the ROC curve from a sequence of results and
    plot it with shading.

    :param id_to_dataframe_dictionary: a dictionary mapping run IDs to results dataframes
    :param key: the name of the result to plot (see get_roc_curve)
    :param filename: the file in which to save the plot
    :param include_range: if True, also shade the min/max TPR range across runs
    :param style: an optional (color, linestyle) tuple; defaults to the entry in style_dict for key
    :param std_alpha: the alpha value for the one-standard-deviation shading
    :param range_alpha: the alpha value for the min/max range shading
    """
    if not len(id_to_dataframe_dictionary) > 1:
        raise ValueError("Need a minimum of 2 result sets to plot confidence region; found {}".format(
            len(id_to_dataframe_dictionary)
        ))
    if style is None:
        if key in style_dict:
            color, linestyle = style_dict[key]
        else:
            raise ValueError(
                "No default style information is available for key {}; please provide (color, linestyle)".format(key))
    else:
        color, linestyle = style

    id_to_roc_dictionary = {k: get_roc_curve(df, key) for k, df in id_to_dataframe_dictionary.items()}
    fpr_points, interpolated_tprs = interpolate_rocs(id_to_roc_dictionary)
    tpr_array = np.vstack([v for v in interpolated_tprs.values()])

    mean_tpr = tpr_array.mean(0)
    std_tpr = np.sqrt(tpr_array.var(0))

    aucs = np.array([get_auc_score(v, key) for v in id_to_dataframe_dictionary.values()])
    mean_auc = aucs.mean()
    min_auc = aucs.min()
    max_auc = aucs.max()
    std_auc = np.sqrt(aucs.var())

    plt.figure(figsize=(12, 12))
    plt.semilogx(fpr_points, mean_tpr, color + linestyle, linewidth=2.0,
                 label=rf"{key}: {mean_auc:5.3f}$\pm${std_auc:5.3f} [{min_auc:5.3f}-{max_auc:5.3f}]")
    plt.fill_between(fpr_points, mean_tpr - std_tpr, mean_tpr + std_tpr, color=color, alpha=std_alpha)
    if include_range:
        plt.fill_between(fpr_points, tpr_array.min(0), tpr_array.max(0), color=color, alpha=range_alpha)
    plt.legend()
    plt.xlim(1e-6, 1.0)
    plt.ylim([0., 1.])
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.savefig(filename)
    plt.clf()


def plot_tag_results(dataframe, filename):
    """Plot overlaid ROC curves, one per tag in default_tags, and save the figure to filename."""
    all_tag_rocs = {tag: get_roc_curve(dataframe, tag) for tag in default_tags}
    eval_fpr_pts, interpolated_rocs = interpolate_rocs(all_tag_rocs)

    plt.figure(figsize=(12, 12))
    for tag in default_tags:
        color, linestyle = style_dict[tag]
        auc = get_auc_score(dataframe, tag)
        plt.semilogx(eval_fpr_pts, interpolated_rocs[tag], color + linestyle, linewidth=2.0,
                     label=f"{tag}:{auc:5.3f}")

    plt.legend(loc='best')
    plt.xlim(1e-6, 1.0)
    plt.ylim([0., 1.])
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.savefig(filename)
    plt.clf()
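
# Illustrative only -- averaging interpolated ROCs across runs with the helpers
# above, given a hypothetical dict `dfs` mapping run IDs to results dataframes:
#
#   rocs = {run_id: get_roc_curve(df, 'malware') for run_id, df in dfs.items()}
#   fpr_pts, tprs = interpolate_rocs(rocs)
#   mean_tpr = np.vstack(list(tprs.values())).mean(0)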
""" id_to_resultfile_dict = {'run': results_file} id_to_dataframe_dict = collect_dataframes(id_to_resultfile_dict) plot_tag_results(id_to_dataframe_dict['run'], output_filename) @baker.command def plot_roc_distribution_for_tag(run_to_filename_json, output_filename, tag_to_plot='malware', linestyle=None, color=None, include_range=False, std_alpha=.2, range_alpha=.1): """ Compute the mean and standard deviation of the TPR at a range of FPRS (the ROC curve) over several sets of results for a given tag. The run_to_filename_json file must have the following format: {"run_id_0": "/full/path/to/results.csv/for/run/0/results.csv", "run_id_1": "/full/path/to/results.csv/for/run/1/results.csv", ... } :param run_to_filename_json: A json file that contains a key-value map that links run IDs to the full path to a results file (including the file name) :param output_filename: The filename to save the resulting figure to :param tag_to_plot: the tag from the results to plot; defaults to "malware" :param linestyle: the linestyle to use in the plot (defaults to the tag value in plot.style_dict) :param color: the color to use in the plot (defaults to the tag value in plot.style_dict) :param include_range: plot the min/max value as well (default False) :param std_alpha: the alpha value for the shading for standard deviation range (default 0.2) :param range_alpha: the alpha value for the shading for range, if plotted (default 0.1) """ id_to_resultfile_dict = json.load(open(run_to_filename_json, 'r')) id_to_dataframe_dict = collect_dataframes(id_to_resultfile_dict) if color is None or linestyle is None: if not (color is None and linestyle is None): raise ValueError("both color and linestyle should either be specified or None") style = None else: style = (color, linestyle) plot_roc_with_confidence(id_to_dataframe_dict, tag_to_plot, output_filename, include_range=include_range, style=style, std_alpha=std_alpha, range_alpha=range_alpha) if __name__ == '__main__': baker.run() ================================================ FILE: shas_missing_ember_features.json ================================================ [File too large to display: 21.7 MB] ================================================ FILE: train.py ================================================ # Copyright 2020, Sophos Limited. All rights reserved. # # 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of # Sophos Limited and Sophos Group. All other product and company # names mentioned are trademarks or registered trademarks of their # respective owners. from dataset import Dataset from nets import PENetwork import warnings import os import baker import torch import torch.nn.functional as F from torch.utils import data import sys from generators import get_generator from config import device import config from logzero import logger from copy import deepcopy import numpy as np from collections import defaultdict from sklearn.metrics import roc_auc_score import pickle import json import lightgbm as lgb def compute_loss(predictions, labels, loss_wts={'malware': 1.0, 'count': 0.1, 'tags': 0.1}): """ Compute losses for a malware feed-forward neural network (optionally with SMART tags and vendor detection count auxiliary losses). 

================================================
FILE: shas_missing_ember_features.json
================================================
[File too large to display: 21.7 MB]

================================================
FILE: train.py
================================================
# Copyright 2020, Sophos Limited. All rights reserved.
#
# 'Sophos' and 'Sophos Anti-Virus' are registered trademarks of
# Sophos Limited and Sophos Group. All other product and company
# names mentioned are trademarks or registered trademarks of their
# respective owners.

from dataset import Dataset
from nets import PENetwork
import warnings
import os
import baker
import torch
import torch.nn.functional as F
from torch.utils import data
import sys
from generators import get_generator
from config import device
import config
from logzero import logger
from copy import deepcopy
import numpy as np
from collections import defaultdict
from sklearn.metrics import roc_auc_score
import pickle
import json
import lightgbm as lgb


def compute_loss(predictions, labels, loss_wts={'malware': 1.0, 'count': 0.1, 'tags': 0.1}):
    """
    Compute losses for a malware feed-forward neural network (optionally with SMART tags and
    vendor detection count auxiliary losses).

    :param predictions: a dictionary of results from a PENetwork model
    :param labels: a dictionary of labels
    :param loss_wts: weights to assign to each head of the network (if the corresponding labels
        are present); defaults to the values used in the ALOHA paper (1.0 for malware,
        0.1 for the count and each tag)
    :return: a dictionary containing the detached per-head loss values and the weighted
        'total' loss tensor
    """
    loss_dict = {'total': 0.}
    if 'malware' in labels:
        malware_labels = labels['malware'].float().to(device)
        malware_loss = F.binary_cross_entropy(predictions['malware'].reshape(malware_labels.shape),
                                              malware_labels)
        weight = loss_wts['malware'] if 'malware' in loss_wts else 1.0
        loss_dict['malware'] = deepcopy(malware_loss.item())
        loss_dict['total'] += malware_loss * weight
    if 'count' in labels:
        count_labels = labels['count'].float().to(device)
        count_loss = torch.nn.PoissonNLLLoss()(predictions['count'].reshape(count_labels.shape),
                                               count_labels)
        weight = loss_wts['count'] if 'count' in loss_wts else 1.0
        loss_dict['count'] = deepcopy(count_loss.item())
        loss_dict['total'] += count_loss * weight
    if 'tags' in labels:
        tag_labels = labels['tags'].float().to(device)
        tags_loss = F.binary_cross_entropy(predictions['tags'], tag_labels)
        weight = loss_wts['tags'] if 'tags' in loss_wts else 1.0
        loss_dict['tags'] = deepcopy(tags_loss.item())
        loss_dict['total'] += tags_loss * weight
    return loss_dict
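
# Illustrative only -- a tiny smoke test of the loss aggregation above, using a
# hypothetical batch of four random malware-head predictions (runs on whatever
# `device` config.py selects):
#
#   preds = {'malware': torch.rand(4, 1).to(device)}
#   lbls = {'malware': torch.randint(0, 2, (4,))}
#   print(compute_loss(preds, lbls))  # e.g. {'total': tensor(...), 'malware': ...}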
""" workers = workers if workers is None else int(workers) os.system('mkdir -p {}'.format(checkpoint_dir)) if random_seed is not None: logger.info(f"Setting random seed to {int(random_seed)}.") torch.manual_seed(int(random_seed)) logger.info('...instantiating network') model = PENetwork(use_malware=True, use_counts=True, use_tags=True, n_tags=len(Dataset.tags), feature_dimension=feature_dimension).to(device) opt = torch.optim.Adam(model.parameters()) generator = get_generator(path=train_db_path, mode='train', use_malicious_labels=use_malicious_labels, use_count_labels=use_count_labels, use_tag_labels=use_tag_labels, num_workers = workers, remove_missing_features=remove_missing_features) val_generator = get_generator(path = train_db_path, mode='validation', use_malicious_labels=use_malicious_labels, use_count_labels=use_count_labels, use_tag_labels=use_tag_labels, num_workers=workers, remove_missing_features=remove_missing_features) steps_per_epoch = len(generator) val_steps_per_epoch = len(val_generator) for epoch in range(1, max_epochs + 1): loss_histories = defaultdict(list) model.train() for i, (features, labels) in enumerate(generator): opt.zero_grad() features = deepcopy(features).to(device) out = model(features) loss_dict = compute_loss(out, deepcopy(labels)) loss = loss_dict['total'] loss.backward() opt.step() for k in loss_dict.keys(): if k == 'total': loss_histories[k].append(deepcopy(loss_dict[k].detach().cpu().item())) else: loss_histories[k].append(loss_dict[k]) loss_str = " ".join([f"{key} loss:{value:7.3f}" for key, value in loss_dict.items()]) loss_str += " | " loss_str += " ".join([f"{key} mean:{np.mean(value):7.3f}" for key, value in loss_histories.items()]) sys.stdout.write('\r Epoch: {}/{} {}/{} '.format(epoch, max_epochs, i + 1, steps_per_epoch) + loss_str) sys.stdout.flush() del features, labels # do our best to avoid weird references that lead to generator errors torch.save(model.state_dict(), os.path.join(checkpoint_dir, "epoch_{}.pt".format(str(epoch)))) print() loss_histories = defaultdict(list) model.eval() for i, (features, labels) in enumerate(val_generator): features = deepcopy(features).to(device) with torch.no_grad(): out = model(features) loss_dict = compute_loss(out, deepcopy(labels)) loss = loss_dict['total'] for k in loss_dict.keys(): if k == 'total': loss_histories[k].append(deepcopy(loss_dict[k].detach().cpu().item())) else: loss_histories[k].append(loss_dict[k]) loss_str = " ".join([f"{key} loss:{value:7.3f}" for key, value in loss_dict.items()]) loss_str += " | " loss_str += " ".join([f"{key} mean:{np.mean(value):7.3f}" for key, value in loss_histories.items()]) sys.stdout.write('\r Val: {}/{} {}/{} '.format(epoch, max_epochs, i + 1, val_steps_per_epoch) + loss_str) sys.stdout.flush() del features, labels # do our best to avoid weird references that lead to generator errors print() print('...done') @baker.command def train_lightGBM(train_npz_file, validation_npz_file, model_configuration_file, checkpoint_dir, random_seed=None): """ Train a lightGBM model. Note that this is done entirely in-memory and requires a substantial amount of RAM (approximately 175GB). Baseline models were trained on an Amazon m5.24xlarge instance. 

@baker.command
def train_lightGBM(train_npz_file, validation_npz_file, model_configuration_file, checkpoint_dir, random_seed=None):
    """
    Train a LightGBM model. Note that this is done entirely in memory and requires a substantial
    amount of RAM (approximately 175 GB); baseline models were trained on an Amazon m5.24xlarge instance.

    :param train_npz_file: path to a .npz file containing features in 'arr_0' and labels in 'arr_1'
        for the training data
    :param validation_npz_file: path to a .npz file containing features in 'arr_0' and labels in 'arr_1'
        for the validation data
    :param model_configuration_file: path to a json file specifying LightGBM parameters
        (see lightgbm_config.json for an example)
    :param checkpoint_dir: location to write the trained model to
    :param random_seed: defaults to None (no seeding); otherwise, an integer providing a fixed
        random seed for the experiment
    """
    logger.info("Loading model config json file...")
    # Use a local name that does not shadow the module-level `config` import.
    with open(model_configuration_file, 'r') as f:
        params = json.load(f)
    if random_seed is not None:
        random_seed = int(random_seed)
        params['seed'] = random_seed
        params['bagging_seed'] = random_seed
        params['feature_fraction_seed'] = random_seed
    logger.info("Loading train data...")
    train_npz = np.load(train_npz_file)
    train_fts, train_lbls = train_npz['arr_0'], train_npz['arr_1']
    val_npz = np.load(validation_npz_file)
    val_fts, val_lbls = val_npz['arr_0'], val_npz['arr_1']
    logger.info("Converting data to lightgbm.Dataset")
    train_data = lgb.Dataset(train_fts, label=train_lbls)
    val_data = lgb.Dataset(val_fts, label=val_lbls)
    logger.info("Starting training")
    bst = lgb.train(params=params, train_set=train_data, valid_sets=[val_data])
    os.makedirs(checkpoint_dir, exist_ok=True)  # portable equivalent of `mkdir -p`
    modelfile = os.path.join(checkpoint_dir, 'lightgbm.model')
    logger.info(f"Saving model to {modelfile}")
    bst.save_model(modelfile)


if __name__ == '__main__':
    baker.run()
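
# Example invocation (hypothetical paths; the .npz inputs are assumed to have been
# produced with build_numpy_arrays_for_lightgbm.py):
#
#   python train.py train_lightGBM train.npz validation.npz lightgbm_config.json ./gbm_checkpoints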