Repository: gadiluna/SAFE Branch: master Commit: fddfca90e111 Files: 40 Total size: 100.8 KB Directory structure: gitextract_60c_bmdf/ ├── 404.html ├── Gemfile ├── LICENSE ├── README.md ├── __init__.py ├── _config.yml ├── asm_embedding/ │ ├── DocumentManipulation.py │ ├── FunctionAnalyzerRadare.py │ ├── FunctionNormalizer.py │ ├── InstructionsConverter.py │ └── __init__.py ├── dataset_creation/ │ ├── DataSplitter.py │ ├── DatabaseFactory.py │ ├── ExperimentUtil.py │ ├── FunctionsEmbedder.py │ ├── __init__.py │ └── convertDB.py ├── download_model.sh ├── downloader.py ├── function_search/ │ ├── EvaluateSearchEngine.py │ ├── FunctionSearchEngine.py │ ├── __init__.py │ └── fromJsonSearchToPlot.py ├── godown.pl ├── helloworld.c ├── helloworld.o ├── index.md ├── neural_network/ │ ├── PairFactory.py │ ├── SAFEEmbedder.py │ ├── SAFE_model.py │ ├── SiameseSAFE.py │ ├── __init__.py │ ├── freeze_graph.sh │ ├── parameters.py │ ├── train.py │ └── train.sh ├── requirements.txt ├── safe.py └── utils/ ├── __init__.py └── utils.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: 404.html ================================================ --- layout: default ---

404

Page not found :(

The requested page could not be found.

================================================ FILE: Gemfile ================================================
source "https://rubygems.org"

# Hello! This is where you manage which Jekyll version is used to run.
# When you want to use a different version, change it below, save the
# file and run `bundle install`. Run Jekyll with `bundle exec`, like so:
#
#     bundle exec jekyll serve
#
# This will help ensure the proper Jekyll version is running.
# Happy Jekylling!
gem "jekyll", "~> 3.7.4"

# This is the default theme for new Jekyll sites. You may change this to anything you like.
gem "minima", "~> 2.0"

# If you want to use GitHub Pages, remove the "gem "jekyll"" above and
# uncomment the line below. To upgrade, run `bundle update github-pages`.
# gem "github-pages", group: :jekyll_plugins
#gem "github-pages", group: :jekyll_plugins

# If you have any plugins, put them here!
group :jekyll_plugins do
  gem "jekyll-feed", "~> 0.6"
end

# Windows does not include zoneinfo files, so bundle the tzinfo-data gem
gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]

# Performance-booster for watching directories on Windows
gem "wdm", "~> 0.1.0" if Gem.win_platform?

================================================ FILE: LICENSE ================================================
Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni

================================================ FILE: README.md ================================================
# SAFE : Self Attentive Function Embedding

Paper
---
This software is the outcome of our academic research. See our arXiv paper: [arxiv](https://arxiv.org/abs/1811.05296)

If you use this code, please cite our academic paper as:

```bibtex
@inproceedings{massarelli2018safe,
  title={SAFE: Self-Attentive Function Embeddings for Binary Similarity},
  author={Massarelli, Luca and Di Luna, Giuseppe Antonio and Petroni, Fabio and Querzoni, Leonardo and Baldoni, Roberto},
  booktitle={Proceedings of 16th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA)},
  year={2019}
}
```

What you need
-----
You need [radare2](https://github.com/radare/radare2) installed on your system.

Quickstart
-----
To create the embedding of a function:

```
git clone https://github.com/gadiluna/SAFE.git
pip install -r requirements.txt
chmod +x download_model.sh
./download_model.sh
python safe.py -m data/safe.pb -i helloworld.o -a 100000F30
```

#### What to do with an embedding?
Once you have two embeddings ```embedding_x``` and ```embedding_y``` you can compute the similarity of the corresponding functions as:

```
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embedding_x, embedding_y)
```

Data Needed
-----
SAFE needs a few pieces of data to work. Two are essential: a model that tells SAFE how to convert assembly instructions into vectors (the i2v model) and a model that tells SAFE how to convert a binary function into a vector. Both models can be downloaded using the command

```
./download_model.sh
```

The downloader fetches the models and places them in the directory data. The directory tree after the download should be:

```
safe/-- githubcode
     \
      \--data/-----safe.pb
              \
               \---i2v/
```

The safe.pb file contains the SAFE model used to convert binary functions to vectors. The i2v folder contains the i2v model.

Hardcore Details
----
This section contains details that are needed to replicate our experiments; if you are a user of SAFE you can skip it. The sketch below shows how the components described in this section fit together in code.
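The following is a minimal sketch of how these pieces fit together programmatically. It is only an illustration, not a replacement for `safe.py`: it assumes the default `data/` layout produced by `download_model.sh`, the default 150-instruction cap per function, and hypothetical function names in the final comparison.

```
# Minimal sketch (not the official safe.py): embed every function in an
# object file and compare two of them. Assumes the default data/ layout
# produced by download_model.sh and a 150-instruction cap per function.
from asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer
from asm_embedding.FunctionNormalizer import FunctionNormalizer
from asm_embedding.InstructionsConverter import InstructionsConverter
from neural_network.SAFEEmbedder import SAFEEmbedder
from sklearn.metrics.pairwise import cosine_similarity

converter = InstructionsConverter("data/i2v/word2id.json")   # i2v dictionary
normalizer = FunctionNormalizer(max_instruction=150)         # pad/truncate functions
embedder = SAFEEmbedder("data/safe.pb")                      # frozen SAFE model
embedder.loadmodel()
embedder.get_tensor()

# Disassemble helloworld.o with radare2 and filter its instructions
analyzer = RadareFunctionAnalyzer("helloworld.o", use_symbol=False, depth=0)
functions = analyzer.analyze()
analyzer.close()

embeddings = {}
for name, func in functions.items():
    ids = converter.convert_to_ids(func["filtered_instructions"])
    padded, lengths = normalizer.normalize_functions([ids])
    embeddings[name] = embedder.embedd(padded, lengths)  # shape (1, embedding_size)

# Compare any two functions (the names below are hypothetical; the actual
# keys depend on what radare2 finds in your binary):
# sim = cosine_similarity(embeddings["main"], embeddings["some_other_function"])
```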
### Safe.pb
This is the frozen TensorFlow model trained for the AMD64 architecture. You can import it in your project using:

```
import tensorflow as tf

with tf.gfile.GFile("safe.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def)
sess = tf.Session(graph=graph)
```

see file: neural_network/SAFEEmbedder.py

### i2v
The i2v folder contains two files: a matrix where each row is the embedding of an asm instruction, and a JSON file containing a dictionary that maps asm instructions to row numbers of the matrix above.

see file: asm_embedding/InstructionsConverter.py

## Train the model
If you want to train the model using our datasets you first have to run:

```
python3 downloader.py -td
```

This will download the datasets into the data folder. Note that the datasets are compressed, so you have to decompress them yourself. The datasets are sqlite databases. To start training use neural_network/train.sh. The db can be selected by changing the parameter in train.sh. If you want information on the datasets see our paper.

## Create your own dataset
If you want to create your own dataset you can use the script ExperimentUtil in the folder dataset_creation.

## Create a functions knowledge base
If you want to use the SAFE binary code search engine you can use the script ExperimentUtil to create the knowledge base. Then you can search through it using the scripts in function_search.

Related Projects
---
* YARASAFE: Automatic Binary Function Similarity Checks with Yara (https://github.com/lucamassarelli/yarasafe)
* SAFEtorch: PyTorch implementation of the SAFE neural network (https://github.com/facebookresearch/SAFEtorch)

Thanks
---
In our code we use [godown](https://github.com/circulosmeos/gdown.pl) to download data from Google Drive. We thank circulosmeos, the creator of godown. We thank Davide Italiano for the useful discussions.

================================================ FILE: __init__.py ================================================

================================================ FILE: _config.yml ================================================
# Welcome to Jekyll!
#
# This config file is meant for settings that affect your whole blog, values
# which you are expected to set up once and rarely edit after that. If you find
# yourself editing this file very often, consider using Jekyll's data files
# feature for the data you need to update frequently.
#
# For technical reasons, this file is *NOT* reloaded automatically when you use
# 'bundle exec jekyll serve'. If you change this file, please restart the server process.

# Site settings
# These are used to personalize your new site. If you look in the HTML files,
# you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
# You can create any custom variable you would like, and they will be accessible
# in the templates via {{ site.myvariable }}.
title: 'SAFE: Self-Attentive Function Embeddings'
email: safeteam@gmail.com
description: >- # this means to ignore newlines until "baseurl:"
  Self-Attentive Function Embeddings for binary similarity.
  https://arxiv.org/abs/1811.05296
baseurl: "" # the subpath of your site, e.g. /blog
url: "" # the base hostname & protocol for your site, e.g. http://example.com
twitter_username:
github_username:

# Build settings
markdown: kramdown
theme: minima
#theme: jekyll-theme-midnight
plugins:
  - jekyll-feed

# Exclude from processing.
# The following items will not be processed, by default.
Create a custom list # to override the default setting. # exclude: # - Gemfile # - Gemfile.lock # - node_modules # - vendor/bundle/ # - vendor/cache/ # - vendor/gems/ # - vendor/ruby/ ================================================ FILE: asm_embedding/DocumentManipulation.py ================================================ import json import re import os def list_to_str(li): i='' for x in li: i=i+' '+x i=i+' endfun'*5 return i def document_append(strin): with open('/Users/giuseppe/docuent_X86','a') as f: f.write(strin) ciro=set() cantina=[] num_total=0 num_filtered=0 with open('/Users/giuseppe/dump.x86.linux.json') as f: l=f.readline() print('loaded') r = re.split('(\[.*?\])(?= *\[)', l) del l for x in r: if '[' in x: gennaro=json.loads(x) for materdomini in gennaro: num_total=num_total+1 if materdomini[0] not in ciro: ciro.add(materdomini[0]) num_filtered=num_filtered+1 a=list_to_str(materdomini[1]) document_append(a) del x print(num_total) print(num_filtered) ================================================ FILE: asm_embedding/FunctionAnalyzerRadare.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import json import r2pipe class RadareFunctionAnalyzer: def __init__(self, filename, use_symbol, depth): self.r2 = r2pipe.open(filename, flags=['-2']) self.filename = filename self.arch, _ = self.get_arch() self.top_depth = depth self.use_symbol = use_symbol def __enter__(self): return self @staticmethod def filter_reg(op): return op["value"] @staticmethod def filter_imm(op): imm = int(op["value"]) if -int(5000) <= imm <= int(5000): ret = str(hex(op["value"])) else: ret = str('HIMM') return ret @staticmethod def filter_mem(op): if "base" not in op: op["base"] = 0 if op["base"] == 0: r = "[" + "MEM" + "]" else: reg_base = str(op["base"]) disp = str(op["disp"]) scale = str(op["scale"]) r = '[' + reg_base + "*" + scale + "+" + disp + ']' return r @staticmethod def filter_memory_references(i): inst = "" + i["mnemonic"] for op in i["operands"]: if op["type"] == 'reg': inst += " " + RadareFunctionAnalyzer.filter_reg(op) elif op["type"] == 'imm': inst += " " + RadareFunctionAnalyzer.filter_imm(op) elif op["type"] == 'mem': inst += " " + RadareFunctionAnalyzer.filter_mem(op) if len(i["operands"]) > 1: inst = inst + "," if "," in inst: inst = inst[:-1] inst = inst.replace(" ", "_") return str(inst) @staticmethod def get_callref(my_function, depth): calls = {} if 'callrefs' in my_function and depth > 0: for cc in my_function['callrefs']: if cc["type"] == "C": calls[cc['at']] = cc['addr'] return calls def get_instruction(self): instruction = json.loads(self.r2.cmd("aoj 1")) if len(instruction) > 0: instruction = instruction[0] else: return None operands = [] if 'opex' not in instruction: return None for op in instruction['opex']['operands']: operands.append(op) instruction['operands'] = operands return instruction def function_to_inst(self, functions_dict, my_function, depth): instructions = [] asm = "" if self.use_symbol: s = my_function['vaddr'] else: s = my_function['offset'] calls = RadareFunctionAnalyzer.get_callref(my_function, depth) self.r2.cmd('s ' + str(s)) if self.use_symbol: end_address = s + my_function["size"] else: end_address = s + my_function["realsz"] while s < end_address: instruction = self.get_instruction() asm += instruction["bytes"] if self.arch == 'x86': filtered_instruction = "X_" + RadareFunctionAnalyzer.filter_memory_references(instruction) elif 
self.arch == 'arm': filtered_instruction = "A_" + RadareFunctionAnalyzer.filter_memory_references(instruction) instructions.append(filtered_instruction) if s in calls and depth > 0: if calls[s] in functions_dict: ii, aa = self.function_to_inst(functions_dict, functions_dict[calls[s]], depth-1) instructions.extend(ii) asm += aa self.r2.cmd("s " + str(s)) self.r2.cmd("so 1") s = int(self.r2.cmd("s"), 16) return instructions, asm def get_arch(self): try: info = json.loads(self.r2.cmd('ij')) if 'bin' in info: arch = info['bin']['arch'] bits = info['bin']['bits'] except: print("Error loading file") arch = None bits = None return arch, bits def find_functions(self): self.r2.cmd('aaa') try: function_list = json.loads(self.r2.cmd('aflj')) except: function_list = [] return function_list def find_functions_by_symbols(self): self.r2.cmd('aa') try: symbols = json.loads(self.r2.cmd('isj')) fcn_symb = [s for s in symbols if s['type'] == 'FUNC'] except: fcn_symb = [] return fcn_symb def analyze(self): if self.use_symbol: function_list = self.find_functions_by_symbols() else: function_list = self.find_functions() functions_dict = {} if self.top_depth > 0: for my_function in function_list: if self.use_symbol: functions_dict[my_function['vaddr']] = my_function else: functions_dict[my_function['offset']] = my_function result = {} for my_function in function_list: if self.use_symbol: address = my_function['vaddr'] else: address = my_function['offset'] try: instructions, asm = self.function_to_inst(functions_dict, my_function, self.top_depth) result[my_function['name']] = {'filtered_instructions': instructions, "asm": asm, "address": address} except: print("Error in functions: {} from {}".format(my_function['name'], self.filename)) pass return result def close(self): self.r2.quit() def __exit__(self, exc_type, exc_value, traceback): self.r2.quit() ================================================ FILE: asm_embedding/FunctionNormalizer.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import numpy as np class FunctionNormalizer: def __init__(self, max_instruction): self.max_instructions = max_instruction def normalize(self, f): f = np.asarray(f[0:self.max_instructions]) length = f.shape[0] if f.shape[0] < self.max_instructions: f = np.pad(f, (0, self.max_instructions - f.shape[0]), mode='constant') return f, length def normalize_function_pairs(self, pairs): lengths = [] new_pairs = [] for x in pairs: f0, len0 = self.normalize(x[0]) f1, len1 = self.normalize(x[1]) lengths.append((len0, len1)) new_pairs.append((f0, f1)) return new_pairs, lengths def normalize_functions(self, functions): lengths = [] new_functions = [] for f in functions: f, length = self.normalize(f) lengths.append(length) new_functions.append(f) return new_functions, lengths ================================================ FILE: asm_embedding/InstructionsConverter.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import json class InstructionsConverter: def __init__(self, json_i2id): f = open(json_i2id, 'r') self.i2id = json.load(f) f.close() def convert_to_ids(self, instructions_list): ret_array = [] # For each instruction we add +1 to its ID because the first # element of the embedding matrix is zero for x in instructions_list: if x in self.i2id: ret_array.append(self.i2id[x] + 1) elif 'X_' in 
x: # print(str(x) + " is not a known x86 instruction") ret_array.append(self.i2id['X_UNK'] + 1) elif 'A_' in x: # print(str(x) + " is not a known arm instruction") ret_array.append(self.i2id['A_UNK'] + 1) else: # print("There is a problem " + str(x) + " does not appear to be an asm or arm instruction") ret_array.append(self.i2id['X_UNK'] + 1) return ret_array ================================================ FILE: asm_embedding/__init__.py ================================================ ================================================ FILE: dataset_creation/DataSplitter.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import json import random import sqlite3 from tqdm import tqdm class DataSplitter: def __init__(self, db_name): self.db_name = db_name def create_pair_table(self, table_name): conn = sqlite3.connect(self.db_name) c = conn.cursor() c.executescript("DROP TABLE IF EXISTS {} ".format(table_name)) c.execute("CREATE TABLE {} (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)".format(table_name)) conn.commit() conn.close() def get_ids(self, set_type): conn = sqlite3.connect(self.db_name) cur = conn.cursor() q = cur.execute("SELECT id FROM {}".format(set_type)) ids = q.fetchall() conn.close() return ids @staticmethod def select_similar_cfg(id, provenance, ids, cursor): q1 = cursor.execute('SELECT id FROM functions WHERE project=? AND file_name=? and function_name=?', provenance) candidates = [i[0] for i in q1.fetchall() if (i[0] != id and i[0] in ids)] if len(candidates) == 0: return None id_similar = random.choice(candidates) return id_similar @staticmethod def select_dissimilar_cfg(ids, provenance, cursor): while True: id_dissimilar = random.choice(ids) q2 = cursor.execute('SELECT project, file_name, function_name FROM functions WHERE id=?', id_dissimilar) res = q2.fetchone() if res != provenance: break return id_dissimilar def create_epoch_pairs(self, epoch_number, pairs_table,id_table): random.seed = epoch_number conn = sqlite3.connect(self.db_name) cur = conn.cursor() ids = cur.execute("SELECT id FROM "+id_table).fetchall() id_set=set(ids) true_pair = [] false_pair = [] for my_id in tqdm(ids): q = cur.execute('SELECT project, file_name, function_name FROM functions WHERE id =?', my_id) cfg_0_provenance = q.fetchone() id_sim = DataSplitter.select_similar_cfg(my_id, cfg_0_provenance, id_set, cur) id_dissim = DataSplitter.select_dissimilar_cfg(ids, cfg_0_provenance, cur) if id_sim is not None and id_dissim is not None: true_pair.append((my_id, id_sim)) false_pair.append((my_id, id_dissim)) true_pair = str(json.dumps(true_pair)) false_pair = str(json.dumps(false_pair)) cur.execute("INSERT INTO {} VALUES (?,?,?)".format(pairs_table), (epoch_number, true_pair, false_pair)) conn.commit() conn.close() def create_pairs(self, total_epochs): self.create_pair_table('train_pairs') self.create_pair_table('validation_pairs') self.create_pair_table('test_pairs') for i in range(0, total_epochs): print("Creating training pairs for epoch {} of {}".format(i, total_epochs)) self.create_epoch_pairs(i, 'train_pairs','train') print("Creating validation pairs") self.create_epoch_pairs(0, 'validation_pairs','validation') print("Creating test pairs") self.create_epoch_pairs(0, "test_pairs",'test') @staticmethod def prepare_set(data_to_include, table_name, file_list, cur): i = 0 while i < data_to_include and len(file_list) > 0: choice = random.choice(file_list) 
file_list.remove(choice) q = cur.execute("SELECT id FROM functions where project=? AND file_name=?", choice) data = q.fetchall() cur.executemany("INSERT INTO {} VALUES (?)".format(table_name), data) i += len(data) return file_list, i def split_data(self, validation_dim, test_dim): random.seed = 12345 conn = sqlite3.connect(self.db_name) c = conn.cursor() q = c.execute('''SELECT project, file_name FROM functions ''') data = q.fetchall() conn.commit() num_data = len(data) num_test = int(num_data * test_dim) num_validation = int(num_data * validation_dim) filename = list(set(data)) c.execute("DROP TABLE IF EXISTS train") c.execute("DROP TABLE IF EXISTS test") c.execute("DROP TABLE IF EXISTS validation") c.execute("CREATE TABLE IF NOT EXISTS train (id INTEGER PRIMARY KEY)") c.execute("CREATE TABLE IF NOT EXISTS validation (id INTEGER PRIMARY KEY)") c.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY)") c.execute('''CREATE INDEX IF NOT EXISTS my_index ON functions(project, file_name, function_name)''') c.execute('''CREATE INDEX IF NOT EXISTS my_index_2 ON functions(project, file_name)''') filename, test_num = DataSplitter.prepare_set(num_test, 'test', filename, conn.cursor()) conn.commit() assert len(filename) > 0 filename, val_num = self.prepare_set(num_validation, 'validation', filename, conn.cursor()) conn.commit() assert len(filename) > 0 _, train_num = self.prepare_set(num_data - num_test - num_validation, 'train', filename, conn.cursor()) conn.commit() print("Train Size: {}".format(train_num)) print("Validation Size: {}".format(val_num)) print("Test Size: {}".format(test_num)) ================================================ FILE: dataset_creation/DatabaseFactory.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni from asm_embedding.InstructionsConverter import InstructionsConverter from asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer import json import multiprocessing from multiprocessing import Pool from multiprocessing.dummy import Pool as ThreadPool import os import random import signal import sqlite3 from tqdm import tqdm class DatabaseFactory: def __init__(self, db_name, root_path): self.db_name = db_name self.root_path = root_path @staticmethod def worker(item): DatabaseFactory.analyze_file(item) return 0 @staticmethod def extract_function(graph_analyzer): return graph_analyzer.extractAll() @staticmethod def insert_in_db(db_name, pool_sem, func, filename, function_name, instruction_converter): path = filename.split(os.sep) if len(path) < 4: return asm = func["asm"] instructions_list = func["filtered_instructions"] instruction_ids = json.dumps(instruction_converter.convert_to_ids(instructions_list)) pool_sem.acquire() conn = sqlite3.connect(db_name) cur = conn.cursor() cur.execute('''INSERT INTO functions VALUES (?,?,?,?,?,?,?,?)''', (None, # id path[-4], # project path[-3], # compiler path[-2], # optimization path[-1], # file_name function_name, # function_name asm, # asm len(instructions_list)) # num of instructions ) inserted_id = cur.lastrowid cur.execute('''INSERT INTO filtered_functions VALUES (?,?)''', (inserted_id, instruction_ids) ) conn.commit() conn.close() pool_sem.release() @staticmethod def analyze_file(item): global pool_sem os.setpgrp() filename = item[0] db = item[1] use_symbol = item[2] depth = item[3] instruction_converter = item[4] analyzer = RadareFunctionAnalyzer(filename, use_symbol, depth) p = 
ThreadPool(1) res = p.apply_async(analyzer.analyze) try: result = res.get(120) except multiprocessing.TimeoutError: print("Aborting due to timeout:" + str(filename)) print('Try to modify the timeout value in DatabaseFactory instruction result = res.get(TIMEOUT)') os.killpg(0, signal.SIGKILL) except Exception: print("Aborting due to error:" + str(filename)) os.killpg(0, signal.SIGKILL) for func in result: DatabaseFactory.insert_in_db(db, pool_sem, result[func], filename, func, instruction_converter) analyzer.close() return 0 # Create the db where data are stored def create_db(self): print('Database creation...') conn = sqlite3.connect(self.db_name) conn.execute(''' CREATE TABLE IF NOT EXISTS functions (id INTEGER PRIMARY KEY, project text, compiler text, optimization text, file_name text, function_name text, asm text, num_instructions INTEGER) ''') conn.execute('''CREATE TABLE IF NOT EXISTS filtered_functions (id INTEGER PRIMARY KEY, instructions_list text) ''') conn.commit() conn.close() # Scan the root directory to find all the file to analyze, # query also the db for already analyzed files. def scan_for_file(self, start): file_list = [] # Scan recursively all the subdirectory directories = os.listdir(start) for item in directories: item = os.path.join(start,item) if os.path.isdir(item): file_list.extend(self.scan_for_file(item + os.sep)) elif os.path.isfile(item) and item.endswith('.o'): file_list.append(item) return file_list # Looks for already existing files in the database # It returns a list of files that are not in the database def remove_override(self, file_list): conn = sqlite3.connect(self.db_name) cur = conn.cursor() q = cur.execute('''SELECT project, compiler, optimization, file_name FROM functions''') names = q.fetchall() names = [os.path.join(self.root_path, n[0], n[1], n[2], n[3]) for n in names] names = set(names) # If some files is already in the db remove it from the file list if len(names) > 0: print(str(len(names)) + ' Already in the database') cleaned_file_list = [] for f in file_list: if not(f in names): cleaned_file_list.append(f) return cleaned_file_list # root function to create the db def build_db(self, use_symbol, depth): global pool_sem pool_sem = multiprocessing.BoundedSemaphore(value=1) instruction_converter = InstructionsConverter("data/i2v/word2id.json") self.create_db() file_list = self.scan_for_file(self.root_path) print('Found ' + str(len(file_list)) + ' during the scan') file_list = self.remove_override(file_list) print('Find ' + str(len(file_list)) + ' files to analyze') random.shuffle(file_list) t_args = [(f, self.db_name, use_symbol, depth, instruction_converter) for f in file_list] # Start a parallel pool to analyze files p = Pool(processes=None, maxtasksperchild=20) for _ in tqdm(p.imap_unordered(DatabaseFactory.worker, t_args), total=len(file_list)): pass p.close() p.join() ================================================ FILE: dataset_creation/ExperimentUtil.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import argparse from dataset_creation import DatabaseFactory, DataSplitter, FunctionsEmbedder from utils.utils import print_safe def debug_msg(): msg = "SAFE DATABASE UTILITY" msg += "-------------------------------------------------\n" msg += "This program is an utility to save data into an sqlite database with SAFE \n\n" msg += "There are three main command: \n" msg += "BUILD: It create a db with two tables: 
functions, filtered_functions. \n" msg += " In the first table there are all the functions extracted from the executable with their hex code.\n" msg += " In the second table functions are converted to i2v representation. \n" msg += "SPLIT: Data are splitted into train validation and test set. " \ " Then it generate the pairs for the training of the network.\n" msg += "EMBEDD: Generate the embeddings of each function in the database using a trained SAFE model\n\n" msg += "If you want to train the network use build + split" msg += "If you want to create a knowledge base for the binary code search engine use build + embedd" msg += "This program has been written by the SAFE team.\n" msg += "-------------------------------------------------" return msg def build_configuration(db_name, root_dir, use_symbols, callee_depth): msg = "Database creation options: \n" msg += " - Database Name: {} \n".format(db_name) msg += " - Root dir: {} \n".format(root_dir) msg += " - Use symbols: {} \n".format(use_symbols) msg += " - Callee depth: {} \n".format(callee_depth) return msg def split_configuration(db_name, val_split, test_split, epochs): msg = "Splitting options: \n" msg += " - Database Name: {} \n".format(db_name) msg += " - Validation Size: {} \n".format(val_split) msg += " - Test Size: {} \n".format(test_split) msg += " - Epochs: {} \n".format(epochs) return msg def embedd_configuration(db_name, model, batch_size, max_instruction, embeddings_table): msg = "Embedding options: \n" msg += " - Database Name: {} \n".format(db_name) msg += " - Model: {} \n".format(model) msg += " - Batch Size: {} \n".format(batch_size) msg += " - Max Instruction per function: {} \n".format(max_instruction) msg += " - Table for saving embeddings: {}.".format(embeddings_table) return msg if __name__ == '__main__': print_safe() parser = argparse.ArgumentParser(description=debug_msg) parser.add_argument("-db", "--db", help="Name of the database to create", required=True) parser.add_argument("-b", "--build", help="Build db disassebling executables", action="store_true") parser.add_argument("-s", "--split", help="Perform data splitting for training", action="store_true") parser.add_argument("-e", "--embed", help="Compute functions embedding", action="store_true") parser.add_argument("-dir", "--dir", help="Root path of the directory to scan") parser.add_argument("-sym", "--symbols", help="Use it if you want to use symbols", action="store_true") parser.add_argument("-dep", "--depth", help="Recursive depth for analysis", default=0, type=int) parser.add_argument("-test", "--test_size", help="Test set size [0-1]", type=float, default=0.2) parser.add_argument("-val", "--val_size", help="Validation set size [0-1]", type=float, default=0.2) parser.add_argument("-epo", "--epochs", help="# Epochs to generate pairs for", type=int, default=25) parser.add_argument("-mod", "--model", help="Model for embedding generation") parser.add_argument("-bat", "--batch_size", help="Batch size for function embeddings", type=int, default=500) parser.add_argument("-max", "--max_instruction", help="Maximum instruction per function", type=int, default=150) parser.add_argument("-etb", "--embeddings_table", help="Name for the table that contains embeddings", default="safe_embeddings") try: args = parser.parse_args() except: parser.print_help() print(debug_msg()) exit(0) if args.build: print("Disassemblying files and creating dataset") print(build_configuration(args.db, args.dir, args.symbols, args.depth)) factory = DatabaseFactory.DatabaseFactory(args.db, 
args.dir) factory.build_db(args.symbols, args.depth) if args.split: print("Splitting data and generating epoch pairs") print(split_configuration(args.db, args.val_size, args.test_size, args.epochs)) splitter = DataSplitter.DataSplitter(args.db) splitter.split_data(args.val_size, args.test_size) splitter.create_pairs(args.epochs) if args.embed: print("Computing embeddings for function in db") print(embedd_configuration(args.db, args.model, args.batch_size, args.max_instruction, args.embeddings_table)) embedder = FunctionsEmbedder.FunctionsEmbedder(args.model, args.batch_size, args.max_instruction) embedder.compute_and_save_embeddings_from_db(args.db, args.embeddings_table) exit(0) ================================================ FILE: dataset_creation/FunctionsEmbedder.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni from asm_embedding.FunctionNormalizer import FunctionNormalizer import json from neural_network.SAFEEmbedder import SAFEEmbedder import numpy as np import sqlite3 from tqdm import tqdm class FunctionsEmbedder: def __init__(self, model, batch_size, max_instruction): self.batch_size = batch_size self.normalizer = FunctionNormalizer(max_instruction) self.safe = SAFEEmbedder(model) self.safe.loadmodel() self.safe.get_tensor() def compute_embeddings(self, functions): functions, lenghts = self.normalizer.normalize_functions(functions) embeddings = self.safe.embedd(functions, lenghts) return embeddings @staticmethod def create_table(db_name, table_name): conn = sqlite3.connect(db_name) c = conn.cursor() c.execute("CREATE TABLE IF NOT EXISTS {} (id INTEGER PRIMARY KEY, {} TEXT)".format(table_name, table_name)) conn.commit() conn.close() def compute_and_save_embeddings_from_db(self, db_name, table_name): FunctionsEmbedder.create_table(db_name, table_name) conn = sqlite3.connect(db_name) cur = conn.cursor() q = cur.execute("SELECT id FROM functions WHERE id not in (SELECT id from {})".format(table_name)) ids = q.fetchall() for i in tqdm(range(0, len(ids), self.batch_size)): functions = [] batch_ids = ids[i:i+self.batch_size] for my_id in batch_ids: q = cur.execute("SELECT instructions_list FROM filtered_functions where id=?", my_id) functions.append(json.loads(q.fetchone()[0])) embeddings = self.compute_embeddings(functions) for l, id in enumerate(batch_ids): cur.execute("INSERT INTO {} VALUES (?,?)".format(table_name), (id[0], np.array2string(embeddings[l]))) conn.commit() ================================================ FILE: dataset_creation/__init__.py ================================================ ================================================ FILE: dataset_creation/convertDB.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import sqlite3 import json from networkx.readwrite import json_graph import logging from tqdm import tqdm from asm_embedding.InstructionsConverter import InstructionsConverter # Create the db where data are stored def create_db(db_name): print('Database creation...') conn = sqlite3.connect(db_name) conn.execute(''' CREATE TABLE IF NOT EXISTS functions (id INTEGER PRIMARY KEY, project text, compiler text, optimization text, file_name text, function_name text, asm text, num_instructions INTEGER) ''') conn.execute('''CREATE TABLE IF NOT EXISTS filtered_functions (id INTEGER PRIMARY KEY, instructions_list 
text) ''') conn.commit() conn.close() def reverse_graph(cfg, lstm_cfg): instructions = [] asm = "" node_addr = list(cfg.nodes()) node_addr.sort() nodes = cfg.nodes(data=True) lstm_nodes = lstm_cfg.nodes(data=True) for addr in node_addr: a = nodes[addr]["asm"] if a is not None: asm += a instructions.extend(lstm_nodes[addr]['features']) return instructions, asm def copy_split(old_cur, new_cur, table): q = old_cur.execute("SELECT id FROM {}".format(table)) iii = q.fetchall() print("Copying table {}".format(table)) for ii in tqdm(iii): new_cur.execute("INSERT INTO {} VALUES (?)".format(table), ii) def copy_table(old_cur, new_cur, table_old, table_new): q = old_cur.execute("SELECT * FROM {}".format(table_old)) iii = q.fetchall() print("Copying table {} to {}".format(table_old, table_new)) for ii in tqdm(iii): new_cur.execute("INSERT INTO {} VALUES (?,?,?)".format(table_new), ii) logger = logging.getLogger() logger.setLevel(logging.DEBUG) db = "/home/lucamassarelli/binary_similarity_data/databases/big_dataset_X86.db" new_db = "/home/lucamassarelli/binary_similarity_data/new_databases/big_dataset_X86_new.db" create_db(new_db) conn_old = sqlite3.connect(db) conn_new = sqlite3.connect(new_db) cur_old = conn_old.cursor() cur_new = conn_new.cursor() q = cur_old.execute("SELECT id FROM functions") ids = q.fetchall() converter = InstructionsConverter() for my_id in tqdm(ids): q0 = cur_old.execute("SELECT id, project, compiler, optimization, file_name, function_name, cfg FROM functions WHERE id=?", my_id) meta = q.fetchone() q1 = cur_old.execute("SELECT lstm_cfg FROM lstm_cfg WHERE id=?", my_id) cfg = json_graph.adjacency_graph(json.loads(meta[6])) lstm_cfg = json_graph.adjacency_graph(json.loads(q1.fetchone()[0])) instructions, asm = reverse_graph(cfg, lstm_cfg) values = meta[0:6] + (asm, len(instructions)) q_n = cur_new.execute("INSERT INTO functions VALUES (?,?,?,?,?,?,?,?)", values) converted_instruction = json.dumps(converter.convert_to_ids(instructions)) q_n = cur_new.execute("INSERT INTO filtered_functions VALUES (?,?)", (my_id[0], converted_instruction)) conn_new.commit() cur_new.execute("CREATE TABLE train (id INTEGER PRIMARY KEY) ") cur_new.execute("CREATE TABLE validation (id INTEGER PRIMARY KEY) ") cur_new.execute("CREATE TABLE test (id INTEGER PRIMARY KEY) ") conn_new.commit() copy_split(cur_old, cur_new, "train") conn_new.commit() copy_split(cur_old, cur_new, "validation") conn_new.commit() copy_split(cur_old, cur_new, "test") conn_new.commit() cur_new.execute("CREATE TABLE train_pairs (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)") cur_new.execute("CREATE TABLE validation_pairs (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)") cur_new.execute("CREATE TABLE test_pairs (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)") conn_new.commit() copy_table(cur_old, cur_new, "train_couples", "train_pairs") conn_new.commit() copy_table(cur_old, cur_new, "validation_couples", "validation_pairs") conn_new.commit() copy_table(cur_old, cur_new, "test_couples", "test_pairs") conn_new.commit() conn_new.close() ================================================ FILE: download_model.sh ================================================ #!/usr/bin/env bash python3 downloader.py -b echo 'Model downloaded and, hopefully, ready to run' ================================================ FILE: downloader.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import 
argparse import os import sys from subprocess import call class Downloader: def __init__(self): parser = argparse.ArgumentParser(description='SAFE downloader') parser.add_argument("-m", "--model", dest="model", help="Download the trained SAFE model for x86", action="store_true", required=False) parser.add_argument("-i2v", "--i2v", dest="i2v", help="Download the i2v dictionary and embedding matrix", action="store_true", required=False) parser.add_argument("-b", "--bundle", dest="bundle", help="Download all files necessary to run the model", action="store_true", required=False) parser.add_argument("-td", "--train_data", dest="train_data", help="Download the files necessary to train the model (It takes a lot of space!)", action="store_true", required=False) args = parser.parse_args() self.download_model = (args.model or args.bundle) self.download_i2v = (args.i2v or args.bundle) self.download_train = args.train_data if not (self.download_model or self.download_i2v or self.download_train): parser.print_help(sys.__stdout__) self.url_model = "https://drive.google.com/file/d/1Kwl8Jy-g9DXe1AUjUZDhJpjRlDkB4NBs/view?usp=sharing" self.url_i2v = "https://drive.google.com/file/d/1CqJVGYbLDEuJmJV6KH4Dzzhy-G12GjGP" self.url_train = ['https://drive.google.com/file/d/1sNahtLTfZY5cxPaYDUjqkPTK0naZ45SH/view?usp=sharing','https://drive.google.com/file/d/16D5AVDux_Q8pCVIyvaMuiL2cw2V6gtLc/view?usp=sharing','https://drive.google.com/file/d/1cBRda8fYdqHtzLwstViuwK6U5IVHad1N/view?usp=sharing'] self.train_name = ['AMD64ARMOpenSSL.tar.bz2','AMD64multipleCompilers.tar.bz2','AMD64PostgreSQL.tar.bz2'] self.base_path = "data" self.path_i2v = os.path.join(self.base_path, "") self.path_model = os.path.join(self.base_path, "") self.path_train_data = os.path.join(self.base_path, "") self.i2v_compress_name='i2v.tar.bz2' self.model_compress_name='model.tar.bz2' self.datasets_compress_name='safe.pb' @staticmethod def download_file(id,path): try: print("Downloading from "+ str(id) +" into "+str(path)) call(['./godown.pl',id,path]) except Exception as e: print("Error downloading file at url:" + str(id)) print(e) @staticmethod def decompress_file(file_src,file_path): try: call(['tar','-xvf',file_src,'-C',file_path]) except Exception as e: print("Error decompressing file:" + str(file_src)) print('you need tar command e b2zip support') print(e) def download(self): print('Making the godown.pl script executable, thanks:'+str('https://github.com/circulosmeos/gdown.pl')) call(['chmod', '+x','godown.pl']) print("SAFE --- downloading models") if self.download_i2v: print("Downloading i2v model.... in the folder data/i2v/") if not os.path.exists(self.path_i2v): os.makedirs(self.path_i2v) Downloader.download_file(self.url_i2v, os.path.join(self.path_i2v,self.i2v_compress_name)) print("Decompressing i2v model and placing in" + str(self.path_i2v)) Downloader.decompress_file(os.path.join(self.path_i2v,self.i2v_compress_name),self.path_i2v) if self.download_model: print("Downloading the SAFE model... in the folder data") if not os.path.exists(self.path_model): os.makedirs(self.path_model) Downloader.download_file(self.url_model, os.path.join(self.path_model,self.datasets_compress_name)) #print("Decompressing SAFE model and placing in" + str(self.path_model)) #Downloader.decompress_file(os.path.join(self.path_model,self.model_compress_name),self.path_model) if self.download_train: print("Downloading the train data.... 
in the folder data") if not os.path.exists(self.path_train_data): os.makedirs(self.path_train_data) for i,x in enumerate(self.url_train): print("Downloading dataset "+str(self.train_name[i])) Downloader.download_file(x, os.path.join(self.path_train_data,self.train_name[i])) #print("Decompressing the train data and placing in" + str(self.path_train_data)) #Downloader.decompress_file(os.path.join(self.path_train_data,self.datasets_compress_name),self.path_train_data) if __name__=='__main__': a=Downloader() a.download() ================================================ FILE: function_search/EvaluateSearchEngine.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni from FunctionSearchEngine import FunctionSearchEngine from sklearn import metrics import sqlite3 from multiprocessing import Process import math import warnings import random import json class SearchEngineEvaluator: def __init__(self, db_name, table, limit=None,k=None): self.tables = table self.db_name = db_name self.SE = FunctionSearchEngine(db_name, table, limit=limit) self.k=k self.number_similar={} def do_search(self, target_db_name, target_fcn_ids): self.SE.load_target(target_db_name, target_fcn_ids) self.SE.pp_search(50) def calc_auc(self, target_db_name, target_fcn_ids): self.SE.load_target(target_db_name, target_fcn_ids) result = self.SE.auc() print(result) # # This methods searches for all target function in the DB, in our test we take num functions compiled with compiler and opt # moreover it populates the self.number_similar dictionary, that contains the number of similar function for each target # def find_target_fcn(self, compiler, opt, num): conn = sqlite3.connect(self.db_name) cur = conn.cursor() q = cur.execute("SELECT id, project, file_name, function_name FROM functions WHERE compiler=? AND optimization=?", (compiler, opt)) res = q.fetchall() ids = [i[0] for i in res] true_labels = [l[1]+"/"+l[2]+"/"+l[3] for l in res] n_ids = [] n_true_labels = [] num = min(num, len(ids)) for i in range(0, num): index = random.randrange(len(ids)) n_ids.append(ids[index]) n_true_labels.append(true_labels[index]) f_name=true_labels[index].split('/')[2] fi_name=true_labels[index].split('/')[1] q = cur.execute("SELECT num FROM count_func WHERE file_name='{}' and function_name='{}'".format(fi_name,f_name)) f = q.fetchone() if f is not None: num=int(f[0]) else: num = 0 self.number_similar[true_labels[index]]=num return n_ids, n_true_labels @staticmethod def functions_ground_truth(labels, indices, values, true_label): y_true = [] y_score = [] for i, e in enumerate(indices): y_score.append(float(values[i])) l = labels[e] if l == true_label: y_true.append(1) else: y_true.append(0) return y_true, y_score # this methos execute the test # it select the targets functions and it looks up for the targets in the entire db # the outcome is json file containing the top 200 similar for each target function. # the json file is an array and such array contains an entry for each target function # each entry is a triple (t0,t1,t2) # t0: an array that contains 1 at entry j if the entry j is similar to the target 0 otherwise # t1: the number of similar functions to the target in the whole db # t2: an array that at entry j contains the similarity score of the j-th most similar function to the target. 
# # def evaluate_precision_on_all_functions(self, compiler, opt): target_fcn_ids, true_labels = self.find_target_fcn(compiler, opt, 10000) batch = 1000 labels = self.SE.trunc_labels info=[] for i in range(0, len(target_fcn_ids), batch): if i + batch > len(target_fcn_ids): batch = len(target_fcn_ids) - i target = self.SE.load_target(self.db_name, target_fcn_ids[i:i+batch]) top_k = self.SE.top_k(target, self.k) for j in range(0, batch): a, b = SearchEngineEvaluator.functions_ground_truth(labels, top_k.indices[j, :], top_k.values[j, :], true_labels[i+j]) info.append((a,self.number_similar[true_labels[i + j]],b)) with open(compiler+'_'+opt+'_'+self.tables+'_top200.json', 'w') as outfile: json.dump(info, outfile) def test(dbName, table, opt,x,k): print("k:{} - Table: {} - Opt: {}".format(k,table, opt)) SEV = SearchEngineEvaluator(dbName, table, limit=2000000,k=k) SEV.evaluate_precision_on_all_functions(x, opt) print("-------------------------------------") if __name__ == '__main__': random.seed(12345) dbName = '../data/AMD64PostgreSQL.db' table = ['safe_embeddings'] opt = ["O0", "O1", "O2", "O3"] for x in ['gcc-4.8',"clang-4.0",'gcc-7','clang-6.0']: for t in table: for o in opt: p = Process(target=test, args=(dbName, t, o,x,200)) p.start() p.join() ================================================ FILE: function_search/FunctionSearchEngine.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import sys import numpy as np import sqlite3 import pandas as pd import tqdm import tensorflow as tf if sys.version_info >= (3, 0): from functools import reduce pd.set_option('display.max_column',None) pd.set_option('display.max_rows',None) pd.set_option('display.max_seq_items',None) pd.set_option('display.max_colwidth', 500) pd.set_option('expand_frame_repr', True) class TopK: # # This class computes the similarities between the targets and the list of functions on which we are searching. 
# This is done by using matrices multiplication and top_k of tensorflow def __init__(self): self.graph=tf.Graph() nop=0 def loads_embeddings_SE(self, lista_embeddings): with self.graph.as_default(): tf.set_random_seed(1234) dim = lista_embeddings[0].shape[0] ll = np.asarray(lista_embeddings) self.matrix = tf.constant(ll, name='matrix_embeddings', dtype=tf.float32) self.target = tf.placeholder("float", [None, dim], name='target_embedding') self.sim = tf.matmul(self.target, self.matrix, transpose_b=True, name="embeddings_similarities") self.k = tf.placeholder(tf.int32, shape=(), name='k') self.top_k = tf.nn.top_k(self.sim, self.k, sorted=True) self.session = tf.Session() def topK(self, k, target): with self.graph.as_default(): tf.set_random_seed(1234) return self.session.run(self.top_k, {self.target: target, self.k: int(k)}) class FunctionSearchEngine: def __init__(self, db_name, table_name, limit=None): self.s2v = TopK() self.db_name = db_name self.table_name = table_name self.labels = [] self.trunc_labels = [] self.lista_embedding = [] self.ids = [] self.n_similar=[] self.ret = {} self.precision = None print("Query for ids") conn = sqlite3.connect(db_name) cur = conn.cursor() if limit is None: q = cur.execute("SELECT id, project, compiler, optimization, file_name, function_name FROM functions") res = q.fetchall() else: q = cur.execute("SELECT id, project, compiler, optimization, file_name, function_name FROM functions LIMIT {}".format(limit)) res = q.fetchall() for item in tqdm.tqdm(res, total=len(res)): q = cur.execute("SELECT " + self.table_name + " FROM " + self.table_name + " WHERE id=?", (item[0],)) e = q.fetchone() if e is None: continue self.lista_embedding.append(self.embeddingToNp(e[0])) element = "{}/{}/{}".format(item[1], item[4], item[5]) self.trunc_labels.append(element) element = "{}@{}/{}/{}/{}".format(item[5], item[1], item[2], item[3], item[4]) self.labels.append(element) self.ids.append(item[0]) conn.close() self.s2v.loads_embeddings_SE(self.lista_embedding) self.num_funcs = len(self.lista_embedding) def load_target(self, target_db_name, target_fcn_ids, calc_mean=False): conn = sqlite3.connect(target_db_name) cur = conn.cursor() mean = None for id in target_fcn_ids: if target_db_name == self.db_name and id in self.ids: idx = self.ids.index(id) e = self.lista_embedding[idx] else: q = cur.execute("SELECT " + self.table_name + " FROM " + self.table_name + " WHERE id=?", (id,)) e = q.fetchone() e = self.embeddingToNp(e[0]) if mean is None: mean = e.reshape([e.shape[0], 1]) else: mean = np.hstack((mean, e.reshape(e.shape[0], 1))) if calc_mean: target = [np.mean(mean, axis=1)] else: target = mean.T return target def embeddingToNp(self, e): e = e.replace('\n', '') e = e.replace('[', '') e = e.replace(']', '') emb = np.fromstring(e, dtype=float, sep=' ') return emb def top_k(self, target, k=None): if k is not None: top_k = self.s2v.topK(k, target) else: top_k = self.s2v.topK(len(self.lista_embedding), target) return top_k def pp_search(self, k): result = pd.DataFrame(columns=['Id', 'Name', 'Score']) top_k = self.s2v.topK(k) for i, e in enumerate(top_k.indices[0]): result = result.append({'Id': self.ids[e], 'Name': self.labels[e], 'Score': top_k.values[0][i]}, ignore_index=True) print(result) def search(self, k): result = [] top_k = self.s2v.topK(k) for i, e in enumerate(top_k.indices[0]): result = result.append({'Id': self.ids[e], 'Name': self.labels[e], 'Score': top_k.values[0][i]}) return result ================================================ FILE: function_search/__init__.py 
================================================ ================================================ FILE: function_search/fromJsonSearchToPlot.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni import matplotlib.pyplot as plt import json import math import numpy as np from multiprocessing import Pool def find_dcg(element_list): dcg_score = 0.0 for j, sim in enumerate(element_list): dcg_score += float(sim) / math.log(j + 2) return dcg_score def count_ones(element_list): return len([x for x in element_list if x == 1]) def extract_info(file_1): with open(file_1, 'r') as f: data1 = json.load(f) performance1 = [] average_recall_k1 = [] precision_at_k1 = [] for f_index in range(0, len(data1)): f1 = data1[f_index][0] pf1 = data1[f_index][1] tp1 = [] recall_p1 = [] precision_p1 = [] # we start from 1 to remove ourselves for k in range(1, 200): cut1 = f1[0:k] dcg1 = find_dcg(cut1) ideal1 = find_dcg(([1] * (pf1) + [0] * (k - pf1))[0:k]) p1k = float(count_ones(cut1)) tp1.append(dcg1 / ideal1) recall_p1.append(p1k / pf1) precision_p1.append(p1k / k) performance1.append(tp1) average_recall_k1.append(recall_p1) precision_at_k1.append(precision_p1) avg_p1 = np.average(performance1, axis=0) avg_p10 = np.average(average_recall_k1, axis=0) average_precision = np.average(precision_at_k1, axis=0) return avg_p1, avg_p10, average_precision def print_graph(info1, file_name, label_y, title_1, p): fig, ax = plt.subplots() ax.plot(range(0, len(info1)), info1, color='b', label=title_1) ax.legend(loc=p, shadow=True, fontsize='x-large') plt.xlabel("Number of Nearest Results") plt.ylabel(label_y) fname = file_name plt.savefig(fname) plt.close(fname) def compare_and_print(file): filename = file.split('_')[0] + '_' + file.split('_')[1] t_short = filename label_1 = t_short + '_' + file.split('_')[3] avg_p1, recall_p1, precision1 = extract_info(file) fname = filename + '_nDCG.pdf' print_graph(avg_p1, fname, 'nDCG', label_1, 'upper right') fname = filename + '_recall.pdf' print_graph(recall_p1, fname, 'Recall', label_1, 'lower right') fname = filename + '_precision.pdf' print_graph(precision1, fname, 'Precision', label_1, 'upper right') return avg_p1, recall_p1, precision1 e1 = 'embeddings_safe' opt = ['O0', 'O1', 'O2', 'O3'] compilers = ['gcc-7', 'gcc-4.8', 'clang-6.0', 'clang-4.0'] values = [] for o in opt: for c in compilers: f0 = '' + c + '_' + o + '_' + e1 + '_top200.json' values.append(f0) p = Pool(4) result = p.map(compare_and_print, values) avg_p1 = [] recal_p1 = [] pre_p1 = [] avg_p2 = [] recal_p2 = [] pre_p2 = [] for t in result: avg_p1.append(t[0]) recal_p1.append(t[1]) pre_p1.append(t[2]) avg_p1 = np.average(avg_p1, axis=0) recal_p1 = np.average(recal_p1, axis=0) pre_p1 = np.average(pre_p1, axis=0) print_graph(avg_p1[0:20], 'nDCG.pdf', 'normalized DCG', 'SAFE', 'upper right') print_graph(recal_p1, 'recall.pdf', 'recall', 'SAFE', 'lower right') print_graph(pre_p1[0:20], 'precision.pdf', 'precision', 'SAFE', 'upper right') ================================================ FILE: godown.pl ================================================ #!/usr/bin/env perl # # Google Drive direct download of big files # ./gdown.pl 'gdrive file url' ['desired file name'] # # v1.0 by circulosmeos 04-2014. # v1.1 by circulosmeos 01-2017. 
# http://circulosmeos.wordpress.com/2014/04/12/google-drive-direct-download-of-big-files # Distributed under GPL 3 (http://www.gnu.org/licenses/gpl-3.0.html) # use strict; my $TEMP='gdown.cookie.temp'; my $COMMAND; my $confirm; my $check; sub execute_command(); my $URL=shift; die "\n./gdown.pl 'gdrive file url' [desired file name]\n\n" if $URL eq ''; my $FILENAME=shift; $FILENAME='gdown' if $FILENAME eq ''; if ($URL=~m#^https?://drive.google.com/file/d/([^/]+)#) { $URL="https://docs.google.com/uc?id=$1&export=download"; } execute_command(); while (-s $FILENAME < 100000) { # only if the file isn't the download yet open fFILENAME, '<', $FILENAME; $check=0; foreach () { if (/href="(\/uc\?export=download[^"]+)/) { $URL='https://docs.google.com'.$1; $URL=~s/&/&/g; $confirm=''; $check=1; last; } if (/confirm=([^;&]+)/) { $confirm=$1; $check=1; last; } if (/"downloadUrl":"([^"]+)/) { $URL=$1; $URL=~s/\\u003d/=/g; $URL=~s/\\u0026/&/g; $confirm=''; $check=1; last; } } close fFILENAME; die "Couldn't download the file :-(\n" if ($check==0); $URL=~s/confirm=([^;&]+)/confirm=$confirm/ if $confirm ne ''; execute_command(); } unlink $TEMP; sub execute_command() { $COMMAND="wget --no-check-certificate --load-cookie $TEMP --save-cookie $TEMP \"$URL\""; $COMMAND.=" -O \"$FILENAME\"" if $FILENAME ne ''; `$COMMAND`; return 1; } ================================================ FILE: helloworld.c ================================================ #include "stdio.h" int main(){ printf("hello world"); int a=10; int b=20; printf("%d",a+b); } ================================================ FILE: index.md ================================================ --- # Feel free to add content and custom Front Matter to this file. # To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults layout: home ---
What is SAFE?
-------------
**SAFE** is a **S**elf-**A**ttentive neural network that takes as input a binary **F**unction and creates an **E**mbedding.

What is an embedding?
-------------
An embedding is a vector of real numbers. The nice feature of SAFE embeddings is that two similar binary functions should generate two embeddings that are close in the metric space.
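"Close in the metric space" in practice means a high cosine similarity between the two vectors. A minimal sketch (the vectors below are random stand-ins for real SAFE embeddings, whose size depends on the trained model):

```
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-ins for two SAFE embeddings; real ones come from the model.
embedding_x = np.random.rand(1, 100)
embedding_y = np.random.rand(1, 100)

# Similar functions should score close to 1, unrelated ones much lower.
similarity = cosine_similarity(embedding_x, embedding_y)[0][0]
print(similarity)
```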
I want to know all the details!
-------------
Good, read our paper on [arXiv](https://arxiv.org/abs/1811.05296). The paper is slightly amusing!

How do I get SAFE?
-------------
SAFE is available in our [GitHub](https://github.com/gadiluna/SAFE) repository. Keep in mind that SAFE has been developed as a research project. We only provide a minimal working proof-of-concept, with the code and data to replicate our experiments. We are not responsible for any self-harm episode correlated with reading our (sometimes badly written) code.

How can I get involved with SAFE?
-------------
If you are interested in this project, write us an email.

-------------

SAFE has been designed and developed by:
* [Luca Massarelli](https://scholar.google.it/citations?user=mJ_QjZIAAAAJ&hl=it) (development and research)
* [Giuseppe Antonio Di Luna](https://scholar.google.it/citations?hl=it&user=RgAfuVgAAAAJ&view_op=list_works&sortby=pubdate) (development and research)
* [Fabio Petroni](https://scholar.google.it/citations?user=vxQc2L4AAAAJ&hl=it) (development and research)
* [Leonardo Querzoni](https://scholar.google.it/citations?user=-_WFIJIAAAAJ&hl=it) (research)
* [Roberto Baldoni](https://scholar.google.it/citations?user=82tR6VoAAAAJ&hl=it) (research) #### **Acknowledgments**: We are in debt with Google for providing free access to its cloud computing platform through the Education Program. Moreover, the authors would like to thank NVIDIA Corporation for partially supporting this work through the donation of a GPGPU card used during prototype development. This work is supported by a grant of the Italian Presidency of the Council of Ministers and by the CINI (Consorzio Interuniversitario Nazionale Informatica) National Laboratory of Cyber Security. Finally, we thank Davide Italiano for the insightful discussions. SAFE License. ------- # SAFE TEAM # GPL 3 License http://www.gnu.org/licenses/ ================================================ FILE: neural_network/PairFactory.py ================================================ # SAFE TEAM # distributed under license: GPL 3 License http://www.gnu.org/licenses/ import sqlite3 import json import numpy as np from multiprocessing import Queue from multiprocessing import Process from asm_embedding.FunctionNormalizer import FunctionNormalizer # # PairFactory class, used for training the SAFE network. # This class generates the pairs for training, test and validation # # # Authors: SAFE team class PairFactory: def __init__(self, db_name, dataset_type, batch_size, max_instructions, shuffle=True): self.db_name = db_name self.dataset_type = dataset_type self.max_instructions = max_instructions self.batch_dim = 0 self.num_pairs = 0 self.num_batches = 0 self.batch_size = batch_size conn = sqlite3.connect(self.db_name) cur = conn.cursor() q = cur.execute("SELECT true_pair from " + self.dataset_type + " WHERE id=?", (0,)) self.num_pairs=len(json.loads(q.fetchone()[0]))*2 n_chunk = int(self.num_pairs / self.batch_size) - 1 conn.close() self.num_batches = n_chunk self.shuffle = shuffle @staticmethod def split( a, n): return [a[i::n] for i in range(n)] @staticmethod def truncate_and_compute_lengths(pairs, max_instructions): lenghts = [] new_pairs=[] for x in pairs: f0 = np.asarray(x[0][0:max_instructions]) f1 = np.asarray(x[1][0:max_instructions]) lenghts.append((f0.shape[0], f1.shape[0])) if f0.shape[0] < max_instructions: f0 = np.pad(f0, (0, max_instructions - f0.shape[0]), mode='constant') if f1.shape[0] < max_instructions: f1 = np.pad(f1, (0, max_instructions - f1.shape[0]), mode='constant') new_pairs.append((f0, f1)) return new_pairs, lenghts def async_chunker(self, epoch): conn = sqlite3.connect(self.db_name) cur = conn.cursor() query_string = "SELECT true_pair,false_pair from {} where id=?".format(self.dataset_type) q = cur.execute(query_string, (int(epoch),)) true_pairs_id, false_pairs_id = q.fetchone() true_pairs_id = json.loads(true_pairs_id) false_pairs_id = json.loads(false_pairs_id) assert len(true_pairs_id) == len(false_pairs_id) data_len = len(true_pairs_id) # print("Data Len: " + str(data_len)) conn.close() n_chunk = int(data_len / (self.batch_size / 2)) - 1 lista_chunk = range(0, n_chunk) coda = Queue(maxsize=50) n_proc = 8 # modify this to increase the parallelism for the db loading, from our thest 8-10 is the sweet spot on a 16 cores machine with K80 listone = PairFactory.split(lista_chunk, n_proc) # this ugly workaround is somehow needed, Pool is working oddly when TF is loaded. 
for i in range(0, n_proc): p = Process(target=self.async_create_couple, args=((epoch, listone[i], coda))) p.start() for i in range(0, n_chunk): yield self.async_get_dataset(coda) def get_pair_fromdb(self, id_1, id_2): conn = sqlite3.connect(self.db_name) cur = conn.cursor() q0 = cur.execute("SELECT instructions_list FROM filtered_functions WHERE id=?", (id_1,)) f0 = json.loads(q0.fetchone()[0]) q1 = cur.execute("SELECT instructions_list FROM filtered_functions WHERE id=?", (id_2,)) f1 = json.loads(q1.fetchone()[0]) conn.close() return f0, f1 def get_couple_from_db(self, epoch_number, chunk): conn = sqlite3.connect(self.db_name) cur = conn.cursor() pairs = [] labels = [] q = cur.execute("SELECT true_pair, false_pair from " + self.dataset_type + " WHERE id=?", (int(epoch_number),)) true_pairs_id, false_pairs_id = q.fetchone() true_pairs_id = json.loads(true_pairs_id) false_pairs_id = json.loads(false_pairs_id) conn.close() data_len = len(true_pairs_id) i = 0 normalizer = FunctionNormalizer(self.max_instructions) while i < self.batch_size: if chunk * int(self.batch_size / 2) + i > data_len: break p = true_pairs_id[chunk * int(self.batch_size / 2) + i] f0, f1 = self.get_pair_fromdb(p[0], p[1]) pairs.append((f0, f1)) labels.append(+1) p = false_pairs_id[chunk * int(self.batch_size / 2) + i] f0, f1 = self.get_pair_fromdb(p[0], p[1]) pairs.append((f0, f1)) labels.append(-1) i += 2 pairs, lengths = normalizer.normalize_function_pairs(pairs) function1, function2 = zip(*pairs) len1, len2 = zip(*lengths) n_samples = len(pairs) if self.shuffle: shuffle_indices = np.random.permutation(np.arange(n_samples)) function1 = np.array(function1)[shuffle_indices] function2 = np.array(function2)[shuffle_indices] len1 = np.array(len1)[shuffle_indices] len2 = np.array(len2)[shuffle_indices] labels = np.array(labels)[shuffle_indices] else: function1=np.array(function1) function2=np.array(function2) len1=np.array(len1) len2=np.array(len2) labels=np.array(labels) upper_bound = min(self.batch_size, n_samples) len1 = len1[0:upper_bound] len2 = len2[0:upper_bound] function1 = function1[0:upper_bound] function2 = function2[0:upper_bound] y_ = labels[0:upper_bound] return function1, function2, len1, len2, y_ def async_create_couple(self, epoch,n_chunk,q): for i in n_chunk: function1, function2, len1, len2, y_ = self.get_couple_from_db(epoch, i) q.put((function1, function2, len1, len2, y_), block=True) def async_get_dataset(self, q): item = q.get() function1 = item[0] function2 = item[1] len1 = item[2] len2 = item[3] y_ = item[4] assert (len(function1) == len(y_)) n_samples = len(function1) self.batch_dim = n_samples #self.num_pairs += n_samples return function1, function2, len1, len2, y_ ================================================ FILE: neural_network/SAFEEmbedder.py ================================================ import tensorflow as tf # SAFE TEAM # distributed under license: GPL 3 License http://www.gnu.org/licenses/ class SAFEEmbedder: def __init__(self, model_file): self.model_file = model_file self.session = None self.x_1 = None self.adj_1 = None self.len_1 = None self.emb = None def loadmodel(self): with tf.gfile.GFile(self.model_file, "rb") as f: graph_def = tf.GraphDef() graph_def.ParseFromString(f.read()) with tf.Graph().as_default() as graph: tf.import_graph_def(graph_def) sess = tf.Session(graph=graph) self.session = sess return sess def get_tensor(self): self.x_1 = self.session.graph.get_tensor_by_name("import/x_1:0") self.len_1 = self.session.graph.get_tensor_by_name("import/lengths_1:0") self.emb = 
tf.nn.l2_normalize(self.session.graph.get_tensor_by_name('import/Embedding1/dense/BiasAdd:0'), axis=1) def embedd(self, nodi_input, lengths_input): out_embedding= self.session.run(self.emb, feed_dict = { self.x_1: nodi_input, self.len_1: lengths_input}) return out_embedding ================================================ FILE: neural_network/SAFE_model.py ================================================ # SAFE TEAM # distributed under license: GPL 3 License http://www.gnu.org/licenses/ from SiameseSAFE import SiameseSelfAttentive from PairFactory import PairFactory import tensorflow as tf import random import sys, os import numpy as np from sklearn import metrics import matplotlib import tqdm matplotlib.use('Agg') import matplotlib.pyplot as plt class modelSAFE: def __init__(self, flags, embedding_matrix): self.embedding_size = flags.embedding_size self.num_epochs = flags.num_epochs self.learning_rate = flags.learning_rate self.l2_reg_lambda = flags.l2_reg_lambda self.num_checkpoints = flags.num_checkpoints self.logdir = flags.logdir self.logger = flags.logger self.seed = flags.seed self.batch_size = flags.batch_size self.max_instructions = flags.max_instructions self.embeddings_matrix = embedding_matrix self.session = None self.db_name = flags.db_name self.trainable_embeddings = flags.trainable_embeddings self.cross_val = flags.cross_val self.attention_hops = flags.attention_hops self.attention_depth = flags.attention_depth self.dense_layer_size = flags.dense_layer_size self.rnn_state_size = flags.rnn_state_size random.seed(self.seed) np.random.seed(self.seed) print(self.db_name) # loads an usable model # returns the network and a tensorflow session in which the network can be used. @staticmethod def load_model(path): session = tf.Session() checkpoint_dir = os.path.abspath(os.path.join(path, "checkpoints")) saver = tf.train.import_meta_graph(os.path.join(checkpoint_dir, "model.meta")) tf.global_variables_initializer().run(session=session) saver.restore(session, os.path.join(checkpoint_dir, "model")) network = SiameseSelfAttentive( rnn_state_size=1, learning_rate=1, l2_reg_lambda=1, batch_size=1, max_instructions=1, embedding_matrix=1, trainable_embeddings=1, attention_hops=1, attention_depth=1, dense_layer_size=1, embedding_size=1 ) network.restore_model(session) return session, network def create_network(self): self.network = SiameseSelfAttentive( rnn_state_size=self.rnn_state_size, learning_rate=self.learning_rate, l2_reg_lambda=self.l2_reg_lambda, batch_size=self.batch_size, max_instructions=self.max_instructions, embedding_matrix=self.embeddings_matrix, trainable_embeddings=self.trainable_embeddings, attention_hops=self.attention_hops, attention_depth=self.attention_depth, dense_layer_size=self.dense_layer_size, embedding_size=self.embedding_size ) def train(self): tf.reset_default_graph() with tf.Graph().as_default() as g: session_conf = tf.ConfigProto( allow_soft_placement=True, log_device_placement=False ) sess = tf.Session(config=session_conf) # Sets the graph-level random seed. 
tf.set_random_seed(self.seed) self.create_network() self.network.generate_new_safe() # --tbrtr # Initialize all variables sess.run(tf.global_variables_initializer()) # TensorBoard # Summaries for loss and accuracy loss_summary = tf.summary.scalar("loss", self.network.loss) # Train Summaries train_summary_op = tf.summary.merge([loss_summary]) train_summary_dir = os.path.join(self.logdir, "summaries", "train") train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph) # Validation summaries val_summary_op = tf.summary.merge([loss_summary]) val_summary_dir = os.path.join(self.logdir, "summaries", "validation") val_summary_writer = tf.summary.FileWriter(val_summary_dir, sess.graph) # Test summaries test_summary_op = tf.summary.merge([loss_summary]) test_summary_dir = os.path.join(self.logdir, "summaries", "test") test_summary_writer = tf.summary.FileWriter(test_summary_dir, sess.graph) # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it checkpoint_dir = os.path.abspath(os.path.join(self.logdir, "checkpoints")) checkpoint_prefix = os.path.join(checkpoint_dir, "model") if not os.path.exists(checkpoint_dir): os.makedirs(checkpoint_dir) saver = tf.train.Saver(tf.global_variables(), max_to_keep=self.num_checkpoints) best_val_auc = 0 stat_file = open(str(self.logdir) + "/epoch_stats.tsv", "w") stat_file.write("#epoch\ttrain_loss\tval_loss\tval_auc\ttest_loss\ttest_auc\n") p_train = PairFactory(self.db_name, 'train_pairs', self.batch_size, self.max_instructions) p_validation = PairFactory(self.db_name, 'validation_pairs', self.batch_size, self.max_instructions, False) p_test = PairFactory(self.db_name, 'test_pairs', self.batch_size, self.max_instructions, False) step = 0 for epoch in range(0, self.num_epochs): epoch_msg = "" epoch_msg += " epoch: {}\n".format(epoch) epoch_loss = 0 # ----------------------# # TRAIN # # ----------------------# n_batch = 0 for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm( p_train.async_chunker(epoch % 25), total=p_train.num_batches): feed_dict = { self.network.x_1: function1_batch, self.network.x_2: function2_batch, self.network.lengths_1: len1_batch, self.network.lengths_2: len2_batch, self.network.y: y_batch, } summaries, _, loss, norms, cs = sess.run( [train_summary_op, self.network.train_step, self.network.loss, self.network.norms, self.network.cos_similarity], feed_dict=feed_dict) train_summary_writer.add_summary(summaries, step) epoch_loss += loss * p_train.batch_dim # ??? 
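# batch_dim is the number of samples actually contained in the batch above, so
# each batch loss is weighted by it here and the accumulated value is divided
# by num_pairs at the end of the epoch to obtain an average per-pair loss.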
step += 1 # recap epoch epoch_loss /= p_train.num_pairs epoch_msg += "\ttrain_loss: {}\n".format(epoch_loss) # ----------------------# # VALIDATION # # ----------------------# val_loss = 0 epoch_msg += "\n" val_y = [] val_pred = [] for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm( p_validation.async_chunker(0), total=p_validation.num_batches): feed_dict = { self.network.x_1: function1_batch, self.network.x_2: function2_batch, self.network.lengths_1: len1_batch, self.network.lengths_2: len2_batch, self.network.y: y_batch, } summaries, loss, similarities = sess.run( [val_summary_op, self.network.loss, self.network.cos_similarity], feed_dict=feed_dict) val_loss += loss * p_validation.batch_dim val_summary_writer.add_summary(summaries, step) val_y.extend(y_batch) val_pred.extend(similarities.tolist()) val_loss /= p_validation.num_pairs if np.isnan(val_pred).any(): print("Validation: carefull there is NaN in some ouput values, I am fixing it but be aware...") val_pred = np.nan_to_num(val_pred) val_fpr, val_tpr, val_thresholds = metrics.roc_curve(val_y, val_pred, pos_label=1) val_auc = metrics.auc(val_fpr, val_tpr) epoch_msg += "\tval_loss : {}\n\tval_auc : {}\n".format(val_loss, val_auc) sys.stdout.write( "\r\tepoch {} / {}, loss {:g}, val_auc {:g}, norms {}".format(epoch, self.num_epochs, epoch_loss, val_auc, norms)) sys.stdout.flush() # execute test only if validation auc increased test_loss = "-" test_auc = "-" # in case of cross validation we do not need to evaluate on a test split that is effectively missing if val_auc > best_val_auc and self.cross_val: # ##-- --## # best_val_auc = val_auc saver.save(sess, checkpoint_prefix) print("\nNEW BEST_VAL_AUC: {} !\n".format(best_val_auc)) # write ROC raw data with open(str(self.logdir) + "/best_val_roc.tsv", "w") as the_file: the_file.write("#thresholds\ttpr\tfpr\n") for t, tpr, fpr in zip(val_thresholds, val_tpr, val_fpr): the_file.write("{}\t{}\t{}\n".format(t, tpr, fpr)) # in case we are not cross validating we expect to have a test split. 
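# The test split below is evaluated, and its ROC curve written out, only when a
# new best validation AUC is reached, so the reported test figures always refer
# to the model selected on the validation split.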
if val_auc > best_val_auc and not self.cross_val: best_val_auc = val_auc epoch_msg += "\tNEW BEST_VAL_AUC: {} !\n".format(best_val_auc) # save best model saver.save(sess, checkpoint_prefix) # ----------------------# # TEST # # ----------------------# # TEST test_loss = 0 epoch_msg += "\n" test_y = [] test_pred = [] for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm( p_test.async_chunker(0), total=p_test.num_batches): feed_dict = { self.network.x_1: function1_batch, self.network.x_2: function2_batch, self.network.lengths_1: len1_batch, self.network.lengths_2: len2_batch, self.network.y: y_batch, } summaries, loss, similarities = sess.run( [test_summary_op, self.network.loss, self.network.cos_similarity], feed_dict=feed_dict) test_loss += loss * p_test.batch_dim test_summary_writer.add_summary(summaries, step) test_y.extend(y_batch) test_pred.extend(similarities.tolist()) test_loss /= p_test.num_pairs if np.isnan(test_pred).any(): print("Test: carefull there is NaN in some ouput values, I am fixing it but be aware...") test_pred = np.nan_to_num(test_pred) test_fpr, test_tpr, test_thresholds = metrics.roc_curve(test_y, test_pred, pos_label=1) # write ROC raw data with open(str(self.logdir) + "/best_test_roc.tsv", "w") as the_file: the_file.write("#thresholds\ttpr\tfpr\n") for t, tpr, fpr in zip(test_thresholds, test_tpr, test_fpr): the_file.write("{}\t{}\t{}\n".format(t, tpr, fpr)) test_auc = metrics.auc(test_fpr, test_tpr) epoch_msg += "\ttest_loss : {}\n\ttest_auc : {}\n".format(test_loss, test_auc) fig = plt.figure() plt.title('Receiver Operating Characteristic') plt.plot(test_fpr, test_tpr, 'b', label='AUC = %0.2f' % test_auc) fig.savefig(str(self.logdir) + "/best_test_roc.png") print( "\nNEW BEST_VAL_AUC: {} !\n\ttest_loss : {}\n\ttest_auc : {}\n".format(best_val_auc, test_loss, test_auc)) plt.close(fig) stat_file.write( "{}\t{}\t{}\t{}\t{}\t{}\n".format(epoch, epoch_loss, val_loss, val_auc, test_loss, test_auc)) self.logger.info("\n{}\n".format(epoch_msg)) stat_file.close() sess.close() return best_val_auc ================================================ FILE: neural_network/SiameseSAFE.py ================================================ import tensorflow as tf # SAFE TEAM # # # distributed under license: CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt) # # Siamese Self-Attentive Network for Binary Similarity: # # arXiv Nostro. # # based on the self attentive network:arXiv:1703.03130 Z. Lin at al. “A structured self-attentive sentence embedding'' # # Authors: SAFE team class SiameseSelfAttentive: def __init__(self, rnn_state_size, # Dimension of the RNN State learning_rate, # Learning rate l2_reg_lambda, batch_size, max_instructions, embedding_matrix, # Matrix containg the embeddings for each asm instruction trainable_embeddings, # if this value is True, the embeddings of the asm instruction are modified by the training. attention_hops, # attention hops parameter r of [1] attention_depth, # attention detph parameter d_a of [1] dense_layer_size, # parameter e of [1] embedding_size, # size of the final function embedding, in our test this is twice the rnn_state_size ): self.rnn_depth = 1 # if this value is modified then the RNN becames a multilayer network. In our tests we fix it to 1 feel free to be adventurous. 
self.learning_rate = learning_rate self.l2_reg_lambda = l2_reg_lambda self.rnn_state_size = rnn_state_size self.batch_size = batch_size self.max_instructions = max_instructions self.embedding_matrix = embedding_matrix self.trainable_embeddings = trainable_embeddings self.attention_hops = attention_hops self.attention_depth = attention_depth self.dense_layer_size = dense_layer_size self.embedding_size = embedding_size # self.generate_new_safe() def restore_model(self, old_session): graph = old_session.graph self.x_1 = graph.get_tensor_by_name("x_1:0") self.x_2 = graph.get_tensor_by_name("x_2:0") self.len_1 = graph.get_tensor_by_name("lengths_1:0") self.len_2 = graph.get_tensor_by_name("lengths_2:0") self.y = graph.get_tensor_by_name('y_:0') self.cos_similarity = graph.get_tensor_by_name("siamese_layer/cosSimilarity:0") self.loss = graph.get_tensor_by_name("Loss/loss:0") self.train_step = graph.get_operation_by_name("Train_Step/Adam") return def self_attentive_network(self, input_x, lengths): # each functions is a list of embeddings id (an id is an index in the embedding matrix) # with this we transform it in a list of embeddings vectors. embbedded_functions = tf.nn.embedding_lookup(self.instructions_embeddings_t, input_x) # We create the GRU RNN (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(self.cell_fw, self.cell_bw, embbedded_functions, sequence_length=lengths, dtype=tf.float32, time_major=False) # We create the matrix H H = tf.concat([output_fw, output_bw], axis=2) # We do a tile to account for training batches ws1_tiled = tf.tile(tf.expand_dims(self.WS1, 0), [tf.shape(H)[0], 1, 1], name="WS1_tiled") ws2_tile = tf.tile(tf.expand_dims(self.WS2, 0), [tf.shape(H)[0], 1, 1], name="WS2_tiled") # we compute the matrix A self.A = tf.nn.softmax(tf.matmul(ws2_tile, tf.nn.tanh(tf.matmul(ws1_tiled, tf.transpose(H, perm=[0, 2, 1])))), name="Attention_Matrix") # embedding matrix M M = tf.identity(tf.matmul(self.A, H), name="Attention_Embedding") # we create the flattened version of M flattened_M = tf.reshape(M, [tf.shape(M)[0], self.attention_hops * self.rnn_state_size * 2]) return flattened_M def generate_new_safe(self): self.instructions_embeddings_t = tf.Variable(initial_value=tf.constant(self.embedding_matrix), trainable=self.trainable_embeddings, name="instructions_embeddings", dtype=tf.float32) self.x_1 = tf.placeholder(tf.int32, [None, self.max_instructions], name="x_1") # List of instructions for Function 1 self.lengths_1 = tf.placeholder(tf.int32, [None], name='lengths_1') # List of lengths for Function 1 # example x_1=[[mov,add,padding,padding],[mov,mov,mov,padding]] # lenghts_1=[2,3] self.x_2 = tf.placeholder(tf.int32, [None, self.max_instructions], name="x_2") # List of instructions for Function 2 self.lengths_2 = tf.placeholder(tf.int32, [None], name='lengths_2') # List of lengths for Function 2 self.y = tf.placeholder(tf.float32, [None], name='y_') # Real label of the pairs, +1 similar, -1 dissimilar. 
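# Overview of the graph built below (cf. Lin et al., arXiv:1703.03130):
#   H = [h_fw ; h_bw]                    outputs of the bidirectional GRU over the instruction embeddings
#   A = softmax(WS2 * tanh(WS1 * H^T))   attention matrix with `attention_hops` rows
#   M = A * H                            attention-weighted function representation
# M is flattened, passed through a ReLU dense layer of size `dense_layer_size`
# and a final dense layer of size `embedding_size` to obtain the function
# embedding. The two embeddings of a pair are compared with a dot product
# (named cosSimilarity in the graph), and the penalization ||A*A^T - I|| is
# added to the squared-error loss.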
# Euclidean norms; p = 2 self.norms = [] # Keeping track of l2 regularization loss (optional) l2_loss = tf.constant(0.0) with tf.name_scope('parameters_Attention'): self.WS1 = tf.Variable(tf.truncated_normal([self.attention_depth, 2 * self.rnn_state_size], stddev=0.1), name="WS1") self.WS2 = tf.Variable(tf.truncated_normal([self.attention_hops, self.attention_depth], stddev=0.1), name="WS2") rnn_layers_fw = [tf.nn.rnn_cell.GRUCell(size) for size in ([self.rnn_state_size] * self.rnn_depth)] rnn_layers_bw = [tf.nn.rnn_cell.GRUCell(size) for size in ([self.rnn_state_size] * self.rnn_depth)] self.cell_fw = tf.nn.rnn_cell.MultiRNNCell(rnn_layers_fw) self.cell_bw = tf.nn.rnn_cell.MultiRNNCell(rnn_layers_bw) with tf.name_scope('Self-Attentive1'): self.function_1 = self.self_attentive_network(self.x_1, self.lengths_1) with tf.name_scope('Self-Attentive2'): self.function_2 = self.self_attentive_network(self.x_2, self.lengths_2) self.dense_1 = tf.nn.relu(tf.layers.dense(self.function_1, self.dense_layer_size)) self.dense_2 = tf.nn.relu(tf.layers.dense(self.function_2, self.dense_layer_size)) with tf.name_scope('Embedding1'): self.function_embedding_1 = tf.layers.dense(self.dense_1, self.embedding_size) with tf.name_scope('Embedding2'): self.function_embedding_2 = tf.layers.dense(self.dense_2, self.embedding_size) with tf.name_scope('siamese_layer'): self.cos_similarity = tf.reduce_sum(tf.multiply(self.function_embedding_1, self.function_embedding_2), axis=1, name="cosSimilarity") # Loss: squared difference between the pair similarity and its label, plus the attention penalization term with tf.name_scope("Loss"): A_square = tf.matmul(self.A, tf.transpose(self.A, perm=[0, 2, 1])) I = tf.eye(tf.shape(A_square)[1]) I_tiled = tf.tile(tf.expand_dims(I, 0), [tf.shape(A_square)[0], 1, 1], name="I_tiled") self.A_pen = tf.norm(A_square - I_tiled) self.loss = tf.reduce_sum(tf.squared_difference(self.cos_similarity, self.y), name="loss") self.regularized_loss = self.loss + self.l2_reg_lambda * l2_loss + self.A_pen # Train step with tf.name_scope("Train_Step"): self.train_step = tf.train.AdamOptimizer(self.learning_rate).minimize(self.regularized_loss) ================================================ FILE: neural_network/__init__.py ================================================ ================================================ FILE: neural_network/freeze_graph.sh ================================================ #!/bin/sh echo "usage: ./freeze_graph MODEL_DIR FREEZED_NAME" MODEL_DIR=$1 FREEZED_NAME=$2 freeze_graph --input_meta_graph $MODEL_DIR/checkpoints/model.meta --output_graph $FREEZED_NAME --output_node_names Embedding1/dense/BiasAdd --input_binary true --input_checkpoint $MODEL_DIR/checkpoints/model ================================================ FILE: neural_network/parameters.py ================================================ # SAFE TEAM # distributed under license: GPL 3 License http://www.gnu.org/licenses/ import argparse import time import sys, os import logging # # Parameters File for the SAFE network.
# # Authors: SAFE team def getLogger(logfile): logger = logging.getLogger(__name__) hdlr = logging.FileHandler(logfile) formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s') hdlr.setFormatter(formatter) logger.addHandler(hdlr) logger.setLevel(logging.INFO) return logger, hdlr class Flags: def __init__(self): parser = argparse.ArgumentParser(description='SAFE') parser.add_argument("-o", "--output", dest="output_file", help="output directory for logging and models", required=False) parser.add_argument("-e", "--embedder", dest="embedder_folder", help="file with the embedding matrix and dictionary for asm instructions", required=False) parser.add_argument("-n", "--dbName", dest="db_name", help="Name of the database", required=False) parser.add_argument("-ld", "--load_dir", dest="load_dir", help="Load the model from directory load_dir", required=False) parser.add_argument("-r", "--random", help="if present the network use random embedder", default=False, action="store_true", dest="random_embedding", required=False) parser.add_argument("-te", "--trainable_embedding", help="if present the network consider the embedding as trainable", action="store_true", dest="trainable_embeddings", default=False) parser.add_argument("-cv", "--cross_val", help="if present the training is done with cross validiation", default=False, action="store_true", dest="cross_val") args = parser.parse_args() # mode = mean_field self.batch_size = 250 # minibatch size (-1 = whole dataset) self.num_epochs = 50 # number of epochs self.embedding_size = 100 # dimension of the function embedding self.learning_rate = 0.001 # init learning_rate self.l2_reg_lambda = 0 # 0.002 #0.002 # regularization coefficient self.num_checkpoints = 1 # max number of checkpoints self.out_dir = args.output_file # directory for logging self.rnn_state_size = 50 # dimesion of the rnn state self.db_name = args.db_name self.load_dir = str(args.load_dir) self.random_embedding = args.random_embedding self.trainable_embeddings = args.trainable_embeddings self.cross_val = args.cross_val self.cross_val_fold = 5 # ## ## RNN PARAMETERS, these parameters are only used for RNN model. 
# self.rnn_depth = 1 # depth of the rnn self.max_instructions = 150 # number of instructions ## ATTENTION PARAMETERS self.attention_hops = 10 self.attention_depth = 250 # RNN SINGLE PARAMETER self.dense_layer_size = 2000 self.seed = 2 # random seed # create logdir and logger self.reset_logdir() self.embedder_folder = args.embedder_folder def reset_logdir(self): # create logdir timestamp = str(int(time.time())) self.logdir = os.path.abspath(os.path.join(self.out_dir, "runs", timestamp)) os.makedirs(self.logdir, exist_ok=True) # create logger self.log_file = str(self.logdir) + '/console.log' self.logger, self.hdlr = getLogger(self.log_file) # create symlink for last_run sym_path_logdir = str(self.out_dir) + "/last_run" try: os.unlink(sym_path_logdir) except: pass try: os.symlink(self.logdir, sym_path_logdir) except: print("\nfailed to create symlink!\n") def close_log(self): self.hdlr.close() self.logger.removeHandler(self.hdlr) handlers = self.logger.handlers[:] for handler in handlers: handler.close() self.logger.removeHandler(handler) def __str__(self): msg = "" msg += "\nParameters:\n" msg += "\tRandom embedding: {}\n".format(self.random_embedding) msg += "\tTrainable embedding: {}\n".format(self.trainable_embeddings) msg += "\tlogdir: {}\n".format(self.logdir) msg += "\tbatch_size: {}\n".format(self.batch_size) msg += "\tnum_epochs: {}\n".format(self.num_epochs) msg += "\tembedding_size: {}\n".format(self.embedding_size) msg += "\trnn_state_size: {}\n".format(self.rnn_state_size) msg += "\tattention depth: {}\n".format(self.attention_depth) msg += "\tattention hops: {}\n".format(self.attention_hops) msg += "\tdense layer e: {}\n".format(self.dense_layer_size) msg += "\tlearning_rate: {}\n".format(self.learning_rate) msg += "\tl2_reg_lambda: {}\n".format(self.l2_reg_lambda) msg += "\tnum_checkpoints: {}\n".format(self.num_checkpoints) msg += "\tseed: {}\n".format(self.seed) msg += "\tMax Instructions per functions: {}\n".format(self.max_instructions) return msg ================================================ FILE: neural_network/train.py ================================================ from SAFE_model import modelSAFE from parameters import Flags import sys import os import numpy as np from utils import utils import traceback def load_embedding_matrix(embedder_folder): matrix_file='embedding_matrix.npy' matrix_path=os.path.join(embedder_folder,matrix_file) if os.path.isfile(matrix_path): try: print('Loading embedding matrix....') with open(matrix_path,'rb') as f: return np.float32(np.load(f)) except Exception as e: print("Exception handling file:"+str(matrix_path)) print("Embedding matrix cannot be load") print(str(e)) sys.exit(-1) else: print('Embedding matrix not found at path:'+str(matrix_path)) sys.exit(-1) def run_test(): flags = Flags() flags.logger.info("\n{}\n".format(flags)) print(str(flags)) embedding_matrix = load_embedding_matrix(flags.embedder_folder) if flags.random_embedding: embedding_matrix = np.random.rand(*np.shape(embedding_matrix)).astype(np.float32) embedding_matrix[0, :] = np.zeros(np.shape(embedding_matrix)[1]).astype(np.float32) if flags.cross_val: print("STARTING CROSS VALIDATION") res = [] mean = 0 for i in range(0, flags.cross_val_fold): print("CROSS VALIDATION STARTING FOLD: " + str(i)) if i > 0: flags.close_log() flags.reset_logdir() del flags flags = Flags() flags.logger.info("\n{}\n".format(flags)) flags.logger.info("Starting cross validation fold: {}".format(i)) flags.db_name = flags.db_name + "_val_" + str(i+1) + ".db" flags.logger.info("Cross 
validation db name: {}".format(flags.db_name)) trainer = modelSAFE(flags, embedding_matrix) best_val_auc = trainer.train() mean += best_val_auc res.append(best_val_auc) flags.logger.info("Cross validation fold {} finished best auc: {}".format(i, best_val_auc)) print("FINISH FOLD: " + str(i) + " BEST VAL AUC: " + str(best_val_auc)) print("CROSS VALIDATION ENDED") print("Result: " + str(res)) print("") flags.logger.info("Cross validation finished results: {}".format(res)) flags.logger.info(" mean: {}".format(mean / flags.cross_val_fold)) flags.close_log() else: trainer = modelSAFE(flags, embedding_matrix) trainer.train() flags.close_log() if __name__ == '__main__': utils.print_safe() print('-Trainer for SAFE-') run_test() ================================================ FILE: neural_network/train.sh ================================================ #!/bin/sh BASE_PATH="/home/luca/work/binary_similarity_data/" DATA_PATH=$BASE_PATH/experiments/arith_mean_openSSL_no_dropout_no_shuffle_no_regeneration_emb_random_trainable OUT_PATH=$DATA_PATH/out DB_PATH=$BASE_PATH/databases/openSSL_data.db EMBEDDER=$BASE_PATH/word2vec/filtered_100_embeddings/ RANDOM="" TRAINABLE_EMBEDD="" python3 train.py $RANDOM $TRAINABLE_EMBEDD --o $OUT_PATH -n $DB_PATH -e $EMBEDDER ================================================ FILE: requirements.txt ================================================ tensorflow sklearn numpy scipy matplotlib tqdm r2pipe pyfiglet ================================================ FILE: safe.py ================================================ # SAFE TEAM # Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni from asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer from argparse import ArgumentParser from asm_embedding.FunctionNormalizer import FunctionNormalizer from asm_embedding.InstructionsConverter import InstructionsConverter from neural_network.SAFEEmbedder import SAFEEmbedder from utils import utils class SAFE: def __init__(self, model): self.converter = InstructionsConverter("data/i2v/word2id.json") self.normalizer = FunctionNormalizer(max_instruction=150) self.embedder = SAFEEmbedder(model) self.embedder.loadmodel() self.embedder.get_tensor() def embedd_function(self, filename, address): analyzer = RadareFunctionAnalyzer(filename, use_symbol=False, depth=0) functions = analyzer.analyze() instructions_list = None for function in functions: if functions[function]['address'] == address: instructions_list = functions[function]['filtered_instructions'] break if instructions_list is None: print("Function not found") return None converted_instructions = self.converter.convert_to_ids(instructions_list) instructions, length = self.normalizer.normalize_functions([converted_instructions]) embedding = self.embedder.embedd(instructions, length) return embedding if __name__ == '__main__': utils.print_safe() parser = ArgumentParser(description="Safe Embedder") parser.add_argument("-m", "--model", help="Safe trained model to generate function embeddings") parser.add_argument("-i", "--input", help="Input executable that contains the function to embedd") parser.add_argument("-a", "--address", help="Hexadecimal address of the function to embedd") args = parser.parse_args() address = int(args.address, 16) safe = SAFE(args.model) embedding = safe.embedd_function(args.input, address) print(embedding[0]) ================================================ FILE: utils/__init__.py ================================================ 
================================================ FILE: utils/utils.py ================================================ from pyfiglet import figlet_format def print_safe(): a = figlet_format('SAFE', font='starwars') print(a) print("By Massarelli L., Di Luna G. A., Petroni F., Querzoni L., Baldoni R.") print("Please cite: http://arxiv.org/abs/1811.05296 \n")
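For illustration, here is a minimal sketch of how the pieces above fit together, assuming the frozen model and i2v data have been downloaded into `data/` and using the repository's `helloworld.o` plus a placeholder second binary and address. Since `SAFEEmbedder.get_tensor` L2-normalizes the output tensor, the dot product of two embeddings is their cosine similarity.

```
# Illustrative sketch, not part of the repository: embed two functions with the
# SAFE class from safe.py and compare them. The second binary name and both
# addresses are placeholders.
import numpy as np
from safe import SAFE

safe = SAFE("data/safe.pb")  # frozen model obtained via ./download_model.sh

emb_x = safe.embedd_function("helloworld.o", 0x100000F30)
emb_y = safe.embedd_function("another_binary.o", 0x401000)  # hypothetical binary/address

# embedd_function returns None when the address does not match any function
# found by radare2.
if emb_x is not None and emb_y is not None:
    # embeddings come out of the frozen graph already L2-normalized,
    # so the dot product equals the cosine similarity
    print("similarity:", float(np.dot(emb_x[0], emb_y[0])))
```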