Repository: gadiluna/SAFE
Branch: master
Commit: fddfca90e111
Files: 40
Total size: 100.8 KB
Directory structure:
gitextract_60c_bmdf/
├── 404.html
├── Gemfile
├── LICENSE
├── README.md
├── __init__.py
├── _config.yml
├── asm_embedding/
│ ├── DocumentManipulation.py
│ ├── FunctionAnalyzerRadare.py
│ ├── FunctionNormalizer.py
│ ├── InstructionsConverter.py
│ └── __init__.py
├── dataset_creation/
│ ├── DataSplitter.py
│ ├── DatabaseFactory.py
│ ├── ExperimentUtil.py
│ ├── FunctionsEmbedder.py
│ ├── __init__.py
│ └── convertDB.py
├── download_model.sh
├── downloader.py
├── function_search/
│ ├── EvaluateSearchEngine.py
│ ├── FunctionSearchEngine.py
│ ├── __init__.py
│ └── fromJsonSearchToPlot.py
├── godown.pl
├── helloworld.c
├── helloworld.o
├── index.md
├── neural_network/
│ ├── PairFactory.py
│ ├── SAFEEmbedder.py
│ ├── SAFE_model.py
│ ├── SiameseSAFE.py
│ ├── __init__.py
│ ├── freeze_graph.sh
│ ├── parameters.py
│ ├── train.py
│ └── train.sh
├── requirements.txt
├── safe.py
└── utils/
├── __init__.py
└── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: 404.html
================================================
---
layout: default
---
<style type="text/css" media="screen">
.container {
margin: 10px auto;
max-width: 600px;
text-align: center;
}
h1 {
margin: 30px 0;
font-size: 4em;
line-height: 1;
letter-spacing: -1px;
}
</style>
<div class="container">
<h1>404</h1>
<p><strong>Page not found :(</strong></p>
<p>The requested page could not be found.</p>
</div>
================================================
FILE: Gemfile
================================================
source "https://rubygems.org"
# Hello! This is where you manage which Jekyll version is used to run.
# When you want to use a different version, change it below, save the
# file and run `bundle install`. Run Jekyll with `bundle exec`, like so:
#
# bundle exec jekyll serve
#
# This will help ensure the proper Jekyll version is running.
# Happy Jekylling!
gem "jekyll", "~> 3.7.4"
# This is the default theme for new Jekyll sites. You may change this to anything you like.
gem "minima", "~> 2.0"
# If you want to use GitHub Pages, remove the "gem "jekyll"" above and
# uncomment the line below. To upgrade, run `bundle update github-pages`.
# gem "github-pages", group: :jekyll_plugins
#gem "github-pages", group: :jekyll_plugins
# If you have any plugins, put them here!
group :jekyll_plugins do
gem "jekyll-feed", "~> 0.6"
end
# Windows does not include zoneinfo files, so bundle the tzinfo-data gem
gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
# Performance-booster for watching directories on Windows
gem "wdm", "~> 0.1.0" if Gem.win_platform?
================================================
FILE: LICENSE
================================================
Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
================================================
FILE: README.md
================================================
# SAFE: Self-Attentive Function Embedding
Paper
---
This software is the outcome of our academic research. See our arXiv paper: [arXiv](https://arxiv.org/abs/1811.05296).
If you use this code, please cite our academic paper as:
```bibtex
@inproceedings{massarelli2018safe,
title={SAFE: Self-Attentive Function Embeddings for Binary Similarity},
author={Massarelli, Luca and Di Luna, Giuseppe Antonio and Petroni, Fabio and Querzoni, Leonardo and Baldoni, Roberto},
booktitle={Proceedings of 16th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA)},
year={2019}
}
```
What you need
-----
You need [radare2](https://github.com/radare/radare2) installed on your system.
Quickstart
-----
To create the embedding of a function:
```
git clone https://github.com/gadiluna/SAFE.git
pip install -r requirements.txt
chmod +x download_model.sh
./download_model.sh
python safe.py -m data/safe.pb -i helloworld.o -a 100000F30
```
#### What to do with an embedding?
Once you have two embeddings ```embedding_x``` and ```embedding_y``` you can compute the similarity of the corresponding functions as:
```
from sklearn.metrics.pairwise import cosine_similarity
sim=cosine_similarity(embedding_x, embedding_y)
```
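Note that `cosine_similarity` expects 2-D arrays. If your embeddings come out as 1-D NumPy vectors, reshape them first; a minimal sketch:
```
from sklearn.metrics.pairwise import cosine_similarity

embedding_x = embedding_x.reshape(1, -1)  # row vector of shape (1, n)
embedding_y = embedding_y.reshape(1, -1)
sim = cosine_similarity(embedding_x, embedding_y)[0][0]  # scalar similarity
```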
Data Needed
-----
SAFE needs a few pieces of data to work. Two are essential: a model that tells SAFE how to
convert assembly instructions into vectors (the i2v model), and a model that tells SAFE how
to convert a binary function into a vector.
Both models can be downloaded using the command:
```
./download_model.sh
```
The downloader fetches both models and places them in the directory data.
The directory tree after the download should be:
```
safe/-- githubcode
\
\--data/-----safe.pb
\
\---i2v/
```
The safe.pb file contains the SAFE model used to convert binary functions into vectors.
The i2v folder contains the i2v model.
Hardcore Details
----
This section contains details that are needed to replicate our experiments; if you are a user of SAFE you can skip
it.
### Safe.pb
This is the frozen TensorFlow model trained for the AMD64 architecture. You can import it in your project using:
```
import tensorflow as tf
with tf.gfile.GFile("safe.pb", "rb") as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as graph:
tf.import_graph_def(graph_def)
sess = tf.Session(graph=graph)
```
see file: neural_network/SAFEEmbedder.py
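The input and output tensor names are defined by the training code; if you are unsure which ones to use, you can list the operations of the imported graph. A minimal inspection sketch (not part of SAFEEmbedder):
```
# tf.import_graph_def prefixes the imported names with "import/" by default
for op in graph.get_operations():
    print(op.name)
```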
### i2v
The i2v folder contains two files:
a matrix where each row is the embedding of an asm instruction, and
a JSON file containing a dictionary that maps asm instructions to row numbers of the matrix above.
see file: asm_embedding/InstructionsConverter.py
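A rough sketch of how the two files fit together; word2id.json matches the path used in dataset_creation/DatabaseFactory.py, while the matrix file name and the instruction key below are hypothetical placeholders:
```
import json
import numpy as np

with open("data/i2v/word2id.json") as f:
    word2id = json.load(f)  # filtered asm instruction -> matrix row number

matrix = np.load("data/i2v/embedding_matrix.npy")  # hypothetical file name
vector = matrix[word2id["X_mov_rax,rbx"]]  # hypothetical instruction key
```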
## Train the model
If you want to train the model using our datasets, you first have to run:
```
python3 downloader.py -td
```
This will download the datasets into the data folder. Note that the datasets are compressed, so you have to decompress them yourself, as shown below.
Each decompressed dataset is an SQLite database.
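For example, to decompress one of the downloaded archives (the archive names are those listed in downloader.py):
```
tar -xvf data/AMD64PostgreSQL.tar.bz2 -C data
```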
To start the training, use neural_network/train.sh.
The database can be selected by changing the corresponding parameter in train.sh.
For more information on the datasets, see our paper.
## Create your own dataset
If you want to create your own dataset you can use the script ExperimentUtil.py in the folder
dataset_creation, as in the example below.
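A minimal example, assuming you run from the repository root and that `path/to/binaries` (a placeholder) contains object files laid out as project/compiler/optimization/file, the layout DatabaseFactory expects:
```
python3 -m dataset_creation.ExperimentUtil -db mydataset.db -b -dir path/to/binaries -sym
```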
## Create a functions knowledge base
If you want to use the SAFE binary code search engine, you can use the script ExperimentUtil.py to create
the knowledge base, as in the example below.
You can then search through it using the scripts in the folder function_search.
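A minimal example, assuming a database built as above and the data/safe.pb model downloaded by download_model.sh:
```
python3 -m dataset_creation.ExperimentUtil -db mydataset.db -e -mod data/safe.pb
```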
Related Projects
---
* YARASAFE: Automatic Binary Function Similarity Checks with Yara (https://github.com/lucamassarelli/yarasafe)
* SAFEtorch: PyTorch implementation of the SAFE neural network (https://github.com/facebookresearch/SAFEtorch)
Thanks
---
In our code we use [godown](https://github.com/circulosmeos/gdown.pl) to download data from Google Drive. We thank
circulosmeos, the creator of godown.
We thank Davide Italiano for the useful discussions.
================================================
FILE: __init__.py
================================================
================================================
FILE: _config.yml
================================================
# Welcome to Jekyll!
#
# This config file is meant for settings that affect your whole blog, values
# which you are expected to set up once and rarely edit after that. If you find
# yourself editing this file very often, consider using Jekyll's data files
# feature for the data you need to update frequently.
#
# For technical reasons, this file is *NOT* reloaded automatically when you use
# 'bundle exec jekyll serve'. If you change this file, please restart the server process.
# Site settings
# These are used to personalize your new site. If you look in the HTML files,
# you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
# You can create any custom variable you would like, and they will be accessible
# in the templates via {{ site.myvariable }}.
title: 'SAFE: Self-Attentive Function Embeddings'
email: safeteam@gmail.com
description: >- # this means to ignore newlines until "baseurl:"
Self-Attentive Function Embeddings for binary similarity.
https://arxiv.org/abs/1811.05296
baseurl: "" # the subpath of your site, e.g. /blog
url: "" # the base hostname & protocol for your site, e.g. http://example.com
twitter_username:
github_username:
# Build settings
markdown: kramdown
theme: minima
#theme: jekyll-theme-midnight
plugins:
- jekyll-feed
# Exclude from processing.
# The following items will not be processed, by default. Create a custom list
# to override the default setting.
# exclude:
# - Gemfile
# - Gemfile.lock
# - node_modules
# - vendor/bundle/
# - vendor/cache/
# - vendor/gems/
# - vendor/ruby/
================================================
FILE: asm_embedding/DocumentManipulation.py
================================================
import json
import re
import os
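# Builds a plain-text corpus of filtered asm instructions from a JSON dump:
# each function's instruction list becomes one whitespace-separated block ending
# with five 'endfun' markers, and functions are deduplicated by their first field.
# The paths below are the authors' hardcoded local paths.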
def list_to_str(instructions):
doc = ''
for inst in instructions:
doc = doc + ' ' + inst
doc = doc + ' endfun' * 5
return doc
def document_append(text):
with open('/Users/giuseppe/docuent_X86', 'a') as f:
f.write(text)
seen_functions = set()
num_total = 0
num_filtered = 0
with open('/Users/giuseppe/dump.x86.linux.json') as f:
line = f.readline()
print('loaded')
chunks = re.split(r'(\[.*?\])(?= *\[)', line)
del line
for chunk in chunks:
if '[' in chunk:
functions = json.loads(chunk)
for entry in functions:
num_total = num_total + 1
if entry[0] not in seen_functions:
seen_functions.add(entry[0])
num_filtered = num_filtered + 1
document_append(list_to_str(entry[1]))
print(num_total)
print(num_filtered)
================================================
FILE: asm_embedding/FunctionAnalyzerRadare.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import json
import r2pipe
class RadareFunctionAnalyzer:
def __init__(self, filename, use_symbol, depth):
self.r2 = r2pipe.open(filename, flags=['-2'])
self.filename = filename
self.arch, _ = self.get_arch()
self.top_depth = depth
self.use_symbol = use_symbol
def __enter__(self):
return self
@staticmethod
def filter_reg(op):
return op["value"]
@staticmethod
def filter_imm(op):
imm = int(op["value"])
if -5000 <= imm <= 5000:
ret = hex(imm)
else:
ret = 'HIMM'
return ret
@staticmethod
def filter_mem(op):
if "base" not in op:
op["base"] = 0
if op["base"] == 0:
r = "[" + "MEM" + "]"
else:
reg_base = str(op["base"])
disp = str(op["disp"])
scale = str(op["scale"])
r = '[' + reg_base + "*" + scale + "+" + disp + ']'
return r
@staticmethod
def filter_memory_references(i):
inst = "" + i["mnemonic"]
for op in i["operands"]:
if op["type"] == 'reg':
inst += " " + RadareFunctionAnalyzer.filter_reg(op)
elif op["type"] == 'imm':
inst += " " + RadareFunctionAnalyzer.filter_imm(op)
elif op["type"] == 'mem':
inst += " " + RadareFunctionAnalyzer.filter_mem(op)
if len(i["operands"]) > 1:
inst = inst + ","
if "," in inst:
inst = inst[:-1]
inst = inst.replace(" ", "_")
return str(inst)
@staticmethod
def get_callref(my_function, depth):
calls = {}
if 'callrefs' in my_function and depth > 0:
for cc in my_function['callrefs']:
if cc["type"] == "C":
calls[cc['at']] = cc['addr']
return calls
def get_instruction(self):
instruction = json.loads(self.r2.cmd("aoj 1"))
if len(instruction) > 0:
instruction = instruction[0]
else:
return None
operands = []
if 'opex' not in instruction:
return None
for op in instruction['opex']['operands']:
operands.append(op)
instruction['operands'] = operands
return instruction
def function_to_inst(self, functions_dict, my_function, depth):
instructions = []
asm = ""
if self.use_symbol:
s = my_function['vaddr']
else:
s = my_function['offset']
calls = RadareFunctionAnalyzer.get_callref(my_function, depth)
self.r2.cmd('s ' + str(s))
if self.use_symbol:
end_address = s + my_function["size"]
else:
end_address = s + my_function["realsz"]
while s < end_address:
instruction = self.get_instruction()
# get_instruction returns None on decoding failures; stop instead of crashing
if instruction is None:
break
asm += instruction["bytes"]
if self.arch == 'x86':
filtered_instruction = "X_" + RadareFunctionAnalyzer.filter_memory_references(instruction)
elif self.arch == 'arm':
filtered_instruction = "A_" + RadareFunctionAnalyzer.filter_memory_references(instruction)
else:
break
instructions.append(filtered_instruction)
if s in calls and depth > 0:
if calls[s] in functions_dict:
ii, aa = self.function_to_inst(functions_dict, functions_dict[calls[s]], depth-1)
instructions.extend(ii)
asm += aa
self.r2.cmd("s " + str(s))
self.r2.cmd("so 1")
s = int(self.r2.cmd("s"), 16)
return instructions, asm
def get_arch(self):
arch = None
bits = None
try:
info = json.loads(self.r2.cmd('ij'))
if 'bin' in info:
arch = info['bin']['arch']
bits = info['bin']['bits']
except:
print("Error loading file")
arch = None
bits = None
return arch, bits
def find_functions(self):
self.r2.cmd('aaa')
try:
function_list = json.loads(self.r2.cmd('aflj'))
except:
function_list = []
return function_list
def find_functions_by_symbols(self):
self.r2.cmd('aa')
try:
symbols = json.loads(self.r2.cmd('isj'))
fcn_symb = [s for s in symbols if s['type'] == 'FUNC']
except:
fcn_symb = []
return fcn_symb
def analyze(self):
if self.use_symbol:
function_list = self.find_functions_by_symbols()
else:
function_list = self.find_functions()
functions_dict = {}
if self.top_depth > 0:
for my_function in function_list:
if self.use_symbol:
functions_dict[my_function['vaddr']] = my_function
else:
functions_dict[my_function['offset']] = my_function
result = {}
for my_function in function_list:
if self.use_symbol:
address = my_function['vaddr']
else:
address = my_function['offset']
try:
instructions, asm = self.function_to_inst(functions_dict, my_function, self.top_depth)
result[my_function['name']] = {'filtered_instructions': instructions, "asm": asm, "address": address}
except:
print("Error in functions: {} from {}".format(my_function['name'], self.filename))
pass
return result
def close(self):
self.r2.quit()
def __exit__(self, exc_type, exc_value, traceback):
self.r2.quit()
================================================
FILE: asm_embedding/FunctionNormalizer.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import numpy as np
class FunctionNormalizer:
def __init__(self, max_instruction):
self.max_instructions = max_instruction
def normalize(self, f):
f = np.asarray(f[0:self.max_instructions])
length = f.shape[0]
if f.shape[0] < self.max_instructions:
f = np.pad(f, (0, self.max_instructions - f.shape[0]), mode='constant')
return f, length
def normalize_function_pairs(self, pairs):
lengths = []
new_pairs = []
for x in pairs:
f0, len0 = self.normalize(x[0])
f1, len1 = self.normalize(x[1])
lengths.append((len0, len1))
new_pairs.append((f0, f1))
return new_pairs, lengths
def normalize_functions(self, functions):
lengths = []
new_functions = []
for f in functions:
f, length = self.normalize(f)
lengths.append(length)
new_functions.append(f)
return new_functions, lengths
================================================
FILE: asm_embedding/InstructionsConverter.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import json
class InstructionsConverter:
def __init__(self, json_i2id):
with open(json_i2id, 'r') as f:
self.i2id = json.load(f)
def convert_to_ids(self, instructions_list):
ret_array = []
# For each instruction we add +1 to its ID because the first
# element of the embedding matrix is zero
for x in instructions_list:
if x in self.i2id:
ret_array.append(self.i2id[x] + 1)
elif 'X_' in x:
# print(str(x) + " is not a known x86 instruction")
ret_array.append(self.i2id['X_UNK'] + 1)
elif 'A_' in x:
# print(str(x) + " is not a known arm instruction")
ret_array.append(self.i2id['A_UNK'] + 1)
else:
# print("There is a problem " + str(x) + " does not appear to be an asm or arm instruction")
ret_array.append(self.i2id['X_UNK'] + 1)
return ret_array
================================================
FILE: asm_embedding/__init__.py
================================================
================================================
FILE: dataset_creation/DataSplitter.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import json
import random
import sqlite3
from tqdm import tqdm
class DataSplitter:
def __init__(self, db_name):
self.db_name = db_name
def create_pair_table(self, table_name):
conn = sqlite3.connect(self.db_name)
c = conn.cursor()
c.executescript("DROP TABLE IF EXISTS {} ".format(table_name))
c.execute("CREATE TABLE {} (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)".format(table_name))
conn.commit()
conn.close()
def get_ids(self, set_type):
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
q = cur.execute("SELECT id FROM {}".format(set_type))
ids = q.fetchall()
conn.close()
return ids
@staticmethod
def select_similar_cfg(id, provenance, ids, cursor):
q1 = cursor.execute('SELECT id FROM functions WHERE project=? AND file_name=? and function_name=?', provenance)
candidates = [i[0] for i in q1.fetchall() if (i[0] != id and i[0] in ids)]
if len(candidates) == 0:
return None
id_similar = random.choice(candidates)
return id_similar
@staticmethod
def select_dissimilar_cfg(ids, provenance, cursor):
while True:
id_dissimilar = random.choice(ids)
q2 = cursor.execute('SELECT project, file_name, function_name FROM functions WHERE id=?', id_dissimilar)
res = q2.fetchone()
if res != provenance:
break
return id_dissimilar
def create_epoch_pairs(self, epoch_number, pairs_table,id_table):
random.seed(epoch_number)
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
ids = cur.execute("SELECT id FROM "+id_table).fetchall()
id_set=set(ids)
true_pair = []
false_pair = []
for my_id in tqdm(ids):
q = cur.execute('SELECT project, file_name, function_name FROM functions WHERE id =?', my_id)
cfg_0_provenance = q.fetchone()
id_sim = DataSplitter.select_similar_cfg(my_id, cfg_0_provenance, id_set, cur)
id_dissim = DataSplitter.select_dissimilar_cfg(ids, cfg_0_provenance, cur)
if id_sim is not None and id_dissim is not None:
true_pair.append((my_id, id_sim))
false_pair.append((my_id, id_dissim))
true_pair = str(json.dumps(true_pair))
false_pair = str(json.dumps(false_pair))
cur.execute("INSERT INTO {} VALUES (?,?,?)".format(pairs_table), (epoch_number, true_pair, false_pair))
conn.commit()
conn.close()
def create_pairs(self, total_epochs):
self.create_pair_table('train_pairs')
self.create_pair_table('validation_pairs')
self.create_pair_table('test_pairs')
for i in range(0, total_epochs):
print("Creating training pairs for epoch {} of {}".format(i, total_epochs))
self.create_epoch_pairs(i, 'train_pairs','train')
print("Creating validation pairs")
self.create_epoch_pairs(0, 'validation_pairs','validation')
print("Creating test pairs")
self.create_epoch_pairs(0, "test_pairs",'test')
@staticmethod
def prepare_set(data_to_include, table_name, file_list, cur):
i = 0
while i < data_to_include and len(file_list) > 0:
choice = random.choice(file_list)
file_list.remove(choice)
q = cur.execute("SELECT id FROM functions where project=? AND file_name=?", choice)
data = q.fetchall()
cur.executemany("INSERT INTO {} VALUES (?)".format(table_name), data)
i += len(data)
return file_list, i
def split_data(self, validation_dim, test_dim):
random.seed(12345)
conn = sqlite3.connect(self.db_name)
c = conn.cursor()
q = c.execute('''SELECT project, file_name FROM functions ''')
data = q.fetchall()
conn.commit()
num_data = len(data)
num_test = int(num_data * test_dim)
num_validation = int(num_data * validation_dim)
filename = list(set(data))
c.execute("DROP TABLE IF EXISTS train")
c.execute("DROP TABLE IF EXISTS test")
c.execute("DROP TABLE IF EXISTS validation")
c.execute("CREATE TABLE IF NOT EXISTS train (id INTEGER PRIMARY KEY)")
c.execute("CREATE TABLE IF NOT EXISTS validation (id INTEGER PRIMARY KEY)")
c.execute("CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY)")
c.execute('''CREATE INDEX IF NOT EXISTS my_index ON functions(project, file_name, function_name)''')
c.execute('''CREATE INDEX IF NOT EXISTS my_index_2 ON functions(project, file_name)''')
filename, test_num = DataSplitter.prepare_set(num_test, 'test', filename, conn.cursor())
conn.commit()
assert len(filename) > 0
filename, val_num = self.prepare_set(num_validation, 'validation', filename, conn.cursor())
conn.commit()
assert len(filename) > 0
_, train_num = self.prepare_set(num_data - num_test - num_validation, 'train', filename, conn.cursor())
conn.commit()
print("Train Size: {}".format(train_num))
print("Validation Size: {}".format(val_num))
print("Test Size: {}".format(test_num))
================================================
FILE: dataset_creation/DatabaseFactory.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
from asm_embedding.InstructionsConverter import InstructionsConverter
from asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer
import json
import multiprocessing
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
import os
import random
import signal
import sqlite3
from tqdm import tqdm
class DatabaseFactory:
def __init__(self, db_name, root_path):
self.db_name = db_name
self.root_path = root_path
@staticmethod
def worker(item):
DatabaseFactory.analyze_file(item)
return 0
@staticmethod
def extract_function(graph_analyzer):
return graph_analyzer.extractAll()
@staticmethod
def insert_in_db(db_name, pool_sem, func, filename, function_name, instruction_converter):
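# Expects files laid out as .../project/compiler/optimization/file_name;
# the last four path components populate the corresponding db columns.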
path = filename.split(os.sep)
if len(path) < 4:
return
asm = func["asm"]
instructions_list = func["filtered_instructions"]
instruction_ids = json.dumps(instruction_converter.convert_to_ids(instructions_list))
pool_sem.acquire()
conn = sqlite3.connect(db_name)
cur = conn.cursor()
cur.execute('''INSERT INTO functions VALUES (?,?,?,?,?,?,?,?)''', (None, # id
path[-4], # project
path[-3], # compiler
path[-2], # optimization
path[-1], # file_name
function_name, # function_name
asm, # asm
len(instructions_list)) # num of instructions
)
inserted_id = cur.lastrowid
cur.execute('''INSERT INTO filtered_functions VALUES (?,?)''', (inserted_id,
instruction_ids)
)
conn.commit()
conn.close()
pool_sem.release()
@staticmethod
def analyze_file(item):
global pool_sem
os.setpgrp()
filename = item[0]
db = item[1]
use_symbol = item[2]
depth = item[3]
instruction_converter = item[4]
analyzer = RadareFunctionAnalyzer(filename, use_symbol, depth)
p = ThreadPool(1)
res = p.apply_async(analyzer.analyze)
try:
result = res.get(120)
except multiprocessing.TimeoutError:
print("Aborting due to timeout:" + str(filename))
print('Try to modify the timeout value in DatabaseFactory instruction result = res.get(TIMEOUT)')
os.killpg(0, signal.SIGKILL)
except Exception:
print("Aborting due to error:" + str(filename))
os.killpg(0, signal.SIGKILL)
for func in result:
DatabaseFactory.insert_in_db(db, pool_sem, result[func], filename, func, instruction_converter)
analyzer.close()
return 0
# Create the db where data are stored
def create_db(self):
print('Database creation...')
conn = sqlite3.connect(self.db_name)
conn.execute(''' CREATE TABLE IF NOT EXISTS functions (id INTEGER PRIMARY KEY,
project text,
compiler text,
optimization text,
file_name text,
function_name text,
asm text,
num_instructions INTEGER)
''')
conn.execute('''CREATE TABLE IF NOT EXISTS filtered_functions (id INTEGER PRIMARY KEY,
instructions_list text)
''')
conn.commit()
conn.close()
# Scan the root directory to find all the file to analyze,
# query also the db for already analyzed files.
def scan_for_file(self, start):
file_list = []
# Scan recursively all the subdirectory
directories = os.listdir(start)
for item in directories:
item = os.path.join(start,item)
if os.path.isdir(item):
file_list.extend(self.scan_for_file(item + os.sep))
elif os.path.isfile(item) and item.endswith('.o'):
file_list.append(item)
return file_list
# Looks for already existing files in the database
# It returns a list of files that are not in the database
def remove_override(self, file_list):
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
q = cur.execute('''SELECT project, compiler, optimization, file_name FROM functions''')
names = q.fetchall()
names = [os.path.join(self.root_path, n[0], n[1], n[2], n[3]) for n in names]
names = set(names)
# If some files are already in the db, remove them from the file list
if len(names) > 0:
print(str(len(names)) + ' Already in the database')
cleaned_file_list = []
for f in file_list:
if not(f in names):
cleaned_file_list.append(f)
return cleaned_file_list
# root function to create the db
def build_db(self, use_symbol, depth):
global pool_sem
pool_sem = multiprocessing.BoundedSemaphore(value=1)
instruction_converter = InstructionsConverter("data/i2v/word2id.json")
self.create_db()
file_list = self.scan_for_file(self.root_path)
print('Found ' + str(len(file_list)) + ' files during the scan')
file_list = self.remove_override(file_list)
print('Found ' + str(len(file_list)) + ' files to analyze')
random.shuffle(file_list)
t_args = [(f, self.db_name, use_symbol, depth, instruction_converter) for f in file_list]
# Start a parallel pool to analyze files
p = Pool(processes=None, maxtasksperchild=20)
for _ in tqdm(p.imap_unordered(DatabaseFactory.worker, t_args), total=len(file_list)):
pass
p.close()
p.join()
================================================
FILE: dataset_creation/ExperimentUtil.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import argparse
from dataset_creation import DatabaseFactory, DataSplitter, FunctionsEmbedder
from utils.utils import print_safe
def debug_msg():
msg = "SAFE DATABASE UTILITY\n"
msg += "-------------------------------------------------\n"
msg += "This program is a utility to save data into an SQLite database with SAFE \n\n"
msg += "There are three main commands: \n"
msg += "BUILD: It creates a db with two tables: functions, filtered_functions. \n"
msg += " In the first table there are all the functions extracted from the executables, with their hex code.\n"
msg += " In the second table functions are converted to the i2v representation. \n"
msg += "SPLIT: Data are split into train, validation and test sets. " \
" Then it generates the pairs for the training of the network.\n"
msg += "EMBED: Generates the embedding of each function in the database using a trained SAFE model\n\n"
msg += "If you want to train the network use build + split.\n"
msg += "If you want to create a knowledge base for the binary code search engine use build + embed.\n"
msg += "This program has been written by the SAFE team.\n"
msg += "-------------------------------------------------"
return msg
def build_configuration(db_name, root_dir, use_symbols, callee_depth):
msg = "Database creation options: \n"
msg += " - Database Name: {} \n".format(db_name)
msg += " - Root dir: {} \n".format(root_dir)
msg += " - Use symbols: {} \n".format(use_symbols)
msg += " - Callee depth: {} \n".format(callee_depth)
return msg
def split_configuration(db_name, val_split, test_split, epochs):
msg = "Splitting options: \n"
msg += " - Database Name: {} \n".format(db_name)
msg += " - Validation Size: {} \n".format(val_split)
msg += " - Test Size: {} \n".format(test_split)
msg += " - Epochs: {} \n".format(epochs)
return msg
def embedd_configuration(db_name, model, batch_size, max_instruction, embeddings_table):
msg = "Embedding options: \n"
msg += " - Database Name: {} \n".format(db_name)
msg += " - Model: {} \n".format(model)
msg += " - Batch Size: {} \n".format(batch_size)
msg += " - Max Instruction per function: {} \n".format(max_instruction)
msg += " - Table for saving embeddings: {}.".format(embeddings_table)
return msg
if __name__ == '__main__':
print_safe()
parser = argparse.ArgumentParser(description=debug_msg())
parser.add_argument("-db", "--db", help="Name of the database to create", required=True)
parser.add_argument("-b", "--build", help="Build db by disassembling executables", action="store_true")
parser.add_argument("-s", "--split", help="Perform data splitting for training", action="store_true")
parser.add_argument("-e", "--embed", help="Compute functions embedding", action="store_true")
parser.add_argument("-dir", "--dir", help="Root path of the directory to scan")
parser.add_argument("-sym", "--symbols", help="Use it if you want to use symbols", action="store_true")
parser.add_argument("-dep", "--depth", help="Recursive depth for analysis", default=0, type=int)
parser.add_argument("-test", "--test_size", help="Test set size [0-1]", type=float, default=0.2)
parser.add_argument("-val", "--val_size", help="Validation set size [0-1]", type=float, default=0.2)
parser.add_argument("-epo", "--epochs", help="# Epochs to generate pairs for", type=int, default=25)
parser.add_argument("-mod", "--model", help="Model for embedding generation")
parser.add_argument("-bat", "--batch_size", help="Batch size for function embeddings", type=int, default=500)
parser.add_argument("-max", "--max_instruction", help="Maximum instruction per function", type=int, default=150)
parser.add_argument("-etb", "--embeddings_table", help="Name for the table that contains embeddings",
default="safe_embeddings")
try:
args = parser.parse_args()
except:
parser.print_help()
print(debug_msg())
exit(0)
if args.build:
print("Disassembling files and creating dataset")
print(build_configuration(args.db, args.dir, args.symbols, args.depth))
factory = DatabaseFactory.DatabaseFactory(args.db, args.dir)
factory.build_db(args.symbols, args.depth)
if args.split:
print("Splitting data and generating epoch pairs")
print(split_configuration(args.db, args.val_size, args.test_size, args.epochs))
splitter = DataSplitter.DataSplitter(args.db)
splitter.split_data(args.val_size, args.test_size)
splitter.create_pairs(args.epochs)
if args.embed:
print("Computing embeddings for function in db")
print(embedd_configuration(args.db, args.model, args.batch_size, args.max_instruction, args.embeddings_table))
embedder = FunctionsEmbedder.FunctionsEmbedder(args.model, args.batch_size, args.max_instruction)
embedder.compute_and_save_embeddings_from_db(args.db, args.embeddings_table)
exit(0)
================================================
FILE: dataset_creation/FunctionsEmbedder.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
from asm_embedding.FunctionNormalizer import FunctionNormalizer
import json
from neural_network.SAFEEmbedder import SAFEEmbedder
import numpy as np
import sqlite3
from tqdm import tqdm
class FunctionsEmbedder:
def __init__(self, model, batch_size, max_instruction):
self.batch_size = batch_size
self.normalizer = FunctionNormalizer(max_instruction)
self.safe = SAFEEmbedder(model)
self.safe.loadmodel()
self.safe.get_tensor()
def compute_embeddings(self, functions):
functions, lengths = self.normalizer.normalize_functions(functions)
embeddings = self.safe.embedd(functions, lengths)
return embeddings
@staticmethod
def create_table(db_name, table_name):
conn = sqlite3.connect(db_name)
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS {} (id INTEGER PRIMARY KEY, {} TEXT)".format(table_name, table_name))
conn.commit()
conn.close()
def compute_and_save_embeddings_from_db(self, db_name, table_name):
FunctionsEmbedder.create_table(db_name, table_name)
conn = sqlite3.connect(db_name)
cur = conn.cursor()
q = cur.execute("SELECT id FROM functions WHERE id not in (SELECT id from {})".format(table_name))
ids = q.fetchall()
for i in tqdm(range(0, len(ids), self.batch_size)):
functions = []
batch_ids = ids[i:i+self.batch_size]
for my_id in batch_ids:
q = cur.execute("SELECT instructions_list FROM filtered_functions where id=?", my_id)
functions.append(json.loads(q.fetchone()[0]))
embeddings = self.compute_embeddings(functions)
for l, id in enumerate(batch_ids):
cur.execute("INSERT INTO {} VALUES (?,?)".format(table_name), (id[0], np.array2string(embeddings[l])))
conn.commit()
================================================
FILE: dataset_creation/__init__.py
================================================
================================================
FILE: dataset_creation/convertDB.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import sqlite3
import json
from networkx.readwrite import json_graph
import logging
from tqdm import tqdm
from asm_embedding.InstructionsConverter import InstructionsConverter
# Create the db where data are stored
def create_db(db_name):
print('Database creation...')
conn = sqlite3.connect(db_name)
conn.execute(''' CREATE TABLE IF NOT EXISTS functions (id INTEGER PRIMARY KEY,
project text,
compiler text,
optimization text,
file_name text,
function_name text,
asm text,
num_instructions INTEGER)
''')
conn.execute('''CREATE TABLE IF NOT EXISTS filtered_functions (id INTEGER PRIMARY KEY,
instructions_list text)
''')
conn.commit()
conn.close()
def reverse_graph(cfg, lstm_cfg):
instructions = []
asm = ""
node_addr = list(cfg.nodes())
node_addr.sort()
nodes = cfg.nodes(data=True)
lstm_nodes = lstm_cfg.nodes(data=True)
for addr in node_addr:
a = nodes[addr]["asm"]
if a is not None:
asm += a
instructions.extend(lstm_nodes[addr]['features'])
return instructions, asm
def copy_split(old_cur, new_cur, table):
q = old_cur.execute("SELECT id FROM {}".format(table))
iii = q.fetchall()
print("Copying table {}".format(table))
for ii in tqdm(iii):
new_cur.execute("INSERT INTO {} VALUES (?)".format(table), ii)
def copy_table(old_cur, new_cur, table_old, table_new):
q = old_cur.execute("SELECT * FROM {}".format(table_old))
iii = q.fetchall()
print("Copying table {} to {}".format(table_old, table_new))
for ii in tqdm(iii):
new_cur.execute("INSERT INTO {} VALUES (?,?,?)".format(table_new), ii)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
db = "/home/lucamassarelli/binary_similarity_data/databases/big_dataset_X86.db"
new_db = "/home/lucamassarelli/binary_similarity_data/new_databases/big_dataset_X86_new.db"
create_db(new_db)
conn_old = sqlite3.connect(db)
conn_new = sqlite3.connect(new_db)
cur_old = conn_old.cursor()
cur_new = conn_new.cursor()
q = cur_old.execute("SELECT id FROM functions")
ids = q.fetchall()
# InstructionsConverter requires the i2v dictionary path (same file used in DatabaseFactory)
converter = InstructionsConverter("data/i2v/word2id.json")
for my_id in tqdm(ids):
q0 = cur_old.execute("SELECT id, project, compiler, optimization, file_name, function_name, cfg FROM functions WHERE id=?", my_id)
meta = q0.fetchone()
q1 = cur_old.execute("SELECT lstm_cfg FROM lstm_cfg WHERE id=?", my_id)
cfg = json_graph.adjacency_graph(json.loads(meta[6]))
lstm_cfg = json_graph.adjacency_graph(json.loads(q1.fetchone()[0]))
instructions, asm = reverse_graph(cfg, lstm_cfg)
values = meta[0:6] + (asm, len(instructions))
q_n = cur_new.execute("INSERT INTO functions VALUES (?,?,?,?,?,?,?,?)", values)
converted_instruction = json.dumps(converter.convert_to_ids(instructions))
q_n = cur_new.execute("INSERT INTO filtered_functions VALUES (?,?)", (my_id[0], converted_instruction))
conn_new.commit()
cur_new.execute("CREATE TABLE train (id INTEGER PRIMARY KEY) ")
cur_new.execute("CREATE TABLE validation (id INTEGER PRIMARY KEY) ")
cur_new.execute("CREATE TABLE test (id INTEGER PRIMARY KEY) ")
conn_new.commit()
copy_split(cur_old, cur_new, "train")
conn_new.commit()
copy_split(cur_old, cur_new, "validation")
conn_new.commit()
copy_split(cur_old, cur_new, "test")
conn_new.commit()
cur_new.execute("CREATE TABLE train_pairs (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)")
cur_new.execute("CREATE TABLE validation_pairs (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)")
cur_new.execute("CREATE TABLE test_pairs (id INTEGER PRIMARY KEY, true_pair TEXT, false_pair TEXT)")
conn_new.commit()
copy_table(cur_old, cur_new, "train_couples", "train_pairs")
conn_new.commit()
copy_table(cur_old, cur_new, "validation_couples", "validation_pairs")
conn_new.commit()
copy_table(cur_old, cur_new, "test_couples", "test_pairs")
conn_new.commit()
conn_new.close()
================================================
FILE: download_model.sh
================================================
#!/usr/bin/env bash
python3 downloader.py -b
echo 'Model downloaded and, hopefully, ready to run'
================================================
FILE: downloader.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import argparse
import os
import sys
from subprocess import call
class Downloader:
def __init__(self):
parser = argparse.ArgumentParser(description='SAFE downloader')
parser.add_argument("-m", "--model", dest="model", help="Download the trained SAFE model for x86",
action="store_true",
required=False)
parser.add_argument("-i2v", "--i2v", dest="i2v", help="Download the i2v dictionary and embedding matrix",
action="store_true",
required=False)
parser.add_argument("-b", "--bundle", dest="bundle",
help="Download all files necessary to run the model",
action="store_true",
required=False)
parser.add_argument("-td", "--train_data", dest="train_data",
help="Download the files necessary to train the model (It takes a lot of space!)",
action="store_true",
required=False)
args = parser.parse_args()
self.download_model = (args.model or args.bundle)
self.download_i2v = (args.i2v or args.bundle)
self.download_train = args.train_data
if not (self.download_model or self.download_i2v or self.download_train):
parser.print_help(sys.__stdout__)
self.url_model = "https://drive.google.com/file/d/1Kwl8Jy-g9DXe1AUjUZDhJpjRlDkB4NBs/view?usp=sharing"
self.url_i2v = "https://drive.google.com/file/d/1CqJVGYbLDEuJmJV6KH4Dzzhy-G12GjGP"
self.url_train = ['https://drive.google.com/file/d/1sNahtLTfZY5cxPaYDUjqkPTK0naZ45SH/view?usp=sharing','https://drive.google.com/file/d/16D5AVDux_Q8pCVIyvaMuiL2cw2V6gtLc/view?usp=sharing','https://drive.google.com/file/d/1cBRda8fYdqHtzLwstViuwK6U5IVHad1N/view?usp=sharing']
self.train_name = ['AMD64ARMOpenSSL.tar.bz2','AMD64multipleCompilers.tar.bz2','AMD64PostgreSQL.tar.bz2']
self.base_path = "data"
self.path_i2v = os.path.join(self.base_path, "")
self.path_model = os.path.join(self.base_path, "")
self.path_train_data = os.path.join(self.base_path, "")
self.i2v_compress_name='i2v.tar.bz2'
self.model_compress_name='model.tar.bz2'
self.datasets_compress_name='safe.pb'
@staticmethod
def download_file(id,path):
try:
print("Downloading from "+ str(id) +" into "+str(path))
call(['./godown.pl',id,path])
except Exception as e:
print("Error downloading file at url:" + str(id))
print(e)
@staticmethod
def decompress_file(file_src,file_path):
try:
call(['tar','-xvf',file_src,'-C',file_path])
except Exception as e:
print("Error decompressing file:" + str(file_src))
print('you need the tar command and bzip2 support')
print(e)
def download(self):
print('Making the godown.pl script executable, thanks to '+str('https://github.com/circulosmeos/gdown.pl'))
call(['chmod', '+x','godown.pl'])
print("SAFE --- downloading models")
if self.download_i2v:
print("Downloading i2v model.... in the folder data/i2v/")
if not os.path.exists(self.path_i2v):
os.makedirs(self.path_i2v)
Downloader.download_file(self.url_i2v, os.path.join(self.path_i2v,self.i2v_compress_name))
print("Decompressing i2v model and placing in" + str(self.path_i2v))
Downloader.decompress_file(os.path.join(self.path_i2v,self.i2v_compress_name),self.path_i2v)
if self.download_model:
print("Downloading the SAFE model... in the folder data")
if not os.path.exists(self.path_model):
os.makedirs(self.path_model)
Downloader.download_file(self.url_model, os.path.join(self.path_model,self.datasets_compress_name))
#print("Decompressing SAFE model and placing in" + str(self.path_model))
#Downloader.decompress_file(os.path.join(self.path_model,self.model_compress_name),self.path_model)
if self.download_train:
print("Downloading the train data.... in the folder data")
if not os.path.exists(self.path_train_data):
os.makedirs(self.path_train_data)
for i,x in enumerate(self.url_train):
print("Downloading dataset "+str(self.train_name[i]))
Downloader.download_file(x, os.path.join(self.path_train_data,self.train_name[i]))
#print("Decompressing the train data and placing in" + str(self.path_train_data))
#Downloader.decompress_file(os.path.join(self.path_train_data,self.datasets_compress_name),self.path_train_data)
if __name__=='__main__':
a=Downloader()
a.download()
================================================
FILE: function_search/EvaluateSearchEngine.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
from FunctionSearchEngine import FunctionSearchEngine
from sklearn import metrics
import sqlite3
from multiprocessing import Process
import math
import warnings
import random
import json
class SearchEngineEvaluator:
def __init__(self, db_name, table, limit=None,k=None):
self.tables = table
self.db_name = db_name
self.SE = FunctionSearchEngine(db_name, table, limit=limit)
self.k=k
self.number_similar={}
def do_search(self, target_db_name, target_fcn_ids):
target = self.SE.load_target(target_db_name, target_fcn_ids)
self.SE.pp_search(target, 50)
def calc_auc(self, target_db_name, target_fcn_ids):
self.SE.load_target(target_db_name, target_fcn_ids)
result = self.SE.auc()
print(result)
#
# This method searches for all target functions in the DB; in our test we take num functions compiled with compiler and opt.
# Moreover, it populates the self.number_similar dictionary, which contains the number of similar functions for each target.
#
def find_target_fcn(self, compiler, opt, num):
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
q = cur.execute("SELECT id, project, file_name, function_name FROM functions WHERE compiler=? AND optimization=?", (compiler, opt))
res = q.fetchall()
ids = [i[0] for i in res]
true_labels = [l[1]+"/"+l[2]+"/"+l[3] for l in res]
n_ids = []
n_true_labels = []
num = min(num, len(ids))
for i in range(0, num):
index = random.randrange(len(ids))
n_ids.append(ids[index])
n_true_labels.append(true_labels[index])
f_name=true_labels[index].split('/')[2]
fi_name=true_labels[index].split('/')[1]
q = cur.execute("SELECT num FROM count_func WHERE file_name=? AND function_name=?", (fi_name, f_name))
f = q.fetchone()
# use a separate name instead of clobbering the num parameter
if f is not None:
n_similar = int(f[0])
else:
n_similar = 0
self.number_similar[true_labels[index]] = n_similar
return n_ids, n_true_labels
@staticmethod
def functions_ground_truth(labels, indices, values, true_label):
y_true = []
y_score = []
for i, e in enumerate(indices):
y_score.append(float(values[i]))
l = labels[e]
if l == true_label:
y_true.append(1)
else:
y_true.append(0)
return y_true, y_score
# This method executes the test.
# It selects the target functions and looks them up in the entire db.
# The outcome is a json file containing the top 200 most similar functions for each target.
# The json file is an array with one entry per target function.
# Each entry is a triple (t0, t1, t2):
# t0: an array that contains 1 at entry j if result j is similar to the target, 0 otherwise
# t1: the number of functions similar to the target in the whole db
# t2: an array that at entry j contains the similarity score of the j-th most similar function to the target
#
#
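# e.g. one entry with hypothetical values: ([1, 0, 1], 12, [0.99, 0.42, 0.41])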
def evaluate_precision_on_all_functions(self, compiler, opt):
target_fcn_ids, true_labels = self.find_target_fcn(compiler, opt, 10000)
batch = 1000
labels = self.SE.trunc_labels
info=[]
for i in range(0, len(target_fcn_ids), batch):
if i + batch > len(target_fcn_ids):
batch = len(target_fcn_ids) - i
target = self.SE.load_target(self.db_name, target_fcn_ids[i:i+batch])
top_k = self.SE.top_k(target, self.k)
for j in range(0, batch):
a, b = SearchEngineEvaluator.functions_ground_truth(labels, top_k.indices[j, :], top_k.values[j, :], true_labels[i+j])
info.append((a,self.number_similar[true_labels[i + j]],b))
with open(compiler+'_'+opt+'_'+self.tables+'_top200.json', 'w') as outfile:
json.dump(info, outfile)
def test(dbName, table, opt,x,k):
print("k:{} - Table: {} - Opt: {}".format(k,table, opt))
SEV = SearchEngineEvaluator(dbName, table, limit=2000000,k=k)
SEV.evaluate_precision_on_all_functions(x, opt)
print("-------------------------------------")
if __name__ == '__main__':
random.seed(12345)
dbName = '../data/AMD64PostgreSQL.db'
table = ['safe_embeddings']
opt = ["O0", "O1", "O2", "O3"]
for x in ['gcc-4.8',"clang-4.0",'gcc-7','clang-6.0']:
for t in table:
for o in opt:
p = Process(target=test, args=(dbName, t, o,x,200))
p.start()
p.join()
================================================
FILE: function_search/FunctionSearchEngine.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import sys
import numpy as np
import sqlite3
import pandas as pd
import tqdm
import tensorflow as tf
if sys.version_info >= (3, 0):
from functools import reduce
pd.set_option('display.max_column',None)
pd.set_option('display.max_rows',None)
pd.set_option('display.max_seq_items',None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('expand_frame_repr', True)
class TopK:
#
# This class computes the similarities between the targets and the list of functions on which we are searching.
# This is done by using matrices multiplication and top_k of tensorflow
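# Note: the scores are raw dot products; they coincide with cosine similarity
# only when the stored embeddings are L2-normalized.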
def __init__(self):
self.graph=tf.Graph()
def loads_embeddings_SE(self, lista_embeddings):
with self.graph.as_default():
tf.set_random_seed(1234)
dim = lista_embeddings[0].shape[0]
ll = np.asarray(lista_embeddings)
self.matrix = tf.constant(ll, name='matrix_embeddings', dtype=tf.float32)
self.target = tf.placeholder("float", [None, dim], name='target_embedding')
self.sim = tf.matmul(self.target, self.matrix, transpose_b=True, name="embeddings_similarities")
self.k = tf.placeholder(tf.int32, shape=(), name='k')
self.top_k = tf.nn.top_k(self.sim, self.k, sorted=True)
self.session = tf.Session()
def topK(self, k, target):
with self.graph.as_default():
tf.set_random_seed(1234)
return self.session.run(self.top_k, {self.target: target, self.k: int(k)})
class FunctionSearchEngine:
def __init__(self, db_name, table_name, limit=None):
self.s2v = TopK()
self.db_name = db_name
self.table_name = table_name
self.labels = []
self.trunc_labels = []
self.lista_embedding = []
self.ids = []
self.n_similar=[]
self.ret = {}
self.precision = None
print("Query for ids")
conn = sqlite3.connect(db_name)
cur = conn.cursor()
if limit is None:
q = cur.execute("SELECT id, project, compiler, optimization, file_name, function_name FROM functions")
res = q.fetchall()
else:
q = cur.execute("SELECT id, project, compiler, optimization, file_name, function_name FROM functions LIMIT {}".format(limit))
res = q.fetchall()
for item in tqdm.tqdm(res, total=len(res)):
q = cur.execute("SELECT " + self.table_name + " FROM " + self.table_name + " WHERE id=?", (item[0],))
e = q.fetchone()
if e is None:
continue
self.lista_embedding.append(self.embeddingToNp(e[0]))
element = "{}/{}/{}".format(item[1], item[4], item[5])
self.trunc_labels.append(element)
element = "{}@{}/{}/{}/{}".format(item[5], item[1], item[2], item[3], item[4])
self.labels.append(element)
self.ids.append(item[0])
conn.close()
self.s2v.loads_embeddings_SE(self.lista_embedding)
self.num_funcs = len(self.lista_embedding)
def load_target(self, target_db_name, target_fcn_ids, calc_mean=False):
conn = sqlite3.connect(target_db_name)
cur = conn.cursor()
mean = None
for id in target_fcn_ids:
if target_db_name == self.db_name and id in self.ids:
idx = self.ids.index(id)
e = self.lista_embedding[idx]
else:
q = cur.execute("SELECT " + self.table_name + " FROM " + self.table_name + " WHERE id=?", (id,))
e = q.fetchone()
e = self.embeddingToNp(e[0])
if mean is None:
mean = e.reshape([e.shape[0], 1])
else:
mean = np.hstack((mean, e.reshape(e.shape[0], 1)))
if calc_mean:
target = [np.mean(mean, axis=1)]
else:
target = mean.T
return target
def embeddingToNp(self, e):
e = e.replace('\n', '')
e = e.replace('[', '')
e = e.replace(']', '')
emb = np.fromstring(e, dtype=float, sep=' ')
return emb
def top_k(self, target, k=None):
if k is not None:
top_k = self.s2v.topK(k, target)
else:
top_k = self.s2v.topK(len(self.lista_embedding), target)
return top_k
def pp_search(self, target, k):
# TopK.topK needs both k and the target embeddings
result = pd.DataFrame(columns=['Id', 'Name', 'Score'])
top_k = self.s2v.topK(k, target)
for i, e in enumerate(top_k.indices[0]):
result = result.append({'Id': self.ids[e], 'Name': self.labels[e], 'Score': top_k.values[0][i]}, ignore_index=True)
print(result)
def search(self, target, k):
result = []
top_k = self.s2v.topK(k, target)
for i, e in enumerate(top_k.indices[0]):
result.append({'Id': self.ids[e], 'Name': self.labels[e], 'Score': top_k.values[0][i]})
return result
================================================
FILE: function_search/__init__.py
================================================
================================================
FILE: function_search/fromJsonSearchToPlot.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
import matplotlib.pyplot as plt
import json
import math
import numpy as np
from multiprocessing import Pool
def find_dcg(element_list):
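# discounted cumulative gain: sum over zero-based ranks j of rel_j / ln(j + 2)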
dcg_score = 0.0
for j, sim in enumerate(element_list):
dcg_score += float(sim) / math.log(j + 2)
return dcg_score
def count_ones(element_list):
return len([x for x in element_list if x == 1])
def extract_info(file_1):
with open(file_1, 'r') as f:
data1 = json.load(f)
performance1 = []
average_recall_k1 = []
precision_at_k1 = []
for f_index in range(0, len(data1)):
f1 = data1[f_index][0]
pf1 = data1[f_index][1]
tp1 = []
recall_p1 = []
precision_p1 = []
# we start from 1 to remove ourselves
for k in range(1, 200):
cut1 = f1[0:k]
dcg1 = find_dcg(cut1)
ideal1 = find_dcg(([1] * (pf1) + [0] * (k - pf1))[0:k])
p1k = float(count_ones(cut1))
tp1.append(dcg1 / ideal1)
recall_p1.append(p1k / pf1)
precision_p1.append(p1k / k)
performance1.append(tp1)
average_recall_k1.append(recall_p1)
precision_at_k1.append(precision_p1)
avg_p1 = np.average(performance1, axis=0)
avg_p10 = np.average(average_recall_k1, axis=0)
average_precision = np.average(precision_at_k1, axis=0)
return avg_p1, avg_p10, average_precision
def print_graph(info1, file_name, label_y, title_1, p):
fig, ax = plt.subplots()
ax.plot(range(0, len(info1)), info1, color='b', label=title_1)
ax.legend(loc=p, shadow=True, fontsize='x-large')
plt.xlabel("Number of Nearest Results")
plt.ylabel(label_y)
fname = file_name
plt.savefig(fname)
plt.close(fig)
def compare_and_print(file):
filename = file.split('_')[0] + '_' + file.split('_')[1]
t_short = filename
label_1 = t_short + '_' + file.split('_')[3]
avg_p1, recall_p1, precision1 = extract_info(file)
fname = filename + '_nDCG.pdf'
print_graph(avg_p1, fname, 'nDCG', label_1, 'upper right')
fname = filename + '_recall.pdf'
print_graph(recall_p1, fname, 'Recall', label_1, 'lower right')
fname = filename + '_precision.pdf'
print_graph(precision1, fname, 'Precision', label_1, 'upper right')
return avg_p1, recall_p1, precision1
e1 = 'embeddings_safe'
opt = ['O0', 'O1', 'O2', 'O3']
compilers = ['gcc-7', 'gcc-4.8', 'clang-6.0', 'clang-4.0']
values = []
for o in opt:
for c in compilers:
f0 = '' + c + '_' + o + '_' + e1 + '_top200.json'
values.append(f0)
p = Pool(4)
result = p.map(compare_and_print, values)
avg_p1 = []
recal_p1 = []
pre_p1 = []
for t in result:
avg_p1.append(t[0])
recal_p1.append(t[1])
pre_p1.append(t[2])
avg_p1 = np.average(avg_p1, axis=0)
recal_p1 = np.average(recal_p1, axis=0)
pre_p1 = np.average(pre_p1, axis=0)
print_graph(avg_p1[0:20], 'nDCG.pdf', 'normalized DCG', 'SAFE', 'upper right')
print_graph(recal_p1, 'recall.pdf', 'recall', 'SAFE', 'lower right')
print_graph(pre_p1[0:20], 'precision.pdf', 'precision', 'SAFE', 'upper right')
================================================
FILE: godown.pl
================================================
#!/usr/bin/env perl
#
# Google Drive direct download of big files
# ./gdown.pl 'gdrive file url' ['desired file name']
#
# v1.0 by circulosmeos 04-2014.
# v1.1 by circulosmeos 01-2017.
# http://circulosmeos.wordpress.com/2014/04/12/google-drive-direct-download-of-big-files
# Distributed under GPL 3 (http://www.gnu.org/licenses/gpl-3.0.html)
#
use strict;
my $TEMP='gdown.cookie.temp';
my $COMMAND;
my $confirm;
my $check;
sub execute_command();
my $URL=shift;
die "\n./gdown.pl 'gdrive file url' [desired file name]\n\n" if $URL eq '';
my $FILENAME=shift;
$FILENAME='gdown' if $FILENAME eq '';
if ($URL=~m#^https?://drive.google.com/file/d/([^/]+)#) {
$URL="https://docs.google.com/uc?id=$1&export=download";
}
execute_command();
while (-s $FILENAME < 100000) { # only if the file isn't the download yet
open fFILENAME, '<', $FILENAME;
$check=0;
foreach (<fFILENAME>) {
if (/href="(\/uc\?export=download[^"]+)/) {
$URL='https://docs.google.com'.$1;
$URL=~s/&amp;/&/g;
$confirm='';
$check=1;
last;
}
if (/confirm=([^;&]+)/) {
$confirm=$1;
$check=1;
last;
}
if (/"downloadUrl":"([^"]+)/) {
$URL=$1;
$URL=~s/\\u003d/=/g;
$URL=~s/\\u0026/&/g;
$confirm='';
$check=1;
last;
}
}
close fFILENAME;
die "Couldn't download the file :-(\n" if ($check==0);
$URL=~s/confirm=([^;&]+)/confirm=$confirm/ if $confirm ne '';
execute_command();
}
unlink $TEMP;
sub execute_command() {
$COMMAND="wget --no-check-certificate --load-cookie $TEMP --save-cookie $TEMP \"$URL\"";
$COMMAND.=" -O \"$FILENAME\"" if $FILENAME ne '';
`$COMMAND`;
return 1;
}
================================================
FILE: helloworld.c
================================================
#include "stdio.h"
int main(){
printf("hello world");
int a=10;
int b=20;
printf("%d",a+b);
}
================================================
FILE: index.md
================================================
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults
layout: home
---
<div style="text-align:center"><img src="img/safe2.jpg" /></div>
What is SAf(E)?
-------------
**SAFE** is a **S**elf-**A**ttentive neural network that takes as input a binary **F**unction and creates an **E**mbedding.
What is an embedding?
-------------
An embedding is a vector of real numbers. The nice feature of SAFE embeddings is that two similar binary functions should generate two embeddings
that are close in the metric space.
<div style="text-align:center"><img src="img/metric.png" /></div>
I want to know all the details!
-------------
Good, read our paper on [arXiv](https://arxiv.org/abs/1811.05296).
The paper is slightly amusing! How do I get SAFE?
-------------
SAFE is available in our [GitHub](https://github.com/gadiluna/SAFE) repository. Keep in mind that SAFE has been developed as a research project. We only provide a minimal working proof-of-concept,
with the code and data to replicate our experiments. We are not responsible for any self-harm episode correlated with reading our (sometimes badly written) code.
How can I get involved with SAFE?
-------------
If you are interested in this project write us an email.
-------------
SAFE has been designed and developed by:
<div style="text-align:left"><img src="img/2.jpeg" /></div>
* [Luca Massarelli](https://scholar.google.it/citations?user=mJ_QjZIAAAAJ&hl=it) (development and research)
<div style="text-align:left"><img src="img/1.jpeg" /></div>
* [Giuseppe Antonio Di Luna](https://scholar.google.it/citations?hl=it&user=RgAfuVgAAAAJ&view_op=list_works&sortby=pubdate) (development and research)
<div style="text-align:left"><img src="img/3.jpeg" /></div>
* [Fabio Petroni](https://scholar.google.it/citations?user=vxQc2L4AAAAJ&hl=it) (development and research)
<div style="text-align:left"><img src="img/4.jpeg" /></div>
* [Leonardo Querzoni](https://scholar.google.it/citations?user=-_WFIJIAAAAJ&hl=it) (research)
<div style="text-align:left"><img src="img/5.jpeg" /></div>
* [Roberto Baldoni](https://scholar.google.it/citations?user=82tR6VoAAAAJ&hl=it) (research)
#### **Acknowledgments**:
We are in debt with Google for providing free access to its cloud computing platform through the Education Program. Moreover, the authors would like to thank NVIDIA Corporation for partially supporting this work through the donation of a GPGPU card used during prototype development.
This work is supported by a grant of the Italian Presidency of the Council of Ministers and by the CINI (Consorzio Interuniversitario Nazionale Informatica) National Laboratory of Cyber Security.
Finally, we thank Davide Italiano for the insightful discussions.
SAFE License.
-------
# SAFE TEAM
# GPL 3 License http://www.gnu.org/licenses/
================================================
FILE: neural_network/PairFactory.py
================================================
# SAFE TEAM
# distributed under license: GPL 3 License http://www.gnu.org/licenses/
import sqlite3
import json
import numpy as np
from multiprocessing import Queue
from multiprocessing import Process
from asm_embedding.FunctionNormalizer import FunctionNormalizer
#
# PairFactory class, used for training the SAFE network.
# This class generates the pairs for the training, test and validation sets
#
#
# Authors: SAFE team
class PairFactory:
def __init__(self, db_name, dataset_type, batch_size, max_instructions, shuffle=True):
self.db_name = db_name
self.dataset_type = dataset_type
self.max_instructions = max_instructions
self.batch_dim = 0
self.num_pairs = 0
self.num_batches = 0
self.batch_size = batch_size
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
q = cur.execute("SELECT true_pair from " + self.dataset_type + " WHERE id=?", (0,))
self.num_pairs=len(json.loads(q.fetchone()[0]))*2
n_chunk = int(self.num_pairs / self.batch_size) - 1
conn.close()
self.num_batches = n_chunk
self.shuffle = shuffle
@staticmethod
def split( a, n):
return [a[i::n] for i in range(n)]
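# note: split() distributes chunk indices round-robin across n workers,
# e.g. split(list(range(10)), 3) -> [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]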
@staticmethod
def truncate_and_compute_lengths(pairs, max_instructions):
lenghts = []
new_pairs=[]
for x in pairs:
f0 = np.asarray(x[0][0:max_instructions])
f1 = np.asarray(x[1][0:max_instructions])
lenghts.append((f0.shape[0], f1.shape[0]))
if f0.shape[0] < max_instructions:
f0 = np.pad(f0, (0, max_instructions - f0.shape[0]), mode='constant')
if f1.shape[0] < max_instructions:
f1 = np.pad(f1, (0, max_instructions - f1.shape[0]), mode='constant')
new_pairs.append((f0, f1))
return new_pairs, lenghts
def async_chunker(self, epoch):
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
query_string = "SELECT true_pair,false_pair from {} where id=?".format(self.dataset_type)
q = cur.execute(query_string, (int(epoch),))
true_pairs_id, false_pairs_id = q.fetchone()
true_pairs_id = json.loads(true_pairs_id)
false_pairs_id = json.loads(false_pairs_id)
assert len(true_pairs_id) == len(false_pairs_id)
data_len = len(true_pairs_id)
# print("Data Len: " + str(data_len))
conn.close()
n_chunk = int(data_len / (self.batch_size / 2)) - 1
lista_chunk = range(0, n_chunk)
coda = Queue(maxsize=50)
n_proc = 8 # modify this to increase the parallelism of the db loading; from our tests, 8-10 is the sweet spot on a 16-core machine with a K80
listone = PairFactory.split(lista_chunk, n_proc)
# this ugly workaround is somehow needed: Pool behaves oddly when TF is loaded.
for i in range(0, n_proc):
p = Process(target=self.async_create_couple, args=((epoch, listone[i], coda)))
p.start()
for i in range(0, n_chunk):
yield self.async_get_dataset(coda)
def get_pair_fromdb(self, id_1, id_2):
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
q0 = cur.execute("SELECT instructions_list FROM filtered_functions WHERE id=?", (id_1,))
f0 = json.loads(q0.fetchone()[0])
q1 = cur.execute("SELECT instructions_list FROM filtered_functions WHERE id=?", (id_2,))
f1 = json.loads(q1.fetchone()[0])
conn.close()
return f0, f1
def get_couple_from_db(self, epoch_number, chunk):
conn = sqlite3.connect(self.db_name)
cur = conn.cursor()
pairs = []
labels = []
q = cur.execute("SELECT true_pair, false_pair from " + self.dataset_type + " WHERE id=?", (int(epoch_number),))
true_pairs_id, false_pairs_id = q.fetchone()
true_pairs_id = json.loads(true_pairs_id)
false_pairs_id = json.loads(false_pairs_id)
conn.close()
data_len = len(true_pairs_id)
i = 0
normalizer = FunctionNormalizer(self.max_instructions)
while i < self.batch_size:
if chunk * int(self.batch_size / 2) + i > data_len:
break
p = true_pairs_id[chunk * int(self.batch_size / 2) + i]
f0, f1 = self.get_pair_fromdb(p[0], p[1])
pairs.append((f0, f1))
labels.append(+1)
p = false_pairs_id[chunk * int(self.batch_size / 2) + i]
f0, f1 = self.get_pair_fromdb(p[0], p[1])
pairs.append((f0, f1))
labels.append(-1)
i += 2
pairs, lengths = normalizer.normalize_function_pairs(pairs)
function1, function2 = zip(*pairs)
len1, len2 = zip(*lengths)
n_samples = len(pairs)
if self.shuffle:
shuffle_indices = np.random.permutation(np.arange(n_samples))
function1 = np.array(function1)[shuffle_indices]
function2 = np.array(function2)[shuffle_indices]
len1 = np.array(len1)[shuffle_indices]
len2 = np.array(len2)[shuffle_indices]
labels = np.array(labels)[shuffle_indices]
else:
function1=np.array(function1)
function2=np.array(function2)
len1=np.array(len1)
len2=np.array(len2)
labels=np.array(labels)
upper_bound = min(self.batch_size, n_samples)
len1 = len1[0:upper_bound]
len2 = len2[0:upper_bound]
function1 = function1[0:upper_bound]
function2 = function2[0:upper_bound]
y_ = labels[0:upper_bound]
return function1, function2, len1, len2, y_
def async_create_couple(self, epoch,n_chunk,q):
for i in n_chunk:
function1, function2, len1, len2, y_ = self.get_couple_from_db(epoch, i)
q.put((function1, function2, len1, len2, y_), block=True)
def async_get_dataset(self, q):
item = q.get()
function1 = item[0]
function2 = item[1]
len1 = item[2]
len2 = item[3]
y_ = item[4]
assert (len(function1) == len(y_))
n_samples = len(function1)
self.batch_dim = n_samples
#self.num_pairs += n_samples
return function1, function2, len1, len2, y_
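# --- Usage sketch (not part of the original file) ---
# A minimal, hypothetical example of how this factory is driven; the db path
# and sizes are illustrative, see SAFE_model.train() for the real call site:
#
#   factory = PairFactory("dataset.db", "train_pairs", batch_size=250, max_instructions=150)
#   for f1, f2, len1, len2, y in factory.async_chunker(epoch=0):
#       pass  # feed the batch to the siamese network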
================================================
FILE: neural_network/SAFEEmbedder.py
================================================
import tensorflow as tf
# SAFE TEAM
# distributed under license: GPL 3 License http://www.gnu.org/licenses/
class SAFEEmbedder:
def __init__(self, model_file):
self.model_file = model_file
self.session = None
self.x_1 = None
self.adj_1 = None
self.len_1 = None
self.emb = None
def loadmodel(self):
with tf.gfile.GFile(self.model_file, "rb") as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as graph:
tf.import_graph_def(graph_def)
sess = tf.Session(graph=graph)
self.session = sess
return sess
def get_tensor(self):
self.x_1 = self.session.graph.get_tensor_by_name("import/x_1:0")
self.len_1 = self.session.graph.get_tensor_by_name("import/lengths_1:0")
self.emb = tf.nn.l2_normalize(self.session.graph.get_tensor_by_name('import/Embedding1/dense/BiasAdd:0'), axis=1)
def embedd(self, nodi_input, lengths_input):
out_embedding= self.session.run(self.emb, feed_dict = {
self.x_1: nodi_input,
self.len_1: lengths_input})
return out_embedding
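# --- Usage sketch (not part of the original file) ---
# Hypothetical example, assuming a frozen model file "safe.pb" and a batch of
# already-converted, padded instruction ids (see safe.py for the full pipeline):
#
#   embedder = SAFEEmbedder("safe.pb")
#   embedder.loadmodel()
#   embedder.get_tensor()
#   vectors = embedder.embedd(instruction_ids, lengths)  # one embedding per function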
================================================
FILE: neural_network/SAFE_model.py
================================================
# SAFE TEAM
# distributed under license: GPL 3 License http://www.gnu.org/licenses/
from SiameseSAFE import SiameseSelfAttentive
from PairFactory import PairFactory
import tensorflow as tf
import random
import sys, os
import numpy as np
from sklearn import metrics
import matplotlib
import tqdm
matplotlib.use('Agg')
import matplotlib.pyplot as plt
class modelSAFE:
def __init__(self, flags, embedding_matrix):
self.embedding_size = flags.embedding_size
self.num_epochs = flags.num_epochs
self.learning_rate = flags.learning_rate
self.l2_reg_lambda = flags.l2_reg_lambda
self.num_checkpoints = flags.num_checkpoints
self.logdir = flags.logdir
self.logger = flags.logger
self.seed = flags.seed
self.batch_size = flags.batch_size
self.max_instructions = flags.max_instructions
self.embeddings_matrix = embedding_matrix
self.session = None
self.db_name = flags.db_name
self.trainable_embeddings = flags.trainable_embeddings
self.cross_val = flags.cross_val
self.attention_hops = flags.attention_hops
self.attention_depth = flags.attention_depth
self.dense_layer_size = flags.dense_layer_size
self.rnn_state_size = flags.rnn_state_size
random.seed(self.seed)
np.random.seed(self.seed)
print(self.db_name)
# Loads a usable model.
# Returns the network and a TensorFlow session in which the network can be used.
@staticmethod
def load_model(path):
session = tf.Session()
checkpoint_dir = os.path.abspath(os.path.join(path, "checkpoints"))
saver = tf.train.import_meta_graph(os.path.join(checkpoint_dir, "model.meta"))
tf.global_variables_initializer().run(session=session)
saver.restore(session, os.path.join(checkpoint_dir, "model"))
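# Note: the constructor arguments below are dummies; restore_model() re-binds
# every tensor from the restored graph, so these values are never used to
# build new layers.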
network = SiameseSelfAttentive(
rnn_state_size=1,
learning_rate=1,
l2_reg_lambda=1,
batch_size=1,
max_instructions=1,
embedding_matrix=1,
trainable_embeddings=1,
attention_hops=1,
attention_depth=1,
dense_layer_size=1,
embedding_size=1
)
network.restore_model(session)
return session, network
def create_network(self):
self.network = SiameseSelfAttentive(
rnn_state_size=self.rnn_state_size,
learning_rate=self.learning_rate,
l2_reg_lambda=self.l2_reg_lambda,
batch_size=self.batch_size,
max_instructions=self.max_instructions,
embedding_matrix=self.embeddings_matrix,
trainable_embeddings=self.trainable_embeddings,
attention_hops=self.attention_hops,
attention_depth=self.attention_depth,
dense_layer_size=self.dense_layer_size,
embedding_size=self.embedding_size
)
def train(self):
tf.reset_default_graph()
with tf.Graph().as_default() as g:
session_conf = tf.ConfigProto(
allow_soft_placement=True,
log_device_placement=False
)
sess = tf.Session(config=session_conf)
# Sets the graph-level random seed.
tf.set_random_seed(self.seed)
self.create_network()
self.network.generate_new_safe()
# --tbrtr
# Initialize all variables
sess.run(tf.global_variables_initializer())
# TensorBoard
# Summaries for loss and accuracy
loss_summary = tf.summary.scalar("loss", self.network.loss)
# Train Summaries
train_summary_op = tf.summary.merge([loss_summary])
train_summary_dir = os.path.join(self.logdir, "summaries", "train")
train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
# Validation summaries
val_summary_op = tf.summary.merge([loss_summary])
val_summary_dir = os.path.join(self.logdir, "summaries", "validation")
val_summary_writer = tf.summary.FileWriter(val_summary_dir, sess.graph)
# Test summaries
test_summary_op = tf.summary.merge([loss_summary])
test_summary_dir = os.path.join(self.logdir, "summaries", "test")
test_summary_writer = tf.summary.FileWriter(test_summary_dir, sess.graph)
# Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
checkpoint_dir = os.path.abspath(os.path.join(self.logdir, "checkpoints"))
checkpoint_prefix = os.path.join(checkpoint_dir, "model")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
saver = tf.train.Saver(tf.global_variables(), max_to_keep=self.num_checkpoints)
best_val_auc = 0
stat_file = open(str(self.logdir) + "/epoch_stats.tsv", "w")
stat_file.write("#epoch\ttrain_loss\tval_loss\tval_auc\ttest_loss\ttest_auc\n")
p_train = PairFactory(self.db_name, 'train_pairs', self.batch_size, self.max_instructions)
p_validation = PairFactory(self.db_name, 'validation_pairs', self.batch_size, self.max_instructions, False)
p_test = PairFactory(self.db_name, 'test_pairs', self.batch_size, self.max_instructions, False)
step = 0
for epoch in range(0, self.num_epochs):
epoch_msg = ""
epoch_msg += " epoch: {}\n".format(epoch)
epoch_loss = 0
# ----------------------#
# TRAIN #
# ----------------------#
n_batch = 0
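# epoch % 25: the pair tables appear to be pre-generated for a fixed number
# of epochs and reused cyclically (an assumption inferred from the modulus;
# see dataset_creation/DataSplitter.create_pairs).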
for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm(
p_train.async_chunker(epoch % 25), total=p_train.num_batches):
feed_dict = {
self.network.x_1: function1_batch,
self.network.x_2: function2_batch,
self.network.lengths_1: len1_batch,
self.network.lengths_2: len2_batch,
self.network.y: y_batch,
}
summaries, _, loss, norms, cs = sess.run(
[train_summary_op, self.network.train_step, self.network.loss, self.network.norms,
self.network.cos_similarity],
feed_dict=feed_dict)
train_summary_writer.add_summary(summaries, step)
epoch_loss += loss * p_train.batch_dim # ???
step += 1
# recap epoch
epoch_loss /= p_train.num_pairs
epoch_msg += "\ttrain_loss: {}\n".format(epoch_loss)
# ----------------------#
# VALIDATION #
# ----------------------#
val_loss = 0
epoch_msg += "\n"
val_y = []
val_pred = []
for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm(
p_validation.async_chunker(0), total=p_validation.num_batches):
feed_dict = {
self.network.x_1: function1_batch,
self.network.x_2: function2_batch,
self.network.lengths_1: len1_batch,
self.network.lengths_2: len2_batch,
self.network.y: y_batch,
}
summaries, loss, similarities = sess.run(
[val_summary_op, self.network.loss, self.network.cos_similarity], feed_dict=feed_dict)
val_loss += loss * p_validation.batch_dim
val_summary_writer.add_summary(summaries, step)
val_y.extend(y_batch)
val_pred.extend(similarities.tolist())
val_loss /= p_validation.num_pairs
if np.isnan(val_pred).any():
print("Validation: carefull there is NaN in some ouput values, I am fixing it but be aware...")
val_pred = np.nan_to_num(val_pred)
val_fpr, val_tpr, val_thresholds = metrics.roc_curve(val_y, val_pred, pos_label=1)
val_auc = metrics.auc(val_fpr, val_tpr)
epoch_msg += "\tval_loss : {}\n\tval_auc : {}\n".format(val_loss, val_auc)
sys.stdout.write(
"\r\tepoch {} / {}, loss {:g}, val_auc {:g}, norms {}".format(epoch, self.num_epochs, epoch_loss,
val_auc, norms))
sys.stdout.flush()
# execute test only if validation auc increased
test_loss = "-"
test_auc = "-"
# in case of cross validation we do not need to evaluate on a test split that is effectively missing
if val_auc > best_val_auc and self.cross_val:
#
##-- --##
#
best_val_auc = val_auc
saver.save(sess, checkpoint_prefix)
print("\nNEW BEST_VAL_AUC: {} !\n".format(best_val_auc))
# write ROC raw data
with open(str(self.logdir) + "/best_val_roc.tsv", "w") as the_file:
the_file.write("#thresholds\ttpr\tfpr\n")
for t, tpr, fpr in zip(val_thresholds, val_tpr, val_fpr):
the_file.write("{}\t{}\t{}\n".format(t, tpr, fpr))
# in case we are not cross validating we expect to have a test split.
if val_auc > best_val_auc and not self.cross_val:
best_val_auc = val_auc
epoch_msg += "\tNEW BEST_VAL_AUC: {} !\n".format(best_val_auc)
# save best model
saver.save(sess, checkpoint_prefix)
# ----------------------#
# TEST #
# ----------------------#
# TEST
test_loss = 0
epoch_msg += "\n"
test_y = []
test_pred = []
for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm(
p_test.async_chunker(0), total=p_test.num_batches):
feed_dict = {
self.network.x_1: function1_batch,
self.network.x_2: function2_batch,
self.network.lengths_1: len1_batch,
self.network.lengths_2: len2_batch,
self.network.y: y_batch,
}
summaries, loss, similarities = sess.run(
[test_summary_op, self.network.loss, self.network.cos_similarity], feed_dict=feed_dict)
test_loss += loss * p_test.batch_dim
test_summary_writer.add_summary(summaries, step)
test_y.extend(y_batch)
test_pred.extend(similarities.tolist())
test_loss /= p_test.num_pairs
if np.isnan(test_pred).any():
print("Test: carefull there is NaN in some ouput values, I am fixing it but be aware...")
test_pred = np.nan_to_num(test_pred)
test_fpr, test_tpr, test_thresholds = metrics.roc_curve(test_y, test_pred, pos_label=1)
# write ROC raw data
with open(str(self.logdir) + "/best_test_roc.tsv", "w") as the_file:
the_file.write("#thresholds\ttpr\tfpr\n")
for t, tpr, fpr in zip(test_thresholds, test_tpr, test_fpr):
the_file.write("{}\t{}\t{}\n".format(t, tpr, fpr))
test_auc = metrics.auc(test_fpr, test_tpr)
epoch_msg += "\ttest_loss : {}\n\ttest_auc : {}\n".format(test_loss, test_auc)
fig = plt.figure()
plt.title('Receiver Operating Characteristic')
plt.plot(test_fpr, test_tpr, 'b',
label='AUC = %0.2f' % test_auc)
fig.savefig(str(self.logdir) + "/best_test_roc.png")
print(
"\nNEW BEST_VAL_AUC: {} !\n\ttest_loss : {}\n\ttest_auc : {}\n".format(best_val_auc, test_loss,
test_auc))
plt.close(fig)
stat_file.write(
"{}\t{}\t{}\t{}\t{}\t{}\n".format(epoch, epoch_loss, val_loss, val_auc, test_loss, test_auc))
self.logger.info("\n{}\n".format(epoch_msg))
stat_file.close()
sess.close()
return best_val_auc
================================================
FILE: neural_network/SiameseSAFE.py
================================================
import tensorflow as tf
# SAFE TEAM
#
#
# distributed under license: CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt)
#
# Siamese Self-Attentive Network for Binary Similarity:
#
# Our arXiv paper: https://arxiv.org/abs/1811.05296
#
# based on the self-attentive network of arXiv:1703.03130, Z. Lin et al., "A Structured Self-Attentive Sentence Embedding"
#
# Authors: SAFE team
class SiameseSelfAttentive:
def __init__(self,
rnn_state_size, # Dimension of the RNN State
learning_rate, # Learning rate
l2_reg_lambda,
batch_size,
max_instructions,
embedding_matrix, # Matrix containing the embeddings for each asm instruction
trainable_embeddings,
# if this value is True, the embeddings of the asm instructions are updated by the training.
attention_hops, # attention hops parameter r of [1]
attention_depth, # attention depth parameter d_a of [1]
dense_layer_size, # parameter e of [1]
embedding_size, # size of the final function embedding; in our tests this is twice the rnn_state_size
):
self.rnn_depth = 1 # if this value is modified then the RNN becomes a multilayer network. In our tests we fixed it to 1; feel free to be adventurous.
self.learning_rate = learning_rate
self.l2_reg_lambda = l2_reg_lambda
self.rnn_state_size = rnn_state_size
self.batch_size = batch_size
self.max_instructions = max_instructions
self.embedding_matrix = embedding_matrix
self.trainable_embeddings = trainable_embeddings
self.attention_hops = attention_hops
self.attention_depth = attention_depth
self.dense_layer_size = dense_layer_size
self.embedding_size = embedding_size
# self.generate_new_safe()
def restore_model(self, old_session):
graph = old_session.graph
self.x_1 = graph.get_tensor_by_name("x_1:0")
self.x_2 = graph.get_tensor_by_name("x_2:0")
self.len_1 = graph.get_tensor_by_name("lengths_1:0")
self.len_2 = graph.get_tensor_by_name("lengths_2:0")
self.y = graph.get_tensor_by_name('y_:0')
self.cos_similarity = graph.get_tensor_by_name("siamese_layer/cosSimilarity:0")
self.loss = graph.get_tensor_by_name("Loss/loss:0")
self.train_step = graph.get_operation_by_name("Train_Step/Adam")
return
def self_attentive_network(self, input_x, lengths):
# each function is a list of embedding ids (an id is an index into the embedding matrix);
# with this we transform it into a list of embedding vectors.
embbedded_functions = tf.nn.embedding_lookup(self.instructions_embeddings_t, input_x)
# We create the GRU RNN
(output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(self.cell_fw, self.cell_bw, embbedded_functions,
sequence_length=lengths, dtype=tf.float32,
time_major=False)
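# Shapes (batch-major): output_fw and output_bw are each [batch, seq_len, rnn_state_size],
# so H below is [batch, seq_len, 2*rnn_state_size]; A is [batch, attention_hops, seq_len]
# and M is [batch, attention_hops, 2*rnn_state_size], flattened to one vector per function.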
# We create the matrix H
H = tf.concat([output_fw, output_bw], axis=2)
# We do a tile to account for training batches
ws1_tiled = tf.tile(tf.expand_dims(self.WS1, 0), [tf.shape(H)[0], 1, 1], name="WS1_tiled")
ws2_tile = tf.tile(tf.expand_dims(self.WS2, 0), [tf.shape(H)[0], 1, 1], name="WS2_tiled")
# we compute the matrix A
self.A = tf.nn.softmax(tf.matmul(ws2_tile, tf.nn.tanh(tf.matmul(ws1_tiled, tf.transpose(H, perm=[0, 2, 1])))),
name="Attention_Matrix")
# embedding matrix M
M = tf.identity(tf.matmul(self.A, H), name="Attention_Embedding")
# we create the flattened version of M
flattened_M = tf.reshape(M, [tf.shape(M)[0], self.attention_hops * self.rnn_state_size * 2])
return flattened_M
def generate_new_safe(self):
self.instructions_embeddings_t = tf.Variable(initial_value=tf.constant(self.embedding_matrix),
trainable=self.trainable_embeddings,
name="instructions_embeddings", dtype=tf.float32)
self.x_1 = tf.placeholder(tf.int32, [None, self.max_instructions],
name="x_1") # List of instructions for Function 1
self.lengths_1 = tf.placeholder(tf.int32, [None], name='lengths_1') # List of lengths for Function 1
# example x_1=[[mov,add,padding,padding],[mov,mov,mov,padding]]
# lengths_1=[2,3]
self.x_2 = tf.placeholder(tf.int32, [None, self.max_instructions],
name="x_2") # List of instructions for Function 2
self.lengths_2 = tf.placeholder(tf.int32, [None], name='lengths_2') # List of lengths for Function 2
self.y = tf.placeholder(tf.float32, [None], name='y_') # Real label of the pairs, +1 similar, -1 dissimilar.
# Euclidean norms; p = 2
self.norms = []
# Keeping track of l2 regularization loss (optional)
l2_loss = tf.constant(0.0)
with tf.name_scope('parameters_Attention'):
self.WS1 = tf.Variable(tf.truncated_normal([self.attention_depth, 2 * self.rnn_state_size], stddev=0.1),
name="WS1")
self.WS2 = tf.Variable(tf.truncated_normal([self.attention_hops, self.attention_depth], stddev=0.1),
name="WS2")
rnn_layers_fw = [tf.nn.rnn_cell.GRUCell(size) for size in ([self.rnn_state_size] * self.rnn_depth)]
rnn_layers_bw = [tf.nn.rnn_cell.GRUCell(size) for size in ([self.rnn_state_size] * self.rnn_depth)]
self.cell_fw = tf.nn.rnn_cell.MultiRNNCell(rnn_layers_fw)
self.cell_bw = tf.nn.rnn_cell.MultiRNNCell(rnn_layers_bw)
with tf.name_scope('Self-Attentive1'):
self.function_1 = self.self_attentive_network(self.x_1, self.lengths_1)
with tf.name_scope('Self-Attentive2'):
self.function_2 = self.self_attentive_network(self.x_2, self.lengths_2)
self.dense_1 = tf.nn.relu(tf.layers.dense(self.function_1, self.dense_layer_size))
self.dense_2 = tf.nn.relu(tf.layers.dense(self.function_2, self.dense_layer_size))
with tf.name_scope('Embedding1'):
self.function_embedding_1 = tf.layers.dense(self.dense_1, self.embedding_size)
with tf.name_scope('Embedding2'):
self.function_embedding_2 = tf.layers.dense(self.dense_2, self.embedding_size)
with tf.name_scope('siamese_layer'):
self.cos_similarity = tf.reduce_sum(tf.multiply(self.function_embedding_1, self.function_embedding_2),
axis=1,
name="cosSimilarity")
# CalculateMean cross-entropy loss
with tf.name_scope("Loss"):
A_square = tf.matmul(self.A, tf.transpose(self.A, perm=[0, 2, 1]))
I = tf.eye(tf.shape(A_square)[1])
I_tiled = tf.tile(tf.expand_dims(I, 0), [tf.shape(A_square)[0], 1, 1], name="I_tiled")
self.A_pen = tf.norm(A_square - I_tiled)
self.loss = tf.reduce_sum(tf.squared_difference(self.cos_similarity, self.y), name="loss")
self.regularized_loss = self.loss + self.l2_reg_lambda * l2_loss + self.A_pen
# Train step
with tf.name_scope("Train_Step"):
self.train_step = tf.train.AdamOptimizer(self.learning_rate).minimize(self.regularized_loss)
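# --- Illustration (not part of the original file) ---
# A self-contained NumPy sketch of the attention penalty computed in the Loss
# scope above: P = ||A A^T - I||_F pushes the attention_hops rows of A to
# attend to different instructions. The values here are made up.
if __name__ == "__main__":
    import numpy as np
    hops, seq_len = 10, 150                           # attention_hops, max_instructions
    A = np.random.rand(hops, seq_len)
    A = A / A.sum(axis=1, keepdims=True)              # rows sum to 1, softmax-like
    penalty = np.linalg.norm(A @ A.T - np.eye(hops))  # Frobenius norm
    print("attention penalty:", penalty)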
================================================
FILE: neural_network/__init__.py
================================================
================================================
FILE: neural_network/freeze_graph.sh
================================================
#!/bin/sh
echo "usage: ./freeze_graph MODEL_DIR FREEZED_NAME"
MODEL_DIR=$1
FREEZED_NAME=$2
freeze_graph --input_meta_graph "$MODEL_DIR/checkpoints/model.meta" \
--output_graph "$FREEZED_NAME" \
--output_node_names Embedding1/dense/BiasAdd \
--input_binary=true \
--input_checkpoint "$MODEL_DIR/checkpoints/model"
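# Example (hypothetical paths): freeze the last training run into a single
# GraphDef that SAFEEmbedder.loadmodel() can read:
#   ./freeze_graph.sh out/last_run safe_model.pb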
================================================
FILE: neural_network/parameters.py
================================================
# SAFE TEAM
# distributed under license: GPL 3 License http://www.gnu.org/licenses/
import argparse
import time
import sys, os
import logging
#
# Parameters File for the SAFE network.
#
# Authors: SAFE team
def getLogger(logfile):
logger = logging.getLogger(__name__)
hdlr = logging.FileHandler(logfile)
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.INFO)
return logger, hdlr
class Flags:
def __init__(self):
parser = argparse.ArgumentParser(description='SAFE')
parser.add_argument("-o", "--output", dest="output_file", help="output directory for logging and models",
required=False)
parser.add_argument("-e", "--embedder", dest="embedder_folder",
help="file with the embedding matrix and dictionary for asm instructions", required=False)
parser.add_argument("-n", "--dbName", dest="db_name", help="Name of the database", required=False)
parser.add_argument("-ld", "--load_dir", dest="load_dir", help="Load the model from directory load_dir",
required=False)
parser.add_argument("-r", "--random", help="if present the network use random embedder", default=False,
action="store_true", dest="random_embedding", required=False)
parser.add_argument("-te", "--trainable_embedding",
help="if present the network consider the embedding as trainable", action="store_true",
dest="trainable_embeddings", default=False)
parser.add_argument("-cv", "--cross_val", help="if present the training is done with cross validiation",
default=False, action="store_true", dest="cross_val")
args = parser.parse_args()
# mode = mean_field
self.batch_size = 250 # minibatch size (-1 = whole dataset)
self.num_epochs = 50 # number of epochs
self.embedding_size = 100 # dimension of the function embedding
self.learning_rate = 0.001 # init learning_rate
self.l2_reg_lambda = 0 # regularization coefficient (previously 0.002)
self.num_checkpoints = 1 # max number of checkpoints
self.out_dir = args.output_file # directory for logging
self.rnn_state_size = 50 # dimension of the rnn state
self.db_name = args.db_name
self.load_dir = str(args.load_dir)
self.random_embedding = args.random_embedding
self.trainable_embeddings = args.trainable_embeddings
self.cross_val = args.cross_val
self.cross_val_fold = 5
#
##
## RNN PARAMETERS, these parameters are only used for the RNN model.
#
self.rnn_depth = 1 # depth of the rnn
self.max_instructions = 150 # max number of instructions per function
## ATTENTION PARAMETERS
self.attention_hops = 10
self.attention_depth = 250
# RNN SINGLE PARAMETER
self.dense_layer_size = 2000
self.seed = 2 # random seed
# create logdir and logger
self.reset_logdir()
self.embedder_folder = args.embedder_folder
def reset_logdir(self):
# create logdir
timestamp = str(int(time.time()))
self.logdir = os.path.abspath(os.path.join(self.out_dir, "runs", timestamp))
os.makedirs(self.logdir, exist_ok=True)
# create logger
self.log_file = str(self.logdir) + '/console.log'
self.logger, self.hdlr = getLogger(self.log_file)
# create symlink for last_run
sym_path_logdir = str(self.out_dir) + "/last_run"
try:
os.unlink(sym_path_logdir)
except:
pass
try:
os.symlink(self.logdir, sym_path_logdir)
except:
print("\nfailed to create symlink!\n")
def close_log(self):
self.hdlr.close()
self.logger.removeHandler(self.hdlr)
handlers = self.logger.handlers[:]
for handler in handlers:
handler.close()
self.logger.removeHandler(handler)
def __str__(self):
msg = ""
msg += "\nParameters:\n"
msg += "\tRandom embedding: {}\n".format(self.random_embedding)
msg += "\tTrainable embedding: {}\n".format(self.trainable_embeddings)
msg += "\tlogdir: {}\n".format(self.logdir)
msg += "\tbatch_size: {}\n".format(self.batch_size)
msg += "\tnum_epochs: {}\n".format(self.num_epochs)
msg += "\tembedding_size: {}\n".format(self.embedding_size)
msg += "\trnn_state_size: {}\n".format(self.rnn_state_size)
msg += "\tattention depth: {}\n".format(self.attention_depth)
msg += "\tattention hops: {}\n".format(self.attention_hops)
msg += "\tdense layer e: {}\n".format(self.dense_layer_size)
msg += "\tlearning_rate: {}\n".format(self.learning_rate)
msg += "\tl2_reg_lambda: {}\n".format(self.l2_reg_lambda)
msg += "\tnum_checkpoints: {}\n".format(self.num_checkpoints)
msg += "\tseed: {}\n".format(self.seed)
msg += "\tMax Instructions per functions: {}\n".format(self.max_instructions)
return msg
================================================
FILE: neural_network/train.py
================================================
from SAFE_model import modelSAFE
from parameters import Flags
import sys
import os
import numpy as np
from utils import utils
import traceback
def load_embedding_matrix(embedder_folder):
matrix_file='embedding_matrix.npy'
matrix_path=os.path.join(embedder_folder,matrix_file)
if os.path.isfile(matrix_path):
try:
print('Loading embedding matrix....')
with open(matrix_path,'rb') as f:
return np.float32(np.load(f))
except Exception as e:
print("Exception handling file:"+str(matrix_path))
print("Embedding matrix cannot be load")
print(str(e))
sys.exit(-1)
else:
print('Embedding matrix not found at path:'+str(matrix_path))
sys.exit(-1)
def run_test():
flags = Flags()
flags.logger.info("\n{}\n".format(flags))
print(str(flags))
embedding_matrix = load_embedding_matrix(flags.embedder_folder)
if flags.random_embedding:
embedding_matrix = np.random.rand(*np.shape(embedding_matrix)).astype(np.float32)
embedding_matrix[0, :] = np.zeros(np.shape(embedding_matrix)[1]).astype(np.float32)
if flags.cross_val:
print("STARTING CROSS VALIDATION")
res = []
mean = 0
for i in range(0, flags.cross_val_fold):
print("CROSS VALIDATION STARTING FOLD: " + str(i))
if i > 0:
flags.close_log()
flags.reset_logdir()
del flags
flags = Flags()
flags.logger.info("\n{}\n".format(flags))
flags.logger.info("Starting cross validation fold: {}".format(i))
flags.db_name = flags.db_name + "_val_" + str(i+1) + ".db"
flags.logger.info("Cross validation db name: {}".format(flags.db_name))
trainer = modelSAFE(flags, embedding_matrix)
best_val_auc = trainer.train()
mean += best_val_auc
res.append(best_val_auc)
flags.logger.info("Cross validation fold {} finished best auc: {}".format(i, best_val_auc))
print("FINISH FOLD: " + str(i) + " BEST VAL AUC: " + str(best_val_auc))
print("CROSS VALIDATION ENDED")
print("Result: " + str(res))
print("")
flags.logger.info("Cross validation finished results: {}".format(res))
flags.logger.info(" mean: {}".format(mean / flags.cross_val_fold))
flags.close_log()
else:
trainer = modelSAFE(flags, embedding_matrix)
trainer.train()
flags.close_log()
if __name__ == '__main__':
utils.print_safe()
print('-Trainer for SAFE-')
run_test()
================================================
FILE: neural_network/train.sh
================================================
#!/bin/sh
BASE_PATH="/home/luca/work/binary_similarity_data/"
DATA_PATH=$BASE_PATH/experiments/arith_mean_openSSL_no_dropout_no_shuffle_no_regeneration_emb_random_trainable
OUT_PATH=$DATA_PATH/out
DB_PATH=$BASE_PATH/databases/openSSL_data.db
EMBEDDER=$BASE_PATH/word2vec/filtered_100_embeddings/
RANDOM=""
TRAINABLE_EMBEDD=""
python3 train.py $RANDOM $TRAINABLE_EMBEDD --o $OUT_PATH -n $DB_PATH -e $EMBEDDER
================================================
FILE: requirements.txt
================================================
tensorflow<2  # the code targets TF1 APIs (tf.Session, tf.placeholder, ...)
scikit-learn
numpy
scipy
matplotlib
tqdm
r2pipe
pyfiglet
================================================
FILE: safe.py
================================================
# SAFE TEAM
# Copyright (C) 2019 Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni
from asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer
from argparse import ArgumentParser
from asm_embedding.FunctionNormalizer import FunctionNormalizer
from asm_embedding.InstructionsConverter import InstructionsConverter
from neural_network.SAFEEmbedder import SAFEEmbedder
from utils import utils
class SAFE:
def __init__(self, model):
self.converter = InstructionsConverter("data/i2v/word2id.json")
self.normalizer = FunctionNormalizer(max_instruction=150)
self.embedder = SAFEEmbedder(model)
self.embedder.loadmodel()
self.embedder.get_tensor()
def embedd_function(self, filename, address):
analyzer = RadareFunctionAnalyzer(filename, use_symbol=False, depth=0)
functions = analyzer.analyze()
instructions_list = None
for function in functions:
if functions[function]['address'] == address:
instructions_list = functions[function]['filtered_instructions']
break
if instructions_list is None:
print("Function not found")
return None
converted_instructions = self.converter.convert_to_ids(instructions_list)
instructions, length = self.normalizer.normalize_functions([converted_instructions])
embedding = self.embedder.embedd(instructions, length)
return embedding
if __name__ == '__main__':
utils.print_safe()
parser = ArgumentParser(description="Safe Embedder")
parser.add_argument("-m", "--model", help="Safe trained model to generate function embeddings")
parser.add_argument("-i", "--input", help="Input executable that contains the function to embedd")
parser.add_argument("-a", "--address", help="Hexadecimal address of the function to embedd")
args = parser.parse_args()
address = int(args.address, 16)
safe = SAFE(args.model)
embedding = safe.embedd_function(args.input, address)
print(embedding[0])
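# --- Usage sketch (not part of the original file) ---
# Hypothetical invocation; the model path and function address are illustrative
# (use the address of a function found in your binary, e.g. via radare2):
#   python3 safe.py -m safe_model.pb -i helloworld.o -a 0x401000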
================================================
FILE: utils/__init__.py
================================================
================================================
FILE: utils/utils.py
================================================
from pyfiglet import figlet_format
def print_safe():
a = figlet_format('SAFE', font='starwars')
print(a)
print("By Massarelli L., Di Luna G. A., Petroni F., Querzoni L., Baldoni R.")
print("Please cite: http://arxiv.org/abs/1811.05296 \n")
================================================
SYMBOL INDEX (125 symbols across 22 files)
================================================
FILE: asm_embedding/DocumentManipulation.py
function list_to_str (line 5) | def list_to_str(li):
function document_append (line 12) | def document_append(strin):
FILE: asm_embedding/FunctionAnalyzerRadare.py
class RadareFunctionAnalyzer (line 8) | class RadareFunctionAnalyzer:
method __init__ (line 10) | def __init__(self, filename, use_symbol, depth):
method __enter__ (line 17) | def __enter__(self):
method filter_reg (line 21) | def filter_reg(op):
method filter_imm (line 25) | def filter_imm(op):
method filter_mem (line 34) | def filter_mem(op):
method filter_memory_references (line 48) | def filter_memory_references(i):
method get_callref (line 68) | def get_callref(my_function, depth):
method get_instruction (line 76) | def get_instruction(self):
method function_to_inst (line 92) | def function_to_inst(self, functions_dict, my_function, depth):
method get_arch (line 130) | def get_arch(self):
method find_functions (line 142) | def find_functions(self):
method find_functions_by_symbols (line 150) | def find_functions_by_symbols(self):
method analyze (line 159) | def analyze(self):
method close (line 188) | def close(self):
method __exit__ (line 191) | def __exit__(self, exc_type, exc_value, traceback):
FILE: asm_embedding/FunctionNormalizer.py
class FunctionNormalizer (line 7) | class FunctionNormalizer:
method __init__ (line 9) | def __init__(self, max_instruction):
method normalize (line 12) | def normalize(self, f):
method normalize_function_pairs (line 19) | def normalize_function_pairs(self, pairs):
method normalize_functions (line 29) | def normalize_functions(self, functions):
FILE: asm_embedding/InstructionsConverter.py
class InstructionsConverter (line 7) | class InstructionsConverter:
method __init__ (line 9) | def __init__(self, json_i2id):
method convert_to_ids (line 14) | def convert_to_ids(self, instructions_list):
FILE: dataset_creation/DataSplitter.py
class DataSplitter (line 10) | class DataSplitter:
method __init__ (line 12) | def __init__(self, db_name):
method create_pair_table (line 15) | def create_pair_table(self, table_name):
method get_ids (line 23) | def get_ids(self, set_type):
method select_similar_cfg (line 32) | def select_similar_cfg(id, provenance, ids, cursor):
method select_dissimilar_cfg (line 41) | def select_dissimilar_cfg(ids, provenance, cursor):
method create_epoch_pairs (line 50) | def create_epoch_pairs(self, epoch_number, pairs_table,id_table):
method create_pairs (line 76) | def create_pairs(self, total_epochs):
method prepare_set (line 94) | def prepare_set(data_to_include, table_name, file_list, cur):
method split_data (line 105) | def split_data(self, validation_dim, test_dim):
FILE: dataset_creation/DatabaseFactory.py
class DatabaseFactory (line 17) | class DatabaseFactory:
method __init__ (line 19) | def __init__(self, db_name, root_path):
method worker (line 24) | def worker(item):
method extract_function (line 29) | def extract_function(graph_analyzer):
method insert_in_db (line 34) | def insert_in_db(db_name, pool_sem, func, filename, function_name, ins...
method analyze_file (line 62) | def analyze_file(item):
method create_db (line 94) | def create_db(self):
method scan_for_file (line 114) | def scan_for_file(self, start):
method remove_override (line 128) | def remove_override(self, file_list):
method build_db (line 146) | def build_db(self, use_symbol, depth):
FILE: dataset_creation/ExperimentUtil.py
function debug_msg (line 9) | def debug_msg():
function build_configuration (line 27) | def build_configuration(db_name, root_dir, use_symbols, callee_depth):
function split_configuration (line 36) | def split_configuration(db_name, val_split, test_split, epochs):
function embedd_configuration (line 45) | def embedd_configuration(db_name, model, batch_size, max_instruction, em...
FILE: dataset_creation/FunctionsEmbedder.py
class FunctionsEmbedder (line 12) | class FunctionsEmbedder:
method __init__ (line 14) | def __init__(self, model, batch_size, max_instruction):
method compute_embeddings (line 21) | def compute_embeddings(self, functions):
method create_table (line 27) | def create_table(db_name, table_name):
method compute_and_save_embeddings_from_db (line 34) | def compute_and_save_embeddings_from_db(self, db_name, table_name):
FILE: dataset_creation/convertDB.py
function create_db (line 13) | def create_db(db_name):
function reverse_graph (line 32) | def reverse_graph(cfg, lstm_cfg):
function copy_split (line 47) | def copy_split(old_cur, new_cur, table):
function copy_table (line 55) | def copy_table(old_cur, new_cur, table_old, table_new):
FILE: downloader.py
class Downloader (line 10) | class Downloader:
method __init__ (line 12) | def __init__(self):
method download_file (line 55) | def download_file(id,path):
method decompress_file (line 64) | def decompress_file(file_src,file_path):
method download (line 72) | def download(self):
FILE: function_search/EvaluateSearchEngine.py
class SearchEngineEvaluator (line 16) | class SearchEngineEvaluator:
method __init__ (line 18) | def __init__(self, db_name, table, limit=None,k=None):
method do_search (line 25) | def do_search(self, target_db_name, target_fcn_ids):
method calc_auc (line 29) | def calc_auc(self, target_db_name, target_fcn_ids):
method find_target_fcn (line 38) | def find_target_fcn(self, compiler, opt, num):
method functions_ground_truth (line 66) | def functions_ground_truth(labels, indices, values, true_label):
method evaluate_precision_on_all_functions (line 88) | def evaluate_precision_on_all_functions(self, compiler, opt):
function test (line 110) | def test(dbName, table, opt,x,k):
FILE: function_search/FunctionSearchEngine.py
class TopK (line 21) | class TopK:
method __init__ (line 26) | def __init__(self):
method loads_embeddings_SE (line 30) | def loads_embeddings_SE(self, lista_embeddings):
method topK (line 42) | def topK(self, k, target):
class FunctionSearchEngine (line 47) | class FunctionSearchEngine:
method __init__ (line 49) | def __init__(self, db_name, table_name, limit=None):
method load_target (line 91) | def load_target(self, target_db_name, target_fcn_ids, calc_mean=False):
method embeddingToNp (line 117) | def embeddingToNp(self, e):
method top_k (line 124) | def top_k(self, target, k=None):
method pp_search (line 131) | def pp_search(self, k):
method search (line 138) | def search(self, k):
FILE: function_search/fromJsonSearchToPlot.py
function find_dcg (line 11) | def find_dcg(element_list):
function count_ones (line 18) | def count_ones(element_list):
function extract_info (line 22) | def extract_info(file_1):
function print_graph (line 62) | def print_graph(info1, file_name, label_y, title_1, p):
function compare_and_print (line 73) | def compare_and_print(file):
FILE: helloworld.c
function main (line 4) | int main(){
FILE: neural_network/PairFactory.py
class PairFactory (line 20) | class PairFactory:
method __init__ (line 22) | def __init__(self, db_name, dataset_type, batch_size, max_instructions...
method split (line 40) | def split( a, n):
method truncate_and_compute_lengths (line 44) | def truncate_and_compute_lengths(pairs, max_instructions):
method async_chunker (line 59) | def async_chunker(self, epoch):
method get_pair_fromdb (line 89) | def get_pair_fromdb(self, id_1, id_2):
method get_couple_from_db (line 100) | def get_couple_from_db(self, epoch_number, chunk):
method async_create_couple (line 166) | def async_create_couple(self, epoch,n_chunk,q):
method async_get_dataset (line 171) | def async_get_dataset(self, q):
FILE: neural_network/SAFEEmbedder.py
class SAFEEmbedder (line 5) | class SAFEEmbedder:
method __init__ (line 7) | def __init__(self, model_file):
method loadmodel (line 15) | def loadmodel(self):
method get_tensor (line 28) | def get_tensor(self):
method embedd (line 33) | def embedd(self, nodi_input, lengths_input):
FILE: neural_network/SAFE_model.py
class modelSAFE (line 18) | class modelSAFE:
method __init__ (line 20) | def __init__(self, flags, embedding_matrix):
method load_model (line 49) | def load_model(path):
method create_network (line 71) | def create_network(self):
method train (line 86) | def train(self):
FILE: neural_network/SiameseSAFE.py
class SiameseSelfAttentive (line 16) | class SiameseSelfAttentive:
method __init__ (line 18) | def __init__(self,
method restore_model (line 47) | def restore_model(self, old_session):
method self_attentive_network (line 61) | def self_attentive_network(self, input_x, lengths):
method generate_new_safe (line 89) | def generate_new_safe(self):
FILE: neural_network/parameters.py
function getLogger (line 16) | def getLogger(logfile):
class Flags (line 26) | class Flags:
method __init__ (line 28) | def __init__(self):
method reset_logdir (line 85) | def reset_logdir(self):
method close_log (line 106) | def close_log(self):
method __str__ (line 114) | def __str__(self):
FILE: neural_network/train.py
function load_embedding_matrix (line 10) | def load_embedding_matrix(embedder_folder):
function run_test (line 29) | def run_test():
FILE: safe.py
class SAFE (line 12) | class SAFE:
method __init__ (line 14) | def __init__(self, model):
method embedd_function (line 21) | def embedd_function(self, filename, address):
FILE: utils/utils.py
function print_safe (line 4) | def print_safe():