[
  {
    "path": "404.html",
    "content": "---\nlayout: default\n---\n\n<style type=\"text/css\" media=\"screen\">\n  .container {\n    margin: 10px auto;\n    max-width: 600px;\n    text-align: center;\n  }\n  h1 {\n    margin: 30px 0;\n    font-size: 4em;\n    line-height: 1;\n    letter-spacing: -1px;\n  }\n</style>\n\n<div class=\"container\">\n  <h1>404</h1>\n\n  <p><strong>Page not found :(</strong></p>\n  <p>The requested page could not be found.</p>\n</div>\n"
  },
  {
    "path": "Gemfile",
    "content": "source \"https://rubygems.org\"\n\n# Hello! This is where you manage which Jekyll version is used to run.\n# When you want to use a different version, change it below, save the\n# file and run `bundle install`. Run Jekyll with `bundle exec`, like so:\n#\n#     bundle exec jekyll serve\n#\n# This will help ensure the proper Jekyll version is running.\n# Happy Jekylling!\ngem \"jekyll\", \"~> 3.7.4\"\n\n# This is the default theme for new Jekyll sites. You may change this to anything you like.\ngem \"minima\", \"~> 2.0\"\n\n# If you want to use GitHub Pages, remove the \"gem \"jekyll\"\" above and\n# uncomment the line below. To upgrade, run `bundle update github-pages`.\n# gem \"github-pages\", group: :jekyll_plugins\n#gem \"github-pages\", group: :jekyll_plugins\n\n# If you have any plugins, put them here!\ngroup :jekyll_plugins do\n  gem \"jekyll-feed\", \"~> 0.6\"\nend\n\n# Windows does not include zoneinfo files, so bundle the tzinfo-data gem\ngem \"tzinfo-data\", platforms: [:mingw, :mswin, :x64_mingw, :jruby]\n\n# Performance-booster for watching directories on Windows\ngem \"wdm\", \"~> 0.1.0\" if Gem.win_platform?\n\n"
  },
  {
    "path": "LICENSE",
    "content": "    Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n    \n"
  },
  {
    "path": "README.md",
    "content": "# SAFE : Self Attentive Function Embedding\n\nPaper\n---\nThis software is the outcome of our accademic research. See our arXiv paper: [arxiv](https://arxiv.org/abs/1811.05296)\n\nIf you use this code, please cite our accademic paper as:\n\n```bibtex\n@inproceedings{massarelli2018safe,\n  title={SAFE: Self-Attentive Function Embeddings for Binary Similarity},\n  author={Massarelli, Luca and Di Luna, Giuseppe Antonio and Petroni, Fabio and Querzoni, Leonardo and Baldoni, Roberto},\n  booktitle={Proceedings of 16th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA)},\n  year={2019}\n}\n```\n\nWhat you need  \n-----\nYou need [radare2](https://github.com/radare/radare2) installed in your system. \n  \nQuickstart\n-----\nTo create the embedding of a function:\n```\ngit clone https://github.com/gadiluna/SAFE.git\npip install -r requirements\nchmod +x download_model.sh\n./download_model.sh\npython safe.py -m data/safe.pb -i helloworld.o -a 100000F30\n```\n#### What to do with an embedding?\nOnce you have two embeddings ```embedding_x``` and ```embedding_y``` you can compute the similarity of the corresponding functions as: \n```\nfrom sklearn.metrics.pairwise import cosine_similarity\n\nsim=cosine_similarity(embedding_x, embedding_y)\n \n```\n\n\nData Needed\n-----\nSAFE needs few information to work. 
Two are essential: a model that tells SAFE how to\nconvert assembly instructions into vectors (the i2v model) and a model that tells SAFE how\nto convert a binary function into a vector.\nBoth models can be downloaded with the command\n```\n./download_model.sh\n```\nThe downloader fetches the models and places them in the data directory.\nThe directory tree after the download should be:\n```\nsafe/-- githubcode\n     \\\n      \\--data/-----safe.pb\n               \\\n                \\---i2v/\n```\nThe safe.pb file contains the SAFE model used to convert binary functions to vectors.\nThe i2v folder contains the i2v model.\n\n\nHardcore Details\n----\nThis section contains details that are needed to replicate our experiments; if you are a user of SAFE you can skip\nit.\n\n### Safe.pb\nThis is the frozen TensorFlow model trained for the AMD64 architecture. You can import it into your project using:\n\n```python\nimport tensorflow as tf\n\nwith tf.gfile.GFile(\"safe.pb\", \"rb\") as f:\n    graph_def = tf.GraphDef()\n    graph_def.ParseFromString(f.read())\n\nwith tf.Graph().as_default() as graph:\n    tf.import_graph_def(graph_def)\n\nsess = tf.Session(graph=graph)\n```\n\nSee file: neural_network/SAFEEmbedder.py\n\n### i2v\nThe i2v folder contains two files:\na matrix where each row is the embedding of an asm instruction, and\na JSON file that contains a dictionary mapping asm instructions to row numbers of the matrix above.\nSee file: asm_embedding/InstructionsConverter.py\n\n\n\n## Train the model\nIf you want to train the model using our datasets you have to first run:\n```\npython3 downloader.py -td\n```\nThis will download the datasets into the data folder. 
Note that the datasets are compressed, so you have to decompress them yourself.\nEach dataset is an sqlite database.\nTo start training, use neural_network/train.sh.\nThe database can be selected by changing the parameter in train.sh.\nIf you want information on the datasets see our paper.\n\n## Create your own dataset\nIf you want to create your own dataset you can use the script ExperimentUtil in the folder\ndataset_creation.\n\n## Create a functions knowledge base\nIf you want to use the SAFE binary code search engine you can use the script ExperimentUtil to create\nthe knowledge base.\nThen you can search through it using the script in function_search.\n\n\nRelated Projects\n---\n\n* YARASAFE: Automatic Binary Function Similarity Checks with Yara (https://github.com/lucamassarelli/yarasafe)\n* SAFEtorch: PyTorch implementation of the SAFE neural network (https://github.com/facebookresearch/SAFEtorch)\n\nThanks\n---\nIn our code we use [godown](https://github.com/circulosmeos/gdown.pl) to download data from Google Drive. We thank\ncirculosmeos, the creator of godown.\n\nWe thank Davide Italiano for the useful discussions.\n"
  },
  {
    "path": "__init__.py",
    "content": ""
  },
  {
    "path": "_config.yml",
    "content": "# Welcome to Jekyll!\n#\n# This config file is meant for settings that affect your whole blog, values\n# which you are expected to set up once and rarely edit after that. If you find\n# yourself editing this file very often, consider using Jekyll's data files\n# feature for the data you need to update frequently.\n#\n# For technical reasons, this file is *NOT* reloaded automatically when you use\n# 'bundle exec jekyll serve'. If you change this file, please restart the server process.\n\n# Site settings\n# These are used to personalize your new site. If you look in the HTML files,\n# you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.\n# You can create any custom variable you would like, and they will be accessible\n# in the templates via {{ site.myvariable }}.\ntitle: 'SAFE: Self-Attentive Function Embeddings'\nemail: safeteam@gmail.com\ndescription: >- # this means to ignore newlines until \"baseurl:\"\n    Self-Attentive Function Embeddings for binary similarity.\n    https://arxiv.org/abs/1811.05296\nbaseurl: \"\" # the subpath of your site, e.g. /blog\nurl: \"\" # the base hostname & protocol for your site, e.g. http://example.com\ntwitter_username: \ngithub_username:  \n\n# Build settings\nmarkdown: kramdown\ntheme: minima\n#theme: jekyll-theme-midnight\nplugins:\n  - jekyll-feed\n\n# Exclude from processing.\n# The following items will not be processed, by default. Create a custom list\n# to override the default setting.\n# exclude:\n#   - Gemfile\n#   - Gemfile.lock\n#   - node_modules\n#   - vendor/bundle/\n#   - vendor/cache/\n#   - vendor/gems/\n#   - vendor/ruby/\n"
  },
  {
    "path": "asm_embedding/DocumentManipulation.py",
    "content": "import json\nimport re\nimport os\n\ndef list_to_str(li):\n    i=''\n    for x in li:\n        i=i+' '+x\n    i=i+' endfun'*5\n    return i\n\ndef document_append(strin):\n    with open('/Users/giuseppe/docuent_X86','a') as f:\n        f.write(strin)\n\nciro=set()\ncantina=[]\nnum_total=0\nnum_filtered=0\nwith open('/Users/giuseppe/dump.x86.linux.json') as f:\n    l=f.readline()\n    print('loaded')\n    r = re.split('(\\[.*?\\])(?= *\\[)', l)\n    del l\n    for x in r:\n        if '[' in x:\n            gennaro=json.loads(x)\n            for materdomini in gennaro:\n                num_total=num_total+1\n                if materdomini[0] not in ciro:\n                    ciro.add(materdomini[0])\n                    num_filtered=num_filtered+1\n                    a=list_to_str(materdomini[1])\n                    document_append(a)\n        del x\n    print(num_total)\n    print(num_filtered)"
  },
  {
    "path": "asm_embedding/FunctionAnalyzerRadare.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport json\nimport r2pipe\n\n\nclass RadareFunctionAnalyzer:\n\n    def __init__(self, filename, use_symbol, depth):\n        self.r2 = r2pipe.open(filename, flags=['-2'])\n        self.filename = filename\n        self.arch, _ = self.get_arch()\n        self.top_depth = depth\n        self.use_symbol = use_symbol\n\n    def __enter__(self):\n        return self\n\n    @staticmethod\n    def filter_reg(op):\n        return op[\"value\"]\n\n    @staticmethod\n    def filter_imm(op):\n        imm = int(op[\"value\"])\n        if -int(5000) <= imm <= int(5000):\n            ret = str(hex(op[\"value\"]))\n        else:\n            ret = str('HIMM')\n        return ret\n\n    @staticmethod\n    def filter_mem(op):\n        if \"base\" not in op:\n            op[\"base\"] = 0\n\n        if op[\"base\"] == 0:\n            r = \"[\" + \"MEM\" + \"]\"\n        else:\n            reg_base = str(op[\"base\"])\n            disp = str(op[\"disp\"])\n            scale = str(op[\"scale\"])\n            r = '[' + reg_base + \"*\" + scale + \"+\" + disp + ']'\n        return r\n\n    @staticmethod\n    def filter_memory_references(i):\n        inst = \"\" + i[\"mnemonic\"]\n\n        for op in i[\"operands\"]:\n            if op[\"type\"] == 'reg':\n                inst += \" \" + RadareFunctionAnalyzer.filter_reg(op)\n            elif op[\"type\"] == 'imm':\n                inst += \" \" + RadareFunctionAnalyzer.filter_imm(op)\n            elif op[\"type\"] == 'mem':\n                inst += \" \" + RadareFunctionAnalyzer.filter_mem(op)\n            if len(i[\"operands\"]) > 1:\n                inst = inst + \",\"\n\n        if \",\" in inst:\n            inst = inst[:-1]\n        inst = inst.replace(\" \", \"_\")\n\n        return str(inst)\n\n    @staticmethod\n    def get_callref(my_function, depth):\n        calls = {}\n        if 
'callrefs' in my_function and depth > 0:\n            for cc in my_function['callrefs']:\n                if cc[\"type\"] == \"C\":\n                    calls[cc['at']] = cc['addr']\n        return calls\n\n    def get_instruction(self):\n        instruction = json.loads(self.r2.cmd(\"aoj 1\"))\n        if len(instruction) > 0:\n            instruction = instruction[0]\n        else:\n            return None\n\n        operands = []\n        if 'opex' not in instruction:\n            return None\n\n        for op in instruction['opex']['operands']:\n            operands.append(op)\n        instruction['operands'] = operands\n        return instruction\n\n    def function_to_inst(self, functions_dict, my_function, depth):\n        instructions = []\n        asm = \"\"\n\n        if self.use_symbol:\n            s = my_function['vaddr']\n        else:\n            s = my_function['offset']\n        calls = RadareFunctionAnalyzer.get_callref(my_function, depth)\n        self.r2.cmd('s ' + str(s))\n\n        if self.use_symbol:\n            end_address = s + my_function[\"size\"]\n        else:\n            end_address = s + my_function[\"realsz\"]\n\n        while s < end_address:\n            instruction = self.get_instruction()\n            asm += instruction[\"bytes\"]\n            if self.arch == 'x86':\n                filtered_instruction = \"X_\" + RadareFunctionAnalyzer.filter_memory_references(instruction)\n            elif self.arch == 'arm':\n                filtered_instruction = \"A_\" + RadareFunctionAnalyzer.filter_memory_references(instruction)\n\n            instructions.append(filtered_instruction)\n\n            if s in calls and depth > 0:\n                if calls[s] in functions_dict:\n                    ii, aa = self.function_to_inst(functions_dict, functions_dict[calls[s]], depth-1)\n                    instructions.extend(ii)\n                    asm += aa\n                    self.r2.cmd(\"s \" + str(s))\n\n            self.r2.cmd(\"so 
1\")\n            s = int(self.r2.cmd(\"s\"), 16)\n\n        return instructions, asm\n\n    def get_arch(self):\n        try:\n            info = json.loads(self.r2.cmd('ij'))\n            if 'bin' in info:\n                arch = info['bin']['arch']\n                bits = info['bin']['bits']\n        except:\n            print(\"Error loading file\")\n            arch = None\n            bits = None\n        return arch, bits\n\n    def find_functions(self):\n        self.r2.cmd('aaa')\n        try:\n            function_list = json.loads(self.r2.cmd('aflj'))\n        except:\n            function_list = []\n        return function_list\n\n    def find_functions_by_symbols(self):\n        self.r2.cmd('aa')\n        try:\n            symbols = json.loads(self.r2.cmd('isj'))\n            fcn_symb = [s for s in symbols if s['type'] == 'FUNC']\n        except:\n            fcn_symb = []\n        return fcn_symb\n\n    def analyze(self):\n        if self.use_symbol:\n            function_list = self.find_functions_by_symbols()\n        else:\n            function_list = self.find_functions()\n\n        functions_dict = {}\n        if self.top_depth > 0:\n            for my_function in function_list:\n                if self.use_symbol:\n                    functions_dict[my_function['vaddr']] = my_function\n                else:\n                    functions_dict[my_function['offset']] = my_function\n\n        result = {}\n        for my_function in function_list:\n            if self.use_symbol:\n                address = my_function['vaddr']\n            else:\n                address = my_function['offset']\n\n            try:\n                instructions, asm = self.function_to_inst(functions_dict, my_function, self.top_depth)\n                result[my_function['name']] = {'filtered_instructions': instructions, \"asm\": asm, \"address\": address}\n            except:\n                print(\"Error in functions: {} from {}\".format(my_function['name'], 
self.filename))\n                pass\n        return result\n\n    def close(self):\n        self.r2.quit()\n\n    def __exit__(self, exc_type, exc_value, traceback):\n        self.r2.quit()\n\n\n\n"
  },
  {
    "path": "asm_embedding/FunctionNormalizer.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport numpy as np\n\n\nclass FunctionNormalizer:\n\n    def __init__(self, max_instruction):\n        self.max_instructions = max_instruction\n\n    def normalize(self, f):\n        f = np.asarray(f[0:self.max_instructions])\n        length = f.shape[0]\n        if f.shape[0] < self.max_instructions:\n            f = np.pad(f, (0, self.max_instructions - f.shape[0]), mode='constant')\n        return f, length\n\n    def normalize_function_pairs(self, pairs):\n        lengths = []\n        new_pairs = []\n        for x in pairs:\n            f0, len0 = self.normalize(x[0])\n            f1, len1 = self.normalize(x[1])\n            lengths.append((len0, len1))\n            new_pairs.append((f0, f1))\n        return new_pairs, lengths\n\n    def normalize_functions(self, functions):\n        lengths = []\n        new_functions = []\n        for f in functions:\n            f, length = self.normalize(f)\n            lengths.append(length)\n            new_functions.append(f)\n        return new_functions, lengths\n"
  },
  {
    "path": "asm_embedding/InstructionsConverter.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport json\n\n\nclass InstructionsConverter:\n\n    def __init__(self, json_i2id):\n        f = open(json_i2id, 'r')\n        self.i2id = json.load(f)\n        f.close()\n\n    def convert_to_ids(self, instructions_list):\n        ret_array = []\n        # For each instruction we add +1 to its ID because the first\n        # element of the embedding matrix is zero\n        for x in instructions_list:\n            if x in self.i2id:\n                ret_array.append(self.i2id[x] + 1)\n            elif 'X_' in x:\n                # print(str(x) + \" is not a known x86 instruction\")\n                ret_array.append(self.i2id['X_UNK'] + 1)\n            elif 'A_' in x:\n                # print(str(x) + \" is not a known arm instruction\")\n                ret_array.append(self.i2id['A_UNK'] + 1)\n            else:\n                # print(\"There is a problem \" + str(x) + \" does not appear to be an asm or arm instruction\")\n                ret_array.append(self.i2id['X_UNK'] + 1)\n        return ret_array\n\n\n"
  },
  {
    "path": "asm_embedding/__init__.py",
    "content": ""
  },
  {
    "path": "dataset_creation/DataSplitter.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport json\nimport random\nimport sqlite3\nfrom tqdm import tqdm\n\n\nclass DataSplitter:\n\n    def __init__(self, db_name):\n        self.db_name = db_name\n\n    def create_pair_table(self, table_name):\n        conn = sqlite3.connect(self.db_name)\n        c = conn.cursor()\n        c.executescript(\"DROP TABLE IF EXISTS {} \".format(table_name))\n        c.execute(\"CREATE TABLE  {} (id INTEGER PRIMARY KEY, true_pair  TEXT, false_pair TEXT)\".format(table_name))\n        conn.commit()\n        conn.close()\n\n    def get_ids(self, set_type):\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        q = cur.execute(\"SELECT id FROM {}\".format(set_type))\n        ids = q.fetchall()\n        conn.close()\n        return ids\n\n    @staticmethod\n    def select_similar_cfg(id, provenance, ids, cursor):\n        q1 = cursor.execute('SELECT id FROM functions WHERE project=? AND file_name=? 
AND function_name=?', provenance)\n        candidates = [i[0] for i in q1.fetchall() if (i[0] != id and i[0] in ids)]\n        if len(candidates) == 0:\n            return None\n        id_similar = random.choice(candidates)\n        return id_similar\n\n    @staticmethod\n    def select_dissimilar_cfg(ids, provenance, cursor):\n        while True:\n            id_dissimilar = random.choice(ids)\n            q2 = cursor.execute('SELECT project, file_name, function_name FROM functions WHERE id=?', id_dissimilar)\n            res = q2.fetchone()\n            if res != provenance:\n                break\n        return id_dissimilar\n\n    def create_epoch_pairs(self, epoch_number, pairs_table, id_table):\n        random.seed(epoch_number)\n\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        ids = cur.execute(\"SELECT id FROM \" + id_table).fetchall()\n        id_set = set(ids)\n        true_pair = []\n        false_pair = []\n\n        for my_id in tqdm(ids):\n            q = cur.execute('SELECT project, file_name, function_name FROM functions WHERE id =?', my_id)\n            cfg_0_provenance = q.fetchone()\n            id_sim = DataSplitter.select_similar_cfg(my_id, cfg_0_provenance, id_set, cur)\n            id_dissim = DataSplitter.select_dissimilar_cfg(ids, cfg_0_provenance, cur)\n            if id_sim is not None and id_dissim is not None:\n                true_pair.append((my_id, id_sim))\n                false_pair.append((my_id, id_dissim))\n\n        true_pair = str(json.dumps(true_pair))\n        false_pair = str(json.dumps(false_pair))\n\n        cur.execute(\"INSERT INTO {} VALUES (?,?,?)\".format(pairs_table), (epoch_number, true_pair, false_pair))\n        conn.commit()\n        conn.close()\n\n    def create_pairs(self, total_epochs):\n\n        self.create_pair_table('train_pairs')\n        self.create_pair_table('validation_pairs')\n        self.create_pair_table('test_pairs')\n\n        for i in range(0, total_epochs):\n            print(\"Creating training pairs for epoch {} of {}\".format(i, total_epochs))\n            self.create_epoch_pairs(i, 'train_pairs', 'train')\n\n        print(\"Creating validation pairs\")\n        self.create_epoch_pairs(0, 'validation_pairs', 'validation')\n\n        print(\"Creating test pairs\")\n        self.create_epoch_pairs(0, 'test_pairs', 'test')\n\n    @staticmethod\n    def prepare_set(data_to_include, table_name, file_list, cur):\n        i = 0\n        while i < data_to_include and len(file_list) > 0:\n            choice = random.choice(file_list)\n            file_list.remove(choice)\n            q = cur.execute(\"SELECT id FROM functions WHERE project=? AND file_name=?\", choice)\n            data = q.fetchall()\n            cur.executemany(\"INSERT INTO {} VALUES (?)\".format(table_name), data)\n            i += len(data)\n        return file_list, i\n\n    def split_data(self, validation_dim, test_dim):\n        random.seed(12345)\n        conn = sqlite3.connect(self.db_name)\n        c = conn.cursor()\n\n        q = c.execute('''SELECT project, file_name FROM functions ''')\n        data = q.fetchall()\n\n        num_data = len(data)\n        num_test = int(num_data * test_dim)\n        num_validation = int(num_data * validation_dim)\n\n        filename = list(set(data))\n\n        c.execute(\"DROP TABLE IF EXISTS train\")\n        c.execute(\"DROP TABLE IF EXISTS test\")\n        c.execute(\"DROP TABLE IF EXISTS validation\")\n\n        c.execute(\"CREATE TABLE IF NOT EXISTS train (id INTEGER PRIMARY KEY)\")\n        c.execute(\"CREATE TABLE IF NOT EXISTS validation (id INTEGER PRIMARY KEY)\")\n        c.execute(\"CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY)\")\n\n        c.execute('''CREATE INDEX IF NOT EXISTS  my_index   ON functions(project, file_name, function_name)''')\n        c.execute('''CREATE INDEX IF NOT EXISTS  my_index_2 ON functions(project, file_name)''')\n\n        filename, test_num = DataSplitter.prepare_set(num_test, 'test', filename, conn.cursor())\n        conn.commit()\n        assert len(filename) > 0\n        filename, val_num = self.prepare_set(num_validation, 'validation', filename, conn.cursor())\n        conn.commit()\n        assert len(filename) > 0\n        _, train_num = self.prepare_set(num_data - num_test - num_validation, 'train', filename, conn.cursor())\n        conn.commit()\n\n        print(\"Train Size: {}\".format(train_num))\n        print(\"Validation Size:  {}\".format(val_num))\n        print(\"Test Size: {}\".format(test_num))\n"
  },
  {
    "path": "dataset_creation/DatabaseFactory.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nfrom asm_embedding.InstructionsConverter import InstructionsConverter\nfrom asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer\nimport json\nimport multiprocessing\nfrom multiprocessing import Pool\nfrom multiprocessing.dummy import Pool as ThreadPool\nimport os\nimport random\nimport signal\nimport sqlite3\nfrom tqdm import tqdm\n\n\nclass DatabaseFactory:\n\n    def __init__(self, db_name, root_path):\n        self.db_name = db_name\n        self.root_path = root_path\n\n    @staticmethod\n    def worker(item):\n        DatabaseFactory.analyze_file(item)\n        return 0\n\n    @staticmethod\n    def extract_function(graph_analyzer):\n        return graph_analyzer.extractAll()\n\n\n    @staticmethod\n    def insert_in_db(db_name, pool_sem, func, filename, function_name, instruction_converter):\n        path = filename.split(os.sep)\n        if len(path) < 4:\n            return\n        asm = func[\"asm\"]\n        instructions_list = func[\"filtered_instructions\"]\n        instruction_ids = json.dumps(instruction_converter.convert_to_ids(instructions_list))\n        pool_sem.acquire()\n        conn = sqlite3.connect(db_name)\n        cur = conn.cursor()\n        cur.execute('''INSERT INTO functions VALUES (?,?,?,?,?,?,?,?)''', (None,  # id\n                                                                         path[-4],  # project\n                                                                         path[-3],  # compiler\n                                                                         path[-2],  # optimization\n                                                                         path[-1],  # file_name\n                                                                         function_name,  # function_name\n                                                                         
asm,            # asm\n                                                                         len(instructions_list)) # num of instructions\n                    )\n        inserted_id = cur.lastrowid\n        cur.execute('''INSERT INTO filtered_functions VALUES (?,?)''', (inserted_id,\n                                                                        instruction_ids)\n                    )\n        conn.commit()\n        conn.close()\n        pool_sem.release()\n\n    @staticmethod\n    def analyze_file(item):\n        global pool_sem\n        os.setpgrp()\n\n        filename = item[0]\n        db = item[1]\n        use_symbol = item[2]\n        depth = item[3]\n        instruction_converter = item[4]\n\n        analyzer =  RadareFunctionAnalyzer(filename, use_symbol, depth)\n        p = ThreadPool(1)\n        res = p.apply_async(analyzer.analyze)\n\n        try:\n            result = res.get(120)\n        except multiprocessing.TimeoutError:\n                print(\"Aborting due to timeout:\" + str(filename))\n                print('Try to modify the timeout value in DatabaseFactory instruction  result = res.get(TIMEOUT)')\n                os.killpg(0, signal.SIGKILL)\n        except Exception:\n                print(\"Aborting due to error:\" + str(filename))\n                os.killpg(0, signal.SIGKILL)\n\n        for func in result:\n            DatabaseFactory.insert_in_db(db, pool_sem, result[func], filename, func, instruction_converter)\n\n        analyzer.close()\n\n        return 0\n\n    # Create the db where data are stored\n    def create_db(self):\n        print('Database creation...')\n        conn = sqlite3.connect(self.db_name)\n        conn.execute(''' CREATE TABLE  IF NOT EXISTS functions (id INTEGER PRIMARY KEY, \n                                                                project text, \n                                                                compiler text, \n                                                                
optimization text, \n                                                                file_name text, \n                                                                function_name text, \n                                                                asm text,\n                                                                num_instructions INTEGER)\n                    ''')\n        conn.execute('''CREATE TABLE  IF NOT EXISTS filtered_functions  (id INTEGER PRIMARY KEY, \n                                                                         instructions_list text)\n                     ''')\n        conn.commit()\n        conn.close()\n\n    # Scan the root directory to find all the file to analyze,\n    # query also the db for already analyzed files.\n    def scan_for_file(self, start):\n        file_list = []\n        # Scan recursively all the subdirectory\n        directories = os.listdir(start)\n        for item in directories:\n            item = os.path.join(start,item)\n            if os.path.isdir(item):\n                file_list.extend(self.scan_for_file(item + os.sep))\n            elif os.path.isfile(item) and item.endswith('.o'):\n                file_list.append(item)\n        return file_list\n\n    # Looks for already existing files in the database\n    # It returns a list of files that are not in the database\n    def remove_override(self, file_list):\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        q = cur.execute('''SELECT project, compiler, optimization, file_name FROM functions''')\n        names = q.fetchall()\n        names = [os.path.join(self.root_path, n[0], n[1], n[2], n[3]) for n in names]\n        names = set(names)\n        # If some files is already in the db remove it from the file list\n        if len(names) > 0:\n            print(str(len(names)) + ' Already in the database')\n        cleaned_file_list = []\n        for f in file_list:\n            if not(f in names):\n                
cleaned_file_list.append(f)\n\n        return cleaned_file_list\n\n    # Root function to create the db\n    def build_db(self, use_symbol, depth):\n        global pool_sem\n\n        pool_sem = multiprocessing.BoundedSemaphore(value=1)\n\n        instruction_converter = InstructionsConverter(\"data/i2v/word2id.json\")\n        self.create_db()\n        file_list = self.scan_for_file(self.root_path)\n\n        print('Found ' + str(len(file_list)) + ' files during the scan')\n        file_list = self.remove_override(file_list)\n        print('Found ' + str(len(file_list)) + ' files to analyze')\n        random.shuffle(file_list)\n\n        t_args = [(f, self.db_name, use_symbol, depth, instruction_converter) for f in file_list]\n\n        # Start a parallel pool to analyze files\n        p = Pool(processes=None, maxtasksperchild=20)\n        for _ in tqdm(p.imap_unordered(DatabaseFactory.worker, t_args), total=len(file_list)):\n            pass\n\n        p.close()\n        p.join()\n\n\n"
  },
  {
    "path": "dataset_creation/ExperimentUtil.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport argparse\nfrom dataset_creation import DatabaseFactory, DataSplitter, FunctionsEmbedder\nfrom utils.utils import print_safe\n\n\ndef debug_msg():\n    msg = \"SAFE DATABASE UTILITY\"\n    msg += \"-------------------------------------------------\\n\"\n    msg += \"This program is an utility to save data into an sqlite database with SAFE \\n\\n\"\n    msg += \"There are three main command: \\n\"\n    msg += \"BUILD:  It create a db with two tables: functions, filtered_functions. \\n\"\n    msg += \"        In the first table there are all the functions extracted from the executable with their hex code.\\n\"\n    msg += \"        In the second table functions are converted to i2v representation. \\n\"\n    msg += \"SPLIT:  Data are splitted into train validation and test set. \" \\\n           \"        Then it generate the pairs for the training of the network.\\n\"\n    msg += \"EMBEDD: Generate the embeddings of each function in the database using a trained SAFE model\\n\\n\"\n    msg += \"If you want to train the network use build + split\"\n    msg += \"If you want to create a knowledge base for the binary code search engine use build + embedd\"\n    msg += \"This program has been written by the SAFE team.\\n\"\n    msg += \"-------------------------------------------------\"\n    return msg\n\n\ndef build_configuration(db_name, root_dir, use_symbols, callee_depth):\n    msg = \"Database creation options: \\n\"\n    msg += \" - Database Name: {} \\n\".format(db_name)\n    msg += \" - Root dir: {} \\n\".format(root_dir)\n    msg += \" - Use symbols: {} \\n\".format(use_symbols)\n    msg += \" - Callee depth: {} \\n\".format(callee_depth)\n    return msg\n\n\ndef split_configuration(db_name, val_split, test_split, epochs):\n    msg = \"Splitting options: \\n\"\n    msg += \" - Database Name: {} 
\\n\".format(db_name)\n    msg += \" - Validation Size: {} \\n\".format(val_split)\n    msg += \" - Test Size: {} \\n\".format(test_split)\n    msg += \" - Epochs: {} \\n\".format(epochs)\n    return msg\n\n\ndef embedd_configuration(db_name, model, batch_size, max_instruction, embeddings_table):\n    msg = \"Embedding options: \\n\"\n    msg += \" - Database Name: {} \\n\".format(db_name)\n    msg += \" - Model: {} \\n\".format(model)\n    msg += \" - Batch Size: {} \\n\".format(batch_size)\n    msg += \" - Max instructions per function: {} \\n\".format(max_instruction)\n    msg += \" - Table for saving embeddings: {}.\".format(embeddings_table)\n    return msg\n\n\nif __name__ == '__main__':\n\n    print_safe()\n\n    parser = argparse.ArgumentParser(description=debug_msg())\n\n    parser.add_argument(\"-db\", \"--db\", help=\"Name of the database to create\", required=True)\n\n    parser.add_argument(\"-b\", \"--build\", help=\"Build db by disassembling executables\", action=\"store_true\")\n    parser.add_argument(\"-s\", \"--split\", help=\"Perform data splitting for training\", action=\"store_true\")\n    parser.add_argument(\"-e\", \"--embed\", help=\"Compute function embeddings\",         action=\"store_true\")\n\n    parser.add_argument(\"-dir\", \"--dir\",     help=\"Root path of the directory to scan\")\n    parser.add_argument(\"-sym\", \"--symbols\", help=\"Use it if you want to use symbols\", action=\"store_true\")\n    parser.add_argument(\"-dep\", \"--depth\",   help=\"Recursive depth for analysis\",      default=0, type=int)\n\n    parser.add_argument(\"-test\", \"--test_size\", help=\"Test set size [0-1]\",            type=float, default=0.2)\n    parser.add_argument(\"-val\",  \"--val_size\",  help=\"Validation set size [0-1]\",      type=float, default=0.2)\n    parser.add_argument(\"-epo\",  \"--epochs\",    help=\"# Epochs to generate pairs for\", type=int,    default=25)\n\n    parser.add_argument(\"-mod\", \"--model\",            help=\"Model for 
embedding generation\")\n    parser.add_argument(\"-bat\", \"--batch_size\",       help=\"Batch size for function embeddings\", type=int, default=500)\n    parser.add_argument(\"-max\", \"--max_instruction\",  help=\"Maximum instructions per function\", type=int,   default=150)\n    parser.add_argument(\"-etb\", \"--embeddings_table\", help=\"Name for the table that contains embeddings\",\n                        default=\"safe_embeddings\")\n\n    try:\n        args = parser.parse_args()\n    except SystemExit:\n        parser.print_help()\n        print(debug_msg())\n        exit(0)\n\n    if args.build:\n        print(\"Disassembling files and creating dataset\")\n        print(build_configuration(args.db, args.dir, args.symbols, args.depth))\n        factory = DatabaseFactory.DatabaseFactory(args.db, args.dir)\n        factory.build_db(args.symbols, args.depth)\n\n    if args.split:\n        print(\"Splitting data and generating epoch pairs\")\n        print(split_configuration(args.db, args.val_size, args.test_size, args.epochs))\n        splitter = DataSplitter.DataSplitter(args.db)\n        splitter.split_data(args.val_size, args.test_size)\n        splitter.create_pairs(args.epochs)\n\n    if args.embed:\n        print(\"Computing embeddings for functions in db\")\n        print(embedd_configuration(args.db, args.model, args.batch_size, args.max_instruction, args.embeddings_table))\n        embedder = FunctionsEmbedder.FunctionsEmbedder(args.model, args.batch_size, args.max_instruction)\n        embedder.compute_and_save_embeddings_from_db(args.db, args.embeddings_table)\n\n    exit(0)\n"
  },
  {
    "path": "dataset_creation/FunctionsEmbedder.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nfrom asm_embedding.FunctionNormalizer import FunctionNormalizer\nimport json\nfrom neural_network.SAFEEmbedder import SAFEEmbedder\nimport numpy as np\nimport sqlite3\nfrom tqdm import tqdm\n\n\nclass FunctionsEmbedder:\n\n    def __init__(self, model, batch_size, max_instruction):\n        self.batch_size = batch_size\n        self.normalizer = FunctionNormalizer(max_instruction)\n        self.safe = SAFEEmbedder(model)\n        self.safe.loadmodel()\n        self.safe.get_tensor()\n\n    def compute_embeddings(self, functions):\n        functions, lengths = self.normalizer.normalize_functions(functions)\n        embeddings = self.safe.embedd(functions, lengths)\n        return embeddings\n\n    @staticmethod\n    def create_table(db_name, table_name):\n        conn = sqlite3.connect(db_name)\n        c = conn.cursor()\n        c.execute(\"CREATE TABLE IF NOT EXISTS {} (id INTEGER PRIMARY KEY, {}  TEXT)\".format(table_name, table_name))\n        conn.commit()\n        conn.close()\n\n    def compute_and_save_embeddings_from_db(self, db_name, table_name):\n        FunctionsEmbedder.create_table(db_name, table_name)\n        conn = sqlite3.connect(db_name)\n        cur = conn.cursor()\n        q = cur.execute(\"SELECT id FROM functions WHERE id not in (SELECT id from {})\".format(table_name))\n        ids = q.fetchall()\n\n        for i in tqdm(range(0, len(ids), self.batch_size)):\n            functions = []\n            batch_ids = ids[i:i+self.batch_size]\n            for my_id in batch_ids:\n                q = cur.execute(\"SELECT instructions_list FROM filtered_functions where id=?\", my_id)\n                functions.append(json.loads(q.fetchone()[0]))\n            embeddings = self.compute_embeddings(functions)\n\n            for l, id in enumerate(batch_ids):\n                cur.execute(\"INSERT INTO {} VALUES 
(?,?)\".format(table_name), (id[0], np.array2string(embeddings[l])))\n            conn.commit()\n"
  },
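`FunctionsEmbedder` above stores each embedding in sqlite as the text produced by `np.array2string`, and `FunctionSearchEngine.embeddingToNp` later reverses that formatting with `np.fromstring`. A minimal sketch of this text round-trip (the helper names `to_db_text`/`from_db_text` are illustrative, not part of the repo):

```python
import numpy as np

def to_db_text(embedding):
    # How FunctionsEmbedder serializes an embedding before the INSERT.
    return np.array2string(embedding)

def from_db_text(text):
    # How FunctionSearchEngine.embeddingToNp parses it back:
    # strip newlines and brackets, then read space-separated floats.
    text = text.replace('\n', '').replace('[', '').replace(']', '')
    return np.fromstring(text, dtype=float, sep=' ')

emb = np.array([0.25, -1.5, 3.0])
assert np.allclose(from_db_text(to_db_text(emb)), emb)
```

Note that `np.array2string` summarizes very long arrays with an ellipsis by default (threshold of 1000 elements), so the round-trip relies on the embeddings being short enough to print in full.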
  {
    "path": "dataset_creation/__init__.py",
    "content": ""
  },
  {
    "path": "dataset_creation/convertDB.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport sqlite3\nimport json\nfrom networkx.readwrite import json_graph\nimport logging\nfrom tqdm import tqdm\nfrom asm_embedding.InstructionsConverter import InstructionsConverter\n\n\n# Create the db where data are stored\ndef create_db(db_name):\n    print('Database creation...')\n    conn = sqlite3.connect(db_name)\n    conn.execute(''' CREATE TABLE  IF NOT EXISTS functions (id INTEGER PRIMARY KEY, \n                                                            project text, \n                                                            compiler text, \n                                                            optimization text, \n                                                            file_name text, \n                                                            function_name text, \n                                                            asm text,\n                                                            num_instructions INTEGER)\n                ''')\n    conn.execute('''CREATE TABLE  IF NOT EXISTS filtered_functions  (id INTEGER PRIMARY KEY, \n                                                                     instructions_list text)\n                 ''')\n    conn.commit()\n    conn.close()\n\n\ndef reverse_graph(cfg, lstm_cfg):\n    instructions = []\n    asm = \"\"\n    node_addr = list(cfg.nodes())\n    node_addr.sort()\n    nodes = cfg.nodes(data=True)\n    lstm_nodes = lstm_cfg.nodes(data=True)\n    for addr in node_addr:\n        a = nodes[addr][\"asm\"]\n        if a is not None:\n            asm += a\n        instructions.extend(lstm_nodes[addr]['features'])\n    return instructions, asm\n\n\ndef copy_split(old_cur, new_cur, table):\n    q = old_cur.execute(\"SELECT id FROM {}\".format(table))\n    iii = q.fetchall()\n    print(\"Copying table {}\".format(table))\n    for ii in tqdm(iii):\n    
    new_cur.execute(\"INSERT INTO {} VALUES (?)\".format(table), ii)\n\n\ndef copy_table(old_cur, new_cur, table_old, table_new):\n    q = old_cur.execute(\"SELECT * FROM {}\".format(table_old))\n    iii = q.fetchall()\n    print(\"Copying table {} to {}\".format(table_old, table_new))\n    for ii in tqdm(iii):\n        new_cur.execute(\"INSERT INTO {} VALUES (?,?,?)\".format(table_new), ii)\n\nlogger = logging.getLogger()\nlogger.setLevel(logging.DEBUG)\n\ndb = \"/home/lucamassarelli/binary_similarity_data/databases/big_dataset_X86.db\"\nnew_db = \"/home/lucamassarelli/binary_similarity_data/new_databases/big_dataset_X86_new.db\"\n\ncreate_db(new_db)\n\nconn_old = sqlite3.connect(db)\nconn_new = sqlite3.connect(new_db)\n\ncur_old = conn_old.cursor()\ncur_new = conn_new.cursor()\n\nq = cur_old.execute(\"SELECT id FROM functions\")\nids = q.fetchall()\nconverter = InstructionsConverter()\n\nfor my_id in tqdm(ids):\n\n    q0 = cur_old.execute(\"SELECT id, project, compiler, optimization, file_name, function_name, cfg FROM functions WHERE id=?\", my_id)\n    meta = q0.fetchone()\n\n    q1 = cur_old.execute(\"SELECT lstm_cfg FROM lstm_cfg WHERE id=?\", my_id)\n    cfg = json_graph.adjacency_graph(json.loads(meta[6]))\n    lstm_cfg = json_graph.adjacency_graph(json.loads(q1.fetchone()[0]))\n    instructions, asm = reverse_graph(cfg, lstm_cfg)\n    values = meta[0:6] + (asm, len(instructions))\n    q_n = cur_new.execute(\"INSERT INTO functions VALUES (?,?,?,?,?,?,?,?)\", values)\n    converted_instruction = json.dumps(converter.convert_to_ids(instructions))\n    q_n = cur_new.execute(\"INSERT INTO filtered_functions VALUES (?,?)\", (my_id[0], converted_instruction))\n\nconn_new.commit()\n\ncur_new.execute(\"CREATE TABLE train (id INTEGER PRIMARY KEY) \")\ncur_new.execute(\"CREATE TABLE validation (id INTEGER PRIMARY KEY) \")\ncur_new.execute(\"CREATE TABLE test (id INTEGER PRIMARY KEY) \")\nconn_new.commit()\n\ncopy_split(cur_old, cur_new, 
\"train\")\nconn_new.commit()\ncopy_split(cur_old, cur_new, \"validation\")\nconn_new.commit()\ncopy_split(cur_old, cur_new, \"test\")\nconn_new.commit()\n\ncur_new.execute(\"CREATE TABLE  train_pairs (id INTEGER PRIMARY KEY, true_pair  TEXT, false_pair TEXT)\")\ncur_new.execute(\"CREATE TABLE  validation_pairs (id INTEGER PRIMARY KEY, true_pair  TEXT, false_pair TEXT)\")\ncur_new.execute(\"CREATE TABLE  test_pairs (id INTEGER PRIMARY KEY, true_pair  TEXT, false_pair TEXT)\")\nconn_new.commit()\n\ncopy_table(cur_old, cur_new, \"train_couples\", \"train_pairs\")\nconn_new.commit()\ncopy_table(cur_old, cur_new, \"validation_couples\", \"validation_pairs\")\nconn_new.commit()\ncopy_table(cur_old, cur_new, \"test_couples\", \"test_pairs\")\nconn_new.commit()\n\nconn_new.close()"
  },
  {
    "path": "download_model.sh",
    "content": "#!/usr/bin/env bash\n\npython3 downloader.py -b\necho 'Model downloaded and, hopefully, ready to run'\n"
  },
  {
    "path": "downloader.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\n\nimport argparse\nimport os\nimport sys\nfrom subprocess import call\n\nclass Downloader:\n\n    def __init__(self):\n        parser = argparse.ArgumentParser(description='SAFE downloader')\n\n        parser.add_argument(\"-m\", \"--model\", dest=\"model\", help=\"Download the trained SAFE model for x86\",\n                            action=\"store_true\",\n                            required=False)\n\n        parser.add_argument(\"-i2v\", \"--i2v\", dest=\"i2v\", help=\"Download the i2v dictionary and embedding matrix\",\n                            action=\"store_true\",\n                            required=False)\n\n        parser.add_argument(\"-b\", \"--bundle\", dest=\"bundle\",\n                            help=\"Download all files necessary to run the model\",\n                            action=\"store_true\",\n                            required=False)\n\n        parser.add_argument(\"-td\", \"--train_data\", dest=\"train_data\",\n                            help=\"Download the files necessary to train the model (It takes a lot of space!)\",\n                            action=\"store_true\",\n                            required=False)\n\n        args = parser.parse_args()\n\n        self.download_model = (args.model or args.bundle)\n        self.download_i2v = (args.i2v or args.bundle)\n        self.download_train = args.train_data\n\n        if not (self.download_model or self.download_i2v or self.download_train):\n            parser.print_help(sys.__stdout__)\n\n        self.url_model = \"https://drive.google.com/file/d/1Kwl8Jy-g9DXe1AUjUZDhJpjRlDkB4NBs/view?usp=sharing\"\n        self.url_i2v = \"https://drive.google.com/file/d/1CqJVGYbLDEuJmJV6KH4Dzzhy-G12GjGP\"\n        self.url_train = 
['https://drive.google.com/file/d/1sNahtLTfZY5cxPaYDUjqkPTK0naZ45SH/view?usp=sharing','https://drive.google.com/file/d/16D5AVDux_Q8pCVIyvaMuiL2cw2V6gtLc/view?usp=sharing','https://drive.google.com/file/d/1cBRda8fYdqHtzLwstViuwK6U5IVHad1N/view?usp=sharing']\n        self.train_name = ['AMD64ARMOpenSSL.tar.bz2','AMD64multipleCompilers.tar.bz2','AMD64PostgreSQL.tar.bz2']\n        self.base_path = \"data\"\n        self.path_i2v = os.path.join(self.base_path, \"\")\n        self.path_model = os.path.join(self.base_path, \"\")\n        self.path_train_data = os.path.join(self.base_path, \"\")\n        self.i2v_compress_name = 'i2v.tar.bz2'\n        self.model_compress_name = 'model.tar.bz2'\n        self.datasets_compress_name = 'safe.pb'\n\n    @staticmethod\n    def download_file(id, path):\n        try:\n            print(\"Downloading from \" + str(id) + \" into \" + str(path))\n            call(['./godown.pl', id, path])\n        except Exception as e:\n            print(\"Error downloading file at url: \" + str(id))\n            print(e)\n\n    @staticmethod\n    def decompress_file(file_src, file_path):\n        try:\n            call(['tar', '-xvf', file_src, '-C', file_path])\n        except Exception as e:\n            print(\"Error decompressing file: \" + str(file_src))\n            print('you need the tar command with bzip2 support')\n            print(e)\n\n    def download(self):\n        print('Making the godown.pl script executable, thanks: https://github.com/circulosmeos/gdown.pl')\n        call(['chmod', '+x', 'godown.pl'])\n        print(\"SAFE --- downloading models\")\n\n        if self.download_i2v:\n            print(\"Downloading i2v model.... 
in the folder data/i2v/\")\n            if not os.path.exists(self.path_i2v):\n                os.makedirs(self.path_i2v)\n            Downloader.download_file(self.url_i2v, os.path.join(self.path_i2v, self.i2v_compress_name))\n            print(\"Decompressing i2v model and placing it in \" + str(self.path_i2v))\n            Downloader.decompress_file(os.path.join(self.path_i2v, self.i2v_compress_name), self.path_i2v)\n\n        if self.download_model:\n            print(\"Downloading the SAFE model... in the folder data\")\n            if not os.path.exists(self.path_model):\n                os.makedirs(self.path_model)\n            Downloader.download_file(self.url_model, os.path.join(self.path_model, self.datasets_compress_name))\n            #print(\"Decompressing SAFE model and placing it in \" + str(self.path_model))\n            #Downloader.decompress_file(os.path.join(self.path_model, self.model_compress_name), self.path_model)\n\n        if self.download_train:\n            print(\"Downloading the train data.... in the folder data\")\n            if not os.path.exists(self.path_train_data):\n                os.makedirs(self.path_train_data)\n            for i, x in enumerate(self.url_train):\n                print(\"Downloading dataset \" + str(self.train_name[i]))\n                Downloader.download_file(x, os.path.join(self.path_train_data, self.train_name[i]))\n            #print(\"Decompressing the train data and placing it in \" + str(self.path_train_data))\n            #Downloader.decompress_file(os.path.join(self.path_train_data, self.datasets_compress_name), self.path_train_data)\n\nif __name__ == '__main__':\n    a = Downloader()\n    a.download()"
  },
  {
    "path": "function_search/EvaluateSearchEngine.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\n\nfrom FunctionSearchEngine import FunctionSearchEngine\nfrom sklearn import metrics\nimport sqlite3\n\nfrom multiprocessing import Process\nimport math\n\nimport warnings\nimport random\nimport json\n\nclass SearchEngineEvaluator:\n\n    def __init__(self, db_name, table, limit=None, k=None):\n        self.tables = table\n        self.db_name = db_name\n        self.SE = FunctionSearchEngine(db_name, table, limit=limit)\n        self.k = k\n        self.number_similar = {}\n\n    def do_search(self, target_db_name, target_fcn_ids):\n        self.SE.load_target(target_db_name, target_fcn_ids)\n        self.SE.pp_search(50)\n\n    def calc_auc(self, target_db_name, target_fcn_ids):\n        self.SE.load_target(target_db_name, target_fcn_ids)\n        result = self.SE.auc()\n        print(result)\n\n    #\n    # This method searches for all target functions in the DB; in our tests we take num functions compiled with the given compiler and opt.\n    # Moreover, it populates the self.number_similar dictionary, which contains the number of similar functions for each target.\n    #\n    def find_target_fcn(self, compiler, opt, num):\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        q = cur.execute(\"SELECT id, project, file_name, function_name FROM functions WHERE compiler=? 
AND optimization=?\", (compiler, opt))\n        res = q.fetchall()\n        ids = [i[0] for i in res]\n        true_labels = [l[1]+\"/\"+l[2]+\"/\"+l[3] for l in res]\n        n_ids = []\n        n_true_labels = []\n        num = min(num, len(ids))\n\n        for i in range(0, num):\n            index = random.randrange(len(ids))\n            n_ids.append(ids[index])\n            n_true_labels.append(true_labels[index])\n            f_name = true_labels[index].split('/')[2]\n            fi_name = true_labels[index].split('/')[1]\n            q = cur.execute(\"SELECT num FROM count_func WHERE file_name='{}' and function_name='{}'\".format(fi_name, f_name))\n            f = q.fetchone()\n            # n_sim (rather than reusing num) so the loop bound above is not clobbered\n            if f is not None:\n                n_sim = int(f[0])\n            else:\n                n_sim = 0\n            self.number_similar[true_labels[index]] = n_sim\n\n        return n_ids, n_true_labels\n\n    @staticmethod\n    def functions_ground_truth(labels, indices, values, true_label):\n        y_true = []\n        y_score = []\n        for i, e in enumerate(indices):\n            y_score.append(float(values[i]))\n            l = labels[e]\n            if l == true_label:\n                y_true.append(1)\n            else:\n                y_true.append(0)\n        return y_true, y_score\n\n    # This method executes the test:\n    # it selects the target functions and looks them up in the entire db.\n    # The outcome is a json file containing the top 200 similar functions for each target function.\n    # The json file is an array with one entry per target function;\n    # each entry is a triple (t0, t1, t2):\n    # t0: an array that contains 1 at entry j if entry j is similar to the target, 0 otherwise\n    # t1: the number of similar functions to the target in the whole db\n    # t2: an array that at entry j contains the similarity score of the j-th most similar function to the target.\n    #\n    #\n    def evaluate_precision_on_all_functions(self, compiler, 
opt):\n        target_fcn_ids, true_labels = self.find_target_fcn(compiler, opt, 10000)\n        batch = 1000\n        labels = self.SE.trunc_labels\n\n        info = []\n\n        for i in range(0, len(target_fcn_ids), batch):\n            if i + batch > len(target_fcn_ids):\n                batch = len(target_fcn_ids) - i\n            target = self.SE.load_target(self.db_name, target_fcn_ids[i:i+batch])\n            top_k = self.SE.top_k(target, self.k)\n\n            for j in range(0, batch):\n                a, b = SearchEngineEvaluator.functions_ground_truth(labels, top_k.indices[j, :], top_k.values[j, :], true_labels[i+j])\n\n                info.append((a, self.number_similar[true_labels[i + j]], b))\n\n        with open(compiler+'_'+opt+'_'+self.tables+'_top200.json', 'w') as outfile:\n            json.dump(info, outfile)\n\n\ndef test(dbName, table, opt, x, k):\n\n    print(\"k:{} - Table: {} - Opt: {}\".format(k, table, opt))\n\n    SEV = SearchEngineEvaluator(dbName, table, limit=2000000, k=k)\n    SEV.evaluate_precision_on_all_functions(x, opt)\n\n    print(\"-------------------------------------\")\n\n\nif __name__ == '__main__':\n\n    random.seed(12345)\n\n    dbName = '../data/AMD64PostgreSQL.db'\n    table = ['safe_embeddings']\n    opt = [\"O0\", \"O1\", \"O2\", \"O3\"]\n    for x in ['gcc-4.8', \"clang-4.0\", 'gcc-7', 'clang-6.0']:\n        for t in table:\n            for o in opt:\n                p = Process(target=test, args=(dbName, t, o, x, 200))\n                p.start()\n                p.join()\n"
  },
  {
    "path": "function_search/FunctionSearchEngine.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport sys\nimport numpy as np\nimport sqlite3\nimport pandas as pd\nimport tqdm\nimport tensorflow as tf\n\nif sys.version_info >= (3, 0):\n    from functools import reduce\n\n\npd.set_option('display.max_columns', None)\npd.set_option('display.max_rows', None)\npd.set_option('display.max_seq_items', None)\npd.set_option('display.max_colwidth', 500)\npd.set_option('expand_frame_repr', True)\n\nclass TopK:\n\n    #\n    # This class computes the similarities between the targets and the list of functions we are searching over.\n    # This is done with matrix multiplication and tensorflow's top_k op.\n    def __init__(self):\n        self.graph = tf.Graph()\n\n    def loads_embeddings_SE(self, lista_embeddings):\n        with self.graph.as_default():\n            tf.set_random_seed(1234)\n            dim = lista_embeddings[0].shape[0]\n            ll = np.asarray(lista_embeddings)\n            self.matrix = tf.constant(ll, name='matrix_embeddings', dtype=tf.float32)\n            self.target = tf.placeholder(\"float\", [None, dim], name='target_embedding')\n            self.sim = tf.matmul(self.target, self.matrix, transpose_b=True, name=\"embeddings_similarities\")\n            self.k = tf.placeholder(tf.int32, shape=(), name='k')\n            self.top_k = tf.nn.top_k(self.sim, self.k, sorted=True)\n            self.session = tf.Session()\n\n    def topK(self, k, target):\n        with self.graph.as_default():\n            tf.set_random_seed(1234)\n            return self.session.run(self.top_k, {self.target: target, self.k: int(k)})\n\nclass FunctionSearchEngine:\n\n    def __init__(self, db_name, table_name, limit=None):\n        self.s2v = TopK()\n        self.db_name = db_name\n        self.table_name = table_name\n        self.labels = []\n        self.trunc_labels = []\n        self.lista_embedding = 
[]\n        self.ids = []\n        self.n_similar=[]\n        self.ret = {}\n        self.precision = None\n\n        print(\"Query for ids\")\n        conn = sqlite3.connect(db_name)\n        cur = conn.cursor()\n        if limit is None:\n            q = cur.execute(\"SELECT id, project, compiler, optimization, file_name, function_name FROM functions\")\n            res = q.fetchall()\n        else:\n            q = cur.execute(\"SELECT id, project, compiler, optimization, file_name, function_name FROM functions LIMIT {}\".format(limit))\n            res = q.fetchall()\n\n        for item in tqdm.tqdm(res, total=len(res)):\n            q = cur.execute(\"SELECT \" + self.table_name + \" FROM \" + self.table_name + \" WHERE id=?\", (item[0],))\n            e = q.fetchone()\n            if e is None:\n                continue\n\n            self.lista_embedding.append(self.embeddingToNp(e[0]))\n\n            element = \"{}/{}/{}\".format(item[1], item[4], item[5])\n            self.trunc_labels.append(element)\n\n            element = \"{}@{}/{}/{}/{}\".format(item[5], item[1], item[2], item[3], item[4])\n            self.labels.append(element)\n            self.ids.append(item[0])\n\n        conn.close()\n\n        self.s2v.loads_embeddings_SE(self.lista_embedding)\n        self.num_funcs = len(self.lista_embedding)\n\n    def load_target(self, target_db_name, target_fcn_ids, calc_mean=False):\n        conn = sqlite3.connect(target_db_name)\n        cur = conn.cursor()\n        mean = None\n        for id in target_fcn_ids:\n\n            if target_db_name == self.db_name and id in self.ids:\n                idx = self.ids.index(id)\n                e = self.lista_embedding[idx]\n            else:\n                q = cur.execute(\"SELECT \" + self.table_name + \" FROM \" + self.table_name + \" WHERE id=?\", (id,))\n                e = q.fetchone()\n                e = self.embeddingToNp(e[0])\n\n\n            if mean is None:\n                mean = 
e.reshape([e.shape[0], 1])\n            else:\n                mean = np.hstack((mean, e.reshape(e.shape[0], 1)))\n\n        if calc_mean:\n            target = [np.mean(mean, axis=1)]\n        else:\n            target = mean.T\n        self.target = target  # remember the last loaded target for pp_search / search\n        return target\n\n    def embeddingToNp(self, e):\n        e = e.replace('\\n', '')\n        e = e.replace('[', '')\n        e = e.replace(']', '')\n        emb = np.fromstring(e, dtype=float, sep=' ')\n        return emb\n\n    def top_k(self, target, k=None):\n        if k is not None:\n            top_k = self.s2v.topK(k, target)\n        else:\n            top_k = self.s2v.topK(len(self.lista_embedding), target)\n        return top_k\n\n    def pp_search(self, k):\n        result = pd.DataFrame(columns=['Id', 'Name', 'Score'])\n        top_k = self.s2v.topK(k, self.target)\n        for i, e in enumerate(top_k.indices[0]):\n            result = result.append({'Id': self.ids[e], 'Name': self.labels[e], 'Score': top_k.values[0][i]}, ignore_index=True)\n        print(result)\n\n    def search(self, k):\n        result = []\n        top_k = self.s2v.topK(k, self.target)\n        for i, e in enumerate(top_k.indices[0]):\n            result.append({'Id': self.ids[e], 'Name': self.labels[e], 'Score': top_k.values[0][i]})\n        return result\n"
  },
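The `TopK` class above multiplies the target embeddings against the whole embedding matrix and takes `tf.nn.top_k` of the result. The same matmul-then-top-k idea in plain NumPy, useful as a mental model (a sketch, not the class the repo uses; it assumes rows are L2-normalized so the dot product equals cosine similarity):

```python
import numpy as np

def top_k_similar(targets, database, k):
    # Same trick as TopK: one similarity matrix for all targets,
    # then the k best database rows per target, best-first.
    sim = targets @ database.T                 # (n_targets, n_db)
    idx = np.argsort(-sim, axis=1)[:, :k]      # best-first indices
    scores = np.take_along_axis(sim, idx, axis=1)
    return scores, idx

db = np.eye(3)                                 # three orthonormal "embeddings"
target = np.array([[0.9, 0.1, 0.0]])
scores, idx = top_k_similar(target, db, k=2)
assert list(idx[0]) == [0, 1]                  # rows 0 and 1 are the closest
```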
  {
    "path": "function_search/__init__.py",
    "content": ""
  },
  {
    "path": "function_search/fromJsonSearchToPlot.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\nimport matplotlib.pyplot as plt\nimport json\nimport math\nimport numpy as np\nfrom multiprocessing import Pool\n\n\ndef find_dcg(element_list):\n    dcg_score = 0.0\n    for j, sim in enumerate(element_list):\n        dcg_score += float(sim) / math.log(j + 2)\n    return dcg_score\n\n\ndef count_ones(element_list):\n    return len([x for x in element_list if x == 1])\n\n\ndef extract_info(file_1):\n    with open(file_1, 'r') as f:\n        data1 = json.load(f)\n\n    performance1 = []\n\n    average_recall_k1 = []\n    precision_at_k1 = []\n\n    for f_index in range(0, len(data1)):\n\n        f1 = data1[f_index][0]\n        pf1 = data1[f_index][1]\n\n        tp1 = []\n\n        recall_p1 = []\n        precision_p1 = []\n        # we start from k = 1 to exclude the query function itself\n        for k in range(1, 200):\n            cut1 = f1[0:k]\n            dcg1 = find_dcg(cut1)\n            ideal1 = find_dcg(([1] * (pf1) + [0] * (k - pf1))[0:k])\n\n            p1k = float(count_ones(cut1))\n\n            tp1.append(dcg1 / ideal1)\n            recall_p1.append(p1k / pf1)\n            precision_p1.append(p1k / k)\n\n        performance1.append(tp1)\n        average_recall_k1.append(recall_p1)\n        precision_at_k1.append(precision_p1)\n\n    avg_p1 = np.average(performance1, axis=0)\n    avg_p10 = np.average(average_recall_k1, axis=0)\n    average_precision = np.average(precision_at_k1, axis=0)\n    return avg_p1, avg_p10, average_precision\n\n\ndef print_graph(info1, file_name, label_y, title_1, p):\n    fig, ax = plt.subplots()\n    ax.plot(range(0, len(info1)), info1, color='b', label=title_1)\n    ax.legend(loc=p, shadow=True, fontsize='x-large')\n    plt.xlabel(\"Number of Nearest Results\")\n    plt.ylabel(label_y)\n    fname = file_name\n    plt.savefig(fname)\n    plt.close(fig)\n\n\ndef compare_and_print(file):\n    filename 
= file.split('_')[0] + '_' + file.split('_')[1]\n    t_short = filename\n    label_1 = t_short + '_' + file.split('_')[3]\n\n    avg_p1, recall_p1, precision1 = extract_info(file)\n\n    fname = filename + '_nDCG.pdf'\n    print_graph(avg_p1, fname, 'nDCG', label_1, 'upper right')\n\n    fname = filename + '_recall.pdf'\n    print_graph(recall_p1, fname, 'Recall', label_1, 'lower right')\n\n    fname = filename + '_precision.pdf'\n    print_graph(precision1, fname, 'Precision', label_1, 'upper right')\n\n    return avg_p1, recall_p1, precision1\n\n\ne1 = 'embeddings_safe'\n\nopt = ['O0', 'O1', 'O2', 'O3']\ncompilers = ['gcc-7', 'gcc-4.8', 'clang-6.0', 'clang-4.0']\nvalues = []\nfor o in opt:\n    for c in compilers:\n        f0 = '' + c + '_' + o + '_' + e1 + '_top200.json'\n        values.append(f0)\n\np = Pool(4)\nresult = p.map(compare_and_print, values)\n\navg_p1 = []\nrecal_p1 = []\npre_p1 = []\n\navg_p2 = []\nrecal_p2 = []\npre_p2 = []\n\nfor t in result:\n    avg_p1.append(t[0])\n    recal_p1.append(t[1])\n    pre_p1.append(t[2])\n\navg_p1 = np.average(avg_p1, axis=0)\nrecal_p1 = np.average(recal_p1, axis=0)\npre_p1 = np.average(pre_p1, axis=0)\n\nprint_graph(avg_p1[0:20], 'nDCG.pdf', 'normalized DCG', 'SAFE', 'upper right')\nprint_graph(recal_p1, 'recall.pdf', 'recall', 'SAFE', 'lower right')\nprint_graph(pre_p1[0:20], 'precision.pdf', 'precision', 'SAFE', 'upper right')\n"
  },
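`find_dcg` above computes DCG with a natural logarithm over 0-based ranks, and `extract_info` divides it by the DCG of an ideal ranking to obtain nDCG. A tiny self-contained check of that formula (the `ranking`/`ideal` inputs are made-up examples, not repo data):

```python
import math

def find_dcg(element_list):
    # Same formula as fromJsonSearchToPlot.find_dcg:
    # DCG = sum_j rel_j / ln(j + 2), with 0-based rank j.
    return sum(float(sim) / math.log(j + 2) for j, sim in enumerate(element_list))

ranking = [1, 0, 1]   # hits at positions 1 and 3 of the result list
ideal = [1, 1, 0]     # the best possible ordering of the same two hits
ndcg = find_dcg(ranking) / find_dcg(ideal)
assert 0.0 < ndcg < 1.0   # imperfect but non-empty ranking
```

Because both numerator and denominator share the same log discounts, nDCG is 1.0 exactly when every relevant function is ranked ahead of every irrelevant one.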
  {
    "path": "godown.pl",
    "content": "#!/usr/bin/env perl\n#\n# Google Drive direct download of big files\n# ./gdown.pl 'gdrive file url' ['desired file name']\n#\n# v1.0 by circulosmeos 04-2014.\n# v1.1 by circulosmeos 01-2017.\n# http://circulosmeos.wordpress.com/2014/04/12/google-drive-direct-download-of-big-files\n# Distributed under GPL 3 (http://www.gnu.org/licenses/gpl-3.0.html)\n#\nuse strict;\n\nmy $TEMP='gdown.cookie.temp';\nmy $COMMAND;\nmy $confirm;\nmy $check;\nsub execute_command();\n\nmy $URL=shift;\ndie \"\\n./gdown.pl 'gdrive file url' [desired file name]\\n\\n\" if $URL eq '';\n\nmy $FILENAME=shift;\n$FILENAME='gdown' if $FILENAME eq '';\n\nif ($URL=~m#^https?://drive.google.com/file/d/([^/]+)#) {\n    $URL=\"https://docs.google.com/uc?id=$1&export=download\";\n}\n\nexecute_command();\n\nwhile (-s $FILENAME < 100000) { # only if the file isn't the download yet\n    open fFILENAME, '<', $FILENAME;\n    $check=0;\n    foreach (<fFILENAME>) {\n        if (/href=\"(\\/uc\\?export=download[^\"]+)/) {\n            $URL='https://docs.google.com'.$1;\n            $URL=~s/&amp;/&/g;\n            $confirm='';\n            $check=1;\n            last;\n        }\n        if (/confirm=([^;&]+)/) {\n            $confirm=$1;\n            $check=1;\n            last;\n        }\n        if (/\"downloadUrl\":\"([^\"]+)/) {\n            $URL=$1;\n            $URL=~s/\\\\u003d/=/g;\n            $URL=~s/\\\\u0026/&/g;\n            $confirm='';\n            $check=1;\n            last;\n        }\n    }\n    close fFILENAME;\n    die \"Couldn't download the file :-(\\n\" if ($check==0);\n    $URL=~s/confirm=([^;&]+)/confirm=$confirm/ if $confirm ne '';\n\n    execute_command();\n}\n\nunlink $TEMP;\n\nsub execute_command() {\n    $COMMAND=\"wget --no-check-certificate --load-cookie $TEMP --save-cookie $TEMP \\\"$URL\\\"\";\n    $COMMAND.=\" -O \\\"$FILENAME\\\"\" if $FILENAME ne '';\n    `$COMMAND`;\n    return 1;\n}\n"
  },
  {
    "path": "helloworld.c",
    "content": "#include \"stdio.h\"\n\n\nint main(){\n  printf(\"hello world\");\n  int a=10;\n  int b=20;\n  printf(\"%d\",a+b);\n}\n"
  },
  {
    "path": "index.md",
    "content": "---\n# Feel free to add content and custom Front Matter to this file.\n# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults\n\nlayout: home\n\n---\n\n<div style=\"text-align:center\"><img src =\"img\\safe2.jpg\" /></div>\n\nWhat is SAf(E)?\n-------------\n\n**SAFE** is a **S**elf-**A**ttentive neural network that takes as input a binary **F**unction and creates an **E**mbedding.\n\nWhat is an embedding?\n-------------\n An embedding is vector of real numbers. The nice feature of SAFE embeddings is that two similar binary functions should generate two embeddings\n that are close in the metric space. \n  \n<div style=\"text-align:center\"><img src =\"img\\metric.png\" /></div>\n \n I want to know all the details!\n-------------\nGood, read our paper on [arXiv](https://arxiv.org/abs/1811.05296).\n \nThe paper is slightly amusing! How do I get SAFE?\n------------- \nSAFE is available in our [GitHub](https://github.com/gadiluna/SAFE) repository. Keep in mind that SAFE has been developed as a research project. We only provide a minimal working proof-of-concept,\nwith the code and data to replicate our experiments. We are not responsible for any self-harm episode correlated with reading our (sometimes badly written) code.\n\nHow I can get involved with SAFE?\n------------- \nIf you are interested in this project write us an email. 
\n\n\n------------- \nSAFE has been designed and developed by:\n<div style=\"text-align:left\"><img src =\"img\\2.jpeg\" /></div>\n* [Luca Massarelli](https://scholar.google.it/citations?user=mJ_QjZIAAAAJ&hl=it) (development and research)\n<div style=\"text-align:left\"><img src =\"img\\1.jpeg\" /></div>\n* [Giuseppe Antonio Di Luna](https://scholar.google.it/citations?hl=it&user=RgAfuVgAAAAJ&view_op=list_works&sortby=pubdate) (development and research)\n<div style=\"text-align:left\"><img src =\"img\\3.jpeg\" /></div>\n* [Fabio Petroni](https://scholar.google.it/citations?user=vxQc2L4AAAAJ&hl=it) (development and research)\n<div style=\"text-align:left\"><img src =\"img\\4.jpeg\" /></div>\n* [Leonardo Querzoni](https://scholar.google.it/citations?user=-_WFIJIAAAAJ&hl=it) (research)\n<div style=\"text-align:left\"><img src =\"img\\5.jpeg\" /></div>\n* [Roberto Baldoni](https://scholar.google.it/citations?user=82tR6VoAAAAJ&hl=it) (research)\n\n\n\n\n#### **Acknowledgments**:\n We are in debt with  Google for providing free access to its cloud computing platform through the Education Program. Moreover, the authors would like to thank NVIDIA Corporation for partially supporting this work through the donation of a GPGPU card used during prototype development.\n This work is supported by a grant of the Italian Presidency of the Council of Ministers and by the CINI (Consorzio Interuniversitario Nazionale Informatica) National Laboratory of Cyber Security.\n Finally, we thank Davide Italiano for the insightful discussions. \n \nSAFE License.\n-------\n# SAFE TEAM\n# GPL 3 License http://www.gnu.org/licenses/"
  },
  {
    "path": "neural_network/PairFactory.py",
    "content": "# SAFE TEAM\n# distributed under license: GPL 3 License http://www.gnu.org/licenses/\nimport sqlite3\n\nimport json\nimport numpy as np\n\nfrom multiprocessing import Queue\nfrom multiprocessing import Process\nfrom asm_embedding.FunctionNormalizer import FunctionNormalizer\n\n#\n# PairFactory class, used for training the SAFE network.\n# This class generates the pairs for training, test and validation\n#\n#\n# Authors: SAFE team\n\n\nclass PairFactory:\n\n    def __init__(self, db_name, dataset_type, batch_size, max_instructions, shuffle=True):\n        self.db_name = db_name\n        self.dataset_type = dataset_type\n        self.max_instructions = max_instructions\n        self.batch_dim = 0\n        self.num_pairs = 0\n        self.num_batches = 0\n        self.batch_size = batch_size\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        q = cur.execute(\"SELECT true_pair from \" + self.dataset_type + \" WHERE id=?\", (0,))\n        self.num_pairs=len(json.loads(q.fetchone()[0]))*2\n        n_chunk = int(self.num_pairs / self.batch_size) - 1\n        conn.close()\n        self.num_batches = n_chunk\n        self.shuffle = shuffle\n\n    @staticmethod\n    def split( a, n):\n        return [a[i::n] for i in range(n)]\n\n    @staticmethod\n    def truncate_and_compute_lengths(pairs, max_instructions):\n        lenghts = []\n        new_pairs=[]\n        for x in pairs:\n            f0 = np.asarray(x[0][0:max_instructions])\n            f1 = np.asarray(x[1][0:max_instructions])\n            lenghts.append((f0.shape[0], f1.shape[0]))\n            if f0.shape[0] < max_instructions:\n                f0 = np.pad(f0, (0, max_instructions - f0.shape[0]), mode='constant')\n            if f1.shape[0] < max_instructions:\n                f1 = np.pad(f1, (0, max_instructions - f1.shape[0]), mode='constant')\n\n            new_pairs.append((f0, f1))\n        return new_pairs, lenghts\n\n    def async_chunker(self, epoch):\n\n     
   conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        query_string = \"SELECT true_pair,false_pair from {} where id=?\".format(self.dataset_type)\n        q = cur.execute(query_string, (int(epoch),))\n        true_pairs_id, false_pairs_id = q.fetchone()\n        true_pairs_id = json.loads(true_pairs_id)\n        false_pairs_id = json.loads(false_pairs_id)\n\n        assert len(true_pairs_id) == len(false_pairs_id)\n        data_len = len(true_pairs_id)\n\n        # print(\"Data Len: \" + str(data_len))\n        conn.close()\n\n        n_chunk = int(data_len / (self.batch_size / 2)) - 1\n        lista_chunk = range(0, n_chunk)\n        coda = Queue(maxsize=50)\n        n_proc = 8  # modify this to increase the parallelism of the db loading; in our tests 8-10 is the sweet spot on a 16-core machine with a K80\n        listone = PairFactory.split(lista_chunk, n_proc)\n\n        # this workaround is needed: multiprocessing.Pool behaves oddly once TF is loaded, so we spawn Processes directly.\n        for i in range(0, n_proc):\n            p = Process(target=self.async_create_couple, args=(epoch, listone[i], coda))\n            p.start()\n\n        for i in range(0, n_chunk):\n            yield self.async_get_dataset(coda)\n\n    def get_pair_fromdb(self, id_1, id_2):\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n        q0 = cur.execute(\"SELECT instructions_list FROM filtered_functions WHERE id=?\", (id_1,))\n        f0 = json.loads(q0.fetchone()[0])\n\n        q1 = cur.execute(\"SELECT instructions_list FROM filtered_functions WHERE id=?\", (id_2,))\n        f1 = json.loads(q1.fetchone()[0])\n        conn.close()\n        return f0, f1\n\n    def get_couple_from_db(self, epoch_number, chunk):\n\n        conn = sqlite3.connect(self.db_name)\n        cur = conn.cursor()\n\n        pairs = []\n        labels = []\n\n        q = cur.execute(\"SELECT true_pair, false_pair from \" + self.dataset_type + \" WHERE id=?\", (int(epoch_number),))\n        true_pairs_id, false_pairs_id = q.fetchone()\n\n        true_pairs_id = json.loads(true_pairs_id)\n        false_pairs_id = json.loads(false_pairs_id)\n        conn.close()\n        data_len = len(true_pairs_id)\n\n        i = 0\n\n        normalizer = FunctionNormalizer(self.max_instructions)\n\n        while i < self.batch_size:\n            # >= avoids an off-by-one IndexError when the index reaches data_len\n            if chunk * int(self.batch_size / 2) + i >= data_len:\n                break\n\n            p = true_pairs_id[chunk * int(self.batch_size / 2) + i]\n            f0, f1 = self.get_pair_fromdb(p[0], p[1])\n            pairs.append((f0, f1))\n            labels.append(+1)\n\n            p = false_pairs_id[chunk * int(self.batch_size / 2) + i]\n            f0, f1 = self.get_pair_fromdb(p[0], p[1])\n            pairs.append((f0, f1))\n            labels.append(-1)\n\n            i += 2\n\n        pairs, lengths = normalizer.normalize_function_pairs(pairs)\n\n        function1, function2 = zip(*pairs)\n        len1, len2 = zip(*lengths)\n        n_samples = len(pairs)\n\n        if self.shuffle:\n            shuffle_indices = np.random.permutation(np.arange(n_samples))\n            function1 = np.array(function1)[shuffle_indices]\n            function2 = np.array(function2)[shuffle_indices]\n            len1 = np.array(len1)[shuffle_indices]\n            len2 = np.array(len2)[shuffle_indices]\n            labels = np.array(labels)[shuffle_indices]\n        else:\n            function1 = np.array(function1)\n            function2 = np.array(function2)\n            len1 = np.array(len1)\n            len2 = np.array(len2)\n            labels = np.array(labels)\n\n        upper_bound = min(self.batch_size, n_samples)\n        len1 = len1[0:upper_bound]\n        len2 = len2[0:upper_bound]\n        function1 = function1[0:upper_bound]\n        function2 = function2[0:upper_bound]\n        y_ = labels[0:upper_bound]\n        return function1, function2, len1, len2, y_\n\n    def async_create_couple(self, epoch, chunk_list, q):\n        for i in chunk_list:\n            function1, function2, len1, len2, y_ = self.get_couple_from_db(epoch, i)\n            q.put((function1, function2, len1, len2, y_), block=True)\n\n    def async_get_dataset(self, q):\n\n        item = q.get()\n        function1 = item[0]\n        function2 = item[1]\n        len1 = item[2]\n        len2 = item[3]\n        y_ = item[4]\n\n        assert (len(function1) == len(y_))\n        n_samples = len(function1)\n        self.batch_dim = n_samples\n        # self.num_pairs += n_samples\n\n        return function1, function2, len1, len2, y_\n\n"
  },
  {
    "path": "neural_network/SAFEEmbedder.py",
    "content": "import tensorflow as tf\n# SAFE TEAM\n# distributed under license: GPL 3 License http://www.gnu.org/licenses/\n\nclass SAFEEmbedder:\n\n    def __init__(self, model_file):\n        self.model_file = model_file\n        self.session = None\n        self.x_1 = None\n        self.adj_1 = None\n        self.len_1 = None\n        self.emb = None\n\n    def loadmodel(self):\n        with tf.gfile.GFile(self.model_file, \"rb\") as f:\n            graph_def = tf.GraphDef()\n            graph_def.ParseFromString(f.read())\n\n        with tf.Graph().as_default() as graph:\n            tf.import_graph_def(graph_def)\n\n        sess = tf.Session(graph=graph)\n        self.session = sess\n\n        return sess\n\n    def get_tensor(self):\n        self.x_1 = self.session.graph.get_tensor_by_name(\"import/x_1:0\")\n        self.len_1 = self.session.graph.get_tensor_by_name(\"import/lengths_1:0\")\n        self.emb = tf.nn.l2_normalize(self.session.graph.get_tensor_by_name('import/Embedding1/dense/BiasAdd:0'), axis=1)\n\n    def embedd(self, nodi_input, lengths_input):\n\n        out_embedding= self.session.run(self.emb, feed_dict = {\n                                                    self.x_1: nodi_input,\n                                                    self.len_1: lengths_input})\n\n        return out_embedding\n"
  },
  {
    "path": "neural_network/SAFE_model.py",
    "content": "# SAFE TEAM\n# distributed under license: GPL 3 License http://www.gnu.org/licenses/\n\nfrom SiameseSAFE import SiameseSelfAttentive\nfrom PairFactory import PairFactory\nimport tensorflow as tf\nimport random\nimport sys, os\nimport numpy as np\nfrom sklearn import metrics\nimport matplotlib\nimport tqdm\n\nmatplotlib.use('Agg')\nimport matplotlib.pyplot as plt\n\n\nclass modelSAFE:\n\n    def __init__(self, flags, embedding_matrix):\n        self.embedding_size = flags.embedding_size\n        self.num_epochs = flags.num_epochs\n        self.learning_rate = flags.learning_rate\n        self.l2_reg_lambda = flags.l2_reg_lambda\n        self.num_checkpoints = flags.num_checkpoints\n        self.logdir = flags.logdir\n        self.logger = flags.logger\n        self.seed = flags.seed\n        self.batch_size = flags.batch_size\n        self.max_instructions = flags.max_instructions\n        self.embeddings_matrix = embedding_matrix\n        self.session = None\n        self.db_name = flags.db_name\n        self.trainable_embeddings = flags.trainable_embeddings\n        self.cross_val = flags.cross_val\n        self.attention_hops = flags.attention_hops\n        self.attention_depth = flags.attention_depth\n        self.dense_layer_size = flags.dense_layer_size\n        self.rnn_state_size = flags.rnn_state_size\n\n        random.seed(self.seed)\n        np.random.seed(self.seed)\n\n        print(self.db_name)\n\n    # loads an usable model\n    # returns the network and a tensorflow session in which the network can be used.\n    @staticmethod\n    def load_model(path):\n        session = tf.Session()\n        checkpoint_dir = os.path.abspath(os.path.join(path, \"checkpoints\"))\n        saver = tf.train.import_meta_graph(os.path.join(checkpoint_dir, \"model.meta\"))\n        tf.global_variables_initializer().run(session=session)\n        saver.restore(session, os.path.join(checkpoint_dir, \"model\"))\n        network = SiameseSelfAttentive(\n          
  rnn_state_size=1,\n            learning_rate=1,\n            l2_reg_lambda=1,\n            batch_size=1,\n            max_instructions=1,\n            embedding_matrix=1,\n            trainable_embeddings=1,\n            attention_hops=1,\n            attention_depth=1,\n            dense_layer_size=1,\n            embedding_size=1\n        )\n        network.restore_model(session)\n        return session, network\n\n    def create_network(self):\n        self.network = SiameseSelfAttentive(\n            rnn_state_size=self.rnn_state_size,\n            learning_rate=self.learning_rate,\n            l2_reg_lambda=self.l2_reg_lambda,\n            batch_size=self.batch_size,\n            max_instructions=self.max_instructions,\n            embedding_matrix=self.embeddings_matrix,\n            trainable_embeddings=self.trainable_embeddings,\n            attention_hops=self.attention_hops,\n            attention_depth=self.attention_depth,\n            dense_layer_size=self.dense_layer_size,\n            embedding_size=self.embedding_size\n        )\n\n    def train(self):\n        tf.reset_default_graph()\n        with tf.Graph().as_default() as g:\n            session_conf = tf.ConfigProto(\n                allow_soft_placement=True,\n                log_device_placement=False\n            )\n            sess = tf.Session(config=session_conf)\n\n            # Sets the graph-level random seed.\n            tf.set_random_seed(self.seed)\n\n            self.create_network()\n            self.network.generate_new_safe()\n            # --tbrtr\n\n            # Initialize all variables\n            sess.run(tf.global_variables_initializer())\n\n            # TensorBoard\n            # Summaries for loss and accuracy\n            loss_summary = tf.summary.scalar(\"loss\", self.network.loss)\n\n            # Train Summaries\n            train_summary_op = tf.summary.merge([loss_summary])\n            train_summary_dir = os.path.join(self.logdir, \"summaries\", \"train\")\n  
          train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)\n\n            # Validation summaries\n            val_summary_op = tf.summary.merge([loss_summary])\n            val_summary_dir = os.path.join(self.logdir, \"summaries\", \"validation\")\n            val_summary_writer = tf.summary.FileWriter(val_summary_dir, sess.graph)\n\n            # Test summaries\n            test_summary_op = tf.summary.merge([loss_summary])\n            test_summary_dir = os.path.join(self.logdir, \"summaries\", \"test\")\n            test_summary_writer = tf.summary.FileWriter(test_summary_dir, sess.graph)\n\n            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it\n            checkpoint_dir = os.path.abspath(os.path.join(self.logdir, \"checkpoints\"))\n            checkpoint_prefix = os.path.join(checkpoint_dir, \"model\")\n            if not os.path.exists(checkpoint_dir):\n                os.makedirs(checkpoint_dir)\n            saver = tf.train.Saver(tf.global_variables(), max_to_keep=self.num_checkpoints)\n\n            best_val_auc = 0\n            stat_file = open(str(self.logdir) + \"/epoch_stats.tsv\", \"w\")\n            stat_file.write(\"#epoch\\ttrain_loss\\tval_loss\\tval_auc\\ttest_loss\\ttest_auc\\n\")\n\n            p_train = PairFactory(self.db_name, 'train_pairs', self.batch_size, self.max_instructions)\n            p_validation = PairFactory(self.db_name, 'validation_pairs', self.batch_size, self.max_instructions, False)\n            p_test = PairFactory(self.db_name, 'test_pairs', self.batch_size, self.max_instructions, False)\n\n            step = 0\n            for epoch in range(0, self.num_epochs):\n                epoch_msg = \"\"\n                epoch_msg += \"  epoch: {}\\n\".format(epoch)\n\n                epoch_loss = 0\n\n                # ----------------------#\n                #         TRAIN\t       #\n                # ----------------------#\n                
n_batch = 0\n                for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm(\n                        p_train.async_chunker(epoch % 25), total=p_train.num_batches):\n                    feed_dict = {\n                        self.network.x_1: function1_batch,\n                        self.network.x_2: function2_batch,\n                        self.network.lengths_1: len1_batch,\n                        self.network.lengths_2: len2_batch,\n                        self.network.y: y_batch,\n                    }\n\n                    summaries, _, loss, norms, cs = sess.run(\n                        [train_summary_op, self.network.train_step, self.network.loss, self.network.norms,\n                         self.network.cos_similarity],\n                        feed_dict=feed_dict)\n\n                    train_summary_writer.add_summary(summaries, step)\n                    epoch_loss += loss * p_train.batch_dim  # ???\n                    step += 1\n                # recap epoch\n                epoch_loss /= p_train.num_pairs\n                epoch_msg += \"\\ttrain_loss: {}\\n\".format(epoch_loss)\n\n                # ----------------------#\n                #      VALIDATION\t   #\n                # ----------------------#\n                val_loss = 0\n                epoch_msg += \"\\n\"\n                val_y = []\n                val_pred = []\n                for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm(\n                        p_validation.async_chunker(0), total=p_validation.num_batches):\n                    feed_dict = {\n                        self.network.x_1: function1_batch,\n                        self.network.x_2: function2_batch,\n                        self.network.lengths_1: len1_batch,\n                        self.network.lengths_2: len2_batch,\n                        self.network.y: y_batch,\n                    }\n\n                    summaries, loss, similarities 
= sess.run(\n                        [val_summary_op, self.network.loss, self.network.cos_similarity], feed_dict=feed_dict)\n                    val_loss += loss * p_validation.batch_dim\n                    val_summary_writer.add_summary(summaries, step)\n                    val_y.extend(y_batch)\n                    val_pred.extend(similarities.tolist())\n\n                val_loss /= p_validation.num_pairs\n\n                if np.isnan(val_pred).any():\n                    print(\"Validation: careful, there are NaN values in some outputs; replacing them with zeros, but be aware...\")\n                    val_pred = np.nan_to_num(val_pred)\n\n                val_fpr, val_tpr, val_thresholds = metrics.roc_curve(val_y, val_pred, pos_label=1)\n                val_auc = metrics.auc(val_fpr, val_tpr)\n                epoch_msg += \"\\tval_loss : {}\\n\\tval_auc : {}\\n\".format(val_loss, val_auc)\n\n                sys.stdout.write(\n                    \"\\r\\tepoch {} / {}, loss {:g}, val_auc {:g}, norms {}\".format(epoch, self.num_epochs, epoch_loss,\n                                                                                  val_auc, norms))\n                sys.stdout.flush()\n\n                # execute the test only if the validation auc increased\n                test_loss = \"-\"\n                test_auc = \"-\"\n\n                # in case of cross validation we do not need to evaluate on a test split that is effectively missing\n                if val_auc > best_val_auc and self.cross_val:\n                    best_val_auc = val_auc\n                    saver.save(sess, checkpoint_prefix)\n                    print(\"\\nNEW BEST_VAL_AUC: {} !\\n\".format(best_val_auc))\n                    # write ROC raw data\n                    with open(str(self.logdir) + \"/best_val_roc.tsv\", \"w\") as the_file:\n                        the_file.write(\"#thresholds\\ttpr\\tfpr\\n\")\n                        for t, tpr, fpr in zip(val_thresholds, val_tpr, val_fpr):\n                            the_file.write(\"{}\\t{}\\t{}\\n\".format(t, tpr, fpr))\n\n                # in case we are not cross validating we expect to have a test split.\n                if val_auc > best_val_auc and not self.cross_val:\n\n                    best_val_auc = val_auc\n                    epoch_msg += \"\\tNEW BEST_VAL_AUC: {} !\\n\".format(best_val_auc)\n\n                    # save the best model\n                    saver.save(sess, checkpoint_prefix)\n\n                    # ----------------------#\n                    #         TEST  \t    #\n                    # ----------------------#\n\n                    # TEST\n                    test_loss = 0\n                    epoch_msg += \"\\n\"\n                    test_y = []\n                    test_pred = []\n\n                    for function1_batch, function2_batch, len1_batch, len2_batch, y_batch in tqdm.tqdm(\n                            p_test.async_chunker(0), total=p_test.num_batches):\n                        feed_dict = {\n                            self.network.x_1: function1_batch,\n                            self.network.x_2: function2_batch,\n                            self.network.lengths_1: len1_batch,\n                            self.network.lengths_2: len2_batch,\n                            self.network.y: y_batch,\n                        }\n                        summaries, loss, similarities = sess.run(\n                            [test_summary_op, self.network.loss, self.network.cos_similarity], feed_dict=feed_dict)\n                        test_loss += loss * p_test.batch_dim\n                        test_summary_writer.add_summary(summaries, step)\n                        test_y.extend(y_batch)\n                        test_pred.extend(similarities.tolist())\n\n                    test_loss /= p_test.num_pairs\n                    if np.isnan(test_pred).any():\n                        print(\"Test: careful, there are NaN values in some outputs; replacing them with zeros, but be aware...\")\n                        test_pred = np.nan_to_num(test_pred)\n\n                    test_fpr, test_tpr, test_thresholds = metrics.roc_curve(test_y, test_pred, pos_label=1)\n\n                    # write ROC raw data\n                    with open(str(self.logdir) + \"/best_test_roc.tsv\", \"w\") as the_file:\n                        the_file.write(\"#thresholds\\ttpr\\tfpr\\n\")\n                        for t, tpr, fpr in zip(test_thresholds, test_tpr, test_fpr):\n                            the_file.write(\"{}\\t{}\\t{}\\n\".format(t, tpr, fpr))\n\n                    test_auc = metrics.auc(test_fpr, test_tpr)\n                    epoch_msg += \"\\ttest_loss : {}\\n\\ttest_auc : {}\\n\".format(test_loss, test_auc)\n                    fig = plt.figure()\n                    plt.title('Receiver Operating Characteristic')\n                    plt.plot(test_fpr, test_tpr, 'b',\n                             label='AUC = %0.2f' % test_auc)\n                    fig.savefig(str(self.logdir) + \"/best_test_roc.png\")\n                    print(\n                        \"\\nNEW BEST_VAL_AUC: {} !\\n\\ttest_loss : {}\\n\\ttest_auc : {}\\n\".format(best_val_auc, test_loss,\n                                                                                               test_auc))\n                    plt.close(fig)\n\n                stat_file.write(\n                    \"{}\\t{}\\t{}\\t{}\\t{}\\t{}\\n\".format(epoch, epoch_loss, val_loss, val_auc, test_loss, test_auc))\n                self.logger.info(\"\\n{}\\n\".format(epoch_msg))\n            stat_file.close()\n            sess.close()\n            return best_val_auc\n"
  },
  {
    "path": "neural_network/SiameseSAFE.py",
    "content": "import tensorflow as tf\n# SAFE TEAM\n#\n#\n# distributed under license: CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt)\n#\n\n# Siamese Self-Attentive Network for Binary Similarity:\n#\n# arXiv Nostro.\n#\n# based on the self attentive network:arXiv:1703.03130  Z. Lin at al. “A structured self-attentive sentence embedding''\n#\n# Authors: SAFE team\n\nclass SiameseSelfAttentive:\n\n    def __init__(self,\n                 rnn_state_size,  # Dimension of the RNN State\n                 learning_rate,  # Learning rate\n                 l2_reg_lambda,\n                 batch_size,\n                 max_instructions,\n                 embedding_matrix,  # Matrix containg the embeddings for each asm instruction\n                 trainable_embeddings,\n                 # if this value is True, the embeddings of the asm instruction are modified by the training.\n                 attention_hops,  # attention hops parameter r of [1]\n                 attention_depth,  # attention detph parameter d_a of [1]\n                 dense_layer_size,  # parameter e of [1]\n                 embedding_size,  # size of the final function embedding, in our test this is twice the rnn_state_size\n                 ):\n        self.rnn_depth = 1  # if this value is modified then the RNN becames a multilayer network. 
In our tests we fixed it to 1; feel free to be adventurous.\n        self.learning_rate = learning_rate\n        self.l2_reg_lambda = l2_reg_lambda\n        self.rnn_state_size = rnn_state_size\n        self.batch_size = batch_size\n        self.max_instructions = max_instructions\n        self.embedding_matrix = embedding_matrix\n        self.trainable_embeddings = trainable_embeddings\n        self.attention_hops = attention_hops\n        self.attention_depth = attention_depth\n        self.dense_layer_size = dense_layer_size\n        self.embedding_size = embedding_size\n\n        # self.generate_new_safe()\n\n    def restore_model(self, old_session):\n        graph = old_session.graph\n\n        self.x_1 = graph.get_tensor_by_name(\"x_1:0\")\n        self.x_2 = graph.get_tensor_by_name(\"x_2:0\")\n        self.len_1 = graph.get_tensor_by_name(\"lengths_1:0\")\n        self.len_2 = graph.get_tensor_by_name(\"lengths_2:0\")\n        self.y = graph.get_tensor_by_name('y_:0')\n        self.cos_similarity = graph.get_tensor_by_name(\"siamese_layer/cosSimilarity:0\")\n        self.loss = graph.get_tensor_by_name(\"Loss/loss:0\")\n        self.train_step = graph.get_operation_by_name(\"Train_Step/Adam\")\n\n        return\n\n    def self_attentive_network(self, input_x, lengths):\n        # each function is a list of embedding ids (an id is an index in the embedding matrix);\n        # with this we transform it into a list of embedding vectors.\n        embedded_functions = tf.nn.embedding_lookup(self.instructions_embeddings_t, input_x)\n\n        # We create the GRU RNN\n        (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(self.cell_fw, self.cell_bw, embedded_functions,\n                                                                    sequence_length=lengths, dtype=tf.float32,\n                                                                    time_major=False)\n\n        # We create the matrix H\n        H = tf.concat([output_fw, output_bw], 
axis=2)\n\n        # We do a tile to account for training batches\n        ws1_tiled = tf.tile(tf.expand_dims(self.WS1, 0), [tf.shape(H)[0], 1, 1], name=\"WS1_tiled\")\n        ws2_tiled = tf.tile(tf.expand_dims(self.WS2, 0), [tf.shape(H)[0], 1, 1], name=\"WS2_tiled\")\n\n        # we compute the matrix A\n        self.A = tf.nn.softmax(tf.matmul(ws2_tiled, tf.nn.tanh(tf.matmul(ws1_tiled, tf.transpose(H, perm=[0, 2, 1])))),\n                               name=\"Attention_Matrix\")\n        # embedding matrix M\n        M = tf.identity(tf.matmul(self.A, H), name=\"Attention_Embedding\")\n\n        # we create the flattened version of M\n        flattened_M = tf.reshape(M, [tf.shape(M)[0], self.attention_hops * self.rnn_state_size * 2])\n\n        return flattened_M\n\n    def generate_new_safe(self):\n        self.instructions_embeddings_t = tf.Variable(initial_value=tf.constant(self.embedding_matrix),\n                                                     trainable=self.trainable_embeddings,\n                                                     name=\"instructions_embeddings\", dtype=tf.float32)\n\n        self.x_1 = tf.placeholder(tf.int32, [None, self.max_instructions],\n                                  name=\"x_1\")  # List of instructions for Function 1\n        self.lengths_1 = tf.placeholder(tf.int32, [None], name='lengths_1')  # List of lengths for Function 1\n        # example  x_1=[[mov,add,padding,padding],[mov,mov,mov,padding]]\n        # lengths_1=[2,3]\n\n        self.x_2 = tf.placeholder(tf.int32, [None, self.max_instructions],\n                                  name=\"x_2\")  # List of instructions for Function 2\n        self.lengths_2 = tf.placeholder(tf.int32, [None], name='lengths_2')  # List of lengths for Function 2\n        self.y = tf.placeholder(tf.float32, [None], name='y_')  # Real label of the pairs, +1 similar, -1 dissimilar.\n\n        # Euclidean norms; p = 2\n        self.norms = []\n\n        # Keeping track of l2 regularization loss 
(optional)\n        l2_loss = tf.constant(0.0)\n\n        with tf.name_scope('parameters_Attention'):\n            self.WS1 = tf.Variable(tf.truncated_normal([self.attention_depth, 2 * self.rnn_state_size], stddev=0.1),\n                                   name=\"WS1\")\n            self.WS2 = tf.Variable(tf.truncated_normal([self.attention_hops, self.attention_depth], stddev=0.1),\n                                   name=\"WS2\")\n\n            rnn_layers_fw = [tf.nn.rnn_cell.GRUCell(size) for size in ([self.rnn_state_size] * self.rnn_depth)]\n            rnn_layers_bw = [tf.nn.rnn_cell.GRUCell(size) for size in ([self.rnn_state_size] * self.rnn_depth)]\n\n            self.cell_fw = tf.nn.rnn_cell.MultiRNNCell(rnn_layers_fw)\n            self.cell_bw = tf.nn.rnn_cell.MultiRNNCell(rnn_layers_bw)\n\n        with tf.name_scope('Self-Attentive1'):\n            self.function_1 = self.self_attentive_network(self.x_1, self.lengths_1)\n        with tf.name_scope('Self-Attentive2'):\n            self.function_2 = self.self_attentive_network(self.x_2, self.lengths_2)\n\n        self.dense_1 = tf.nn.relu(tf.layers.dense(self.function_1, self.dense_layer_size))\n        self.dense_2 = tf.nn.relu(tf.layers.dense(self.function_2, self.dense_layer_size))\n\n        with tf.name_scope('Embedding1'):\n            self.function_embedding_1 = tf.layers.dense(self.dense_1, self.embedding_size)\n        with tf.name_scope('Embedding2'):\n            self.function_embedding_2 = tf.layers.dense(self.dense_2, self.embedding_size)\n\n        with tf.name_scope('siamese_layer'):\n            self.cos_similarity = tf.reduce_sum(tf.multiply(self.function_embedding_1, self.function_embedding_2),\n                                                axis=1,\n                                                name=\"cosSimilarity\")\n\n        # Calculate the squared-error loss plus the attention penalization term\n        with tf.name_scope(\"Loss\"):\n            A_square = tf.matmul(self.A, tf.transpose(self.A, perm=[0, 2, 1]))\n\n            I = tf.eye(tf.shape(A_square)[1])\n            I_tiled = tf.tile(tf.expand_dims(I, 0), [tf.shape(A_square)[0], 1, 1], name=\"I_tiled\")\n            self.A_pen = tf.norm(A_square - I_tiled)\n\n            self.loss = tf.reduce_sum(tf.squared_difference(self.cos_similarity, self.y), name=\"loss\")\n            self.regularized_loss = self.loss + self.l2_reg_lambda * l2_loss + self.A_pen\n\n        # Train step\n        with tf.name_scope(\"Train_Step\"):\n            self.train_step = tf.train.AdamOptimizer(self.learning_rate).minimize(self.regularized_loss)\n"
  },
  {
    "path": "neural_network/__init__.py",
    "content": ""
  },
  {
    "path": "neural_network/freeze_graph.sh",
    "content": "#!/bin/sh\n\necho \"usage: ./freeze_graph.sh MODEL_DIR FREEZED_NAME\"\n\n# $0 is the script name; positional arguments start at $1\nMODEL_DIR=$1\nFREEZED_NAME=$2\n\n# Continuation backslashes keep this a single freeze_graph invocation\nfreeze_graph --input_meta_graph \"$MODEL_DIR/checkpoints/model.meta\" \\\n             --output_graph \"$FREEZED_NAME\" \\\n             --output_node_names Embedding1/dense/BiasAdd \\\n             --input_binary \\\n             --input_checkpoint \"$MODEL_DIR/checkpoints/model\"\n"
  },
  {
    "path": "neural_network/parameters.py",
    "content": "# SAFE TEAM\n# distributed under license: GPL 3 License http://www.gnu.org/licenses/\n\nimport argparse\nimport time\nimport sys, os\nimport logging\n\n\n#\n# Parameters File for the SAFE network.\n#\n# Authors: SAFE team\n\n\ndef getLogger(logfile):\n    logger = logging.getLogger(__name__)\n    hdlr = logging.FileHandler(logfile)\n    formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')\n    hdlr.setFormatter(formatter)\n    logger.addHandler(hdlr)\n    logger.setLevel(logging.INFO)\n    return logger, hdlr\n\n\nclass Flags:\n\n    def __init__(self):\n        parser = argparse.ArgumentParser(description='SAFE')\n\n        parser.add_argument(\"-o\", \"--output\", dest=\"output_file\", help=\"output directory for logging and models\",\n                            required=False)\n        parser.add_argument(\"-e\", \"--embedder\", dest=\"embedder_folder\",\n                            help=\"file with the embedding matrix and dictionary for asm instructions\", required=False)\n        parser.add_argument(\"-n\", \"--dbName\", dest=\"db_name\", help=\"Name of the database\", required=False)\n        parser.add_argument(\"-ld\", \"--load_dir\", dest=\"load_dir\", help=\"Load the model from directory load_dir\",\n                            required=False)\n        parser.add_argument(\"-r\", \"--random\", help=\"if present the network uses a random embedder\", default=False,\n                            action=\"store_true\", dest=\"random_embedding\", required=False)\n        parser.add_argument(\"-te\", \"--trainable_embedding\",\n                            help=\"if present the network considers the embeddings trainable\", action=\"store_true\",\n                            dest=\"trainable_embeddings\", default=False)\n        parser.add_argument(\"-cv\", \"--cross_val\", help=\"if present the training is done with cross validation\",\n                            default=False, action=\"store_true\", dest=\"cross_val\")\n\n        args = parser.parse_args()\n\n        # mode = mean_field\n        self.batch_size = 250  # minibatch size (-1 = whole dataset)\n        self.num_epochs = 50  # number of epochs\n        self.embedding_size = 100  # dimension of the function embedding\n        self.learning_rate = 0.001  # init learning_rate\n        self.l2_reg_lambda = 0  # regularization coefficient (e.g. 0.002)\n        self.num_checkpoints = 1  # max number of checkpoints\n        self.out_dir = args.output_file  # directory for logging\n        self.rnn_state_size = 50  # dimension of the rnn state\n        self.db_name = args.db_name\n        self.load_dir = str(args.load_dir)\n        self.random_embedding = args.random_embedding\n        self.trainable_embeddings = args.trainable_embeddings\n        self.cross_val = args.cross_val\n        self.cross_val_fold = 5\n\n        #\n        ## RNN PARAMETERS, these parameters are only used for the RNN model.\n        #\n        self.rnn_depth = 1  # depth of the rnn\n        self.max_instructions = 150  # number of instructions\n\n        ## ATTENTION PARAMETERS\n        self.attention_hops = 10\n        self.attention_depth = 250\n\n        # RNN SINGLE PARAMETER\n        self.dense_layer_size = 2000\n\n        self.seed = 2  # random seed\n\n        # create logdir and logger\n        self.reset_logdir()\n\n        self.embedder_folder = args.embedder_folder\n\n    def reset_logdir(self):\n        # create logdir\n        timestamp = str(int(time.time()))\n        self.logdir = os.path.abspath(os.path.join(self.out_dir, \"runs\", timestamp))\n        os.makedirs(self.logdir, exist_ok=True)\n\n        # create logger\n        self.log_file = str(self.logdir) + '/console.log'\n        self.logger, self.hdlr = getLogger(self.log_file)\n\n        # create symlink for last_run\n        sym_path_logdir = str(self.out_dir) + \"/last_run\"\n        try:\n            os.unlink(sym_path_logdir)\n        except FileNotFoundError:\n            pass\n        try:\n            os.symlink(self.logdir, sym_path_logdir)\n        except OSError:\n            print(\"\\nfailed to create symlink!\\n\")\n\n    def close_log(self):\n        self.hdlr.close()\n        self.logger.removeHandler(self.hdlr)\n        handlers = self.logger.handlers[:]\n        for handler in handlers:\n            handler.close()\n            self.logger.removeHandler(handler)\n\n    def __str__(self):\n        msg = \"\"\n        msg += \"\\nParameters:\\n\"\n        msg += \"\\tRandom embedding: {}\\n\".format(self.random_embedding)\n        msg += \"\\tTrainable embedding: {}\\n\".format(self.trainable_embeddings)\n        msg += \"\\tlogdir: {}\\n\".format(self.logdir)\n        msg += \"\\tbatch_size: {}\\n\".format(self.batch_size)\n        msg += \"\\tnum_epochs: {}\\n\".format(self.num_epochs)\n        msg += \"\\tembedding_size: {}\\n\".format(self.embedding_size)\n        msg += \"\\trnn_state_size: {}\\n\".format(self.rnn_state_size)\n        msg += \"\\tattention depth: {}\\n\".format(self.attention_depth)\n        msg += \"\\tattention hops: {}\\n\".format(self.attention_hops)\n        msg += \"\\tdense layer size: {}\\n\".format(self.dense_layer_size)\n\n        msg += \"\\tlearning_rate: {}\\n\".format(self.learning_rate)\n        msg += \"\\tl2_reg_lambda: {}\\n\".format(self.l2_reg_lambda)\n        msg += \"\\tnum_checkpoints: {}\\n\".format(self.num_checkpoints)\n\n        msg += \"\\tseed: {}\\n\".format(self.seed)\n        msg += \"\\tMax instructions per function: {}\\n\".format(self.max_instructions)\n        return msg\n"
  },
  {
    "path": "neural_network/train.py",
    "content": "from SAFE_model import modelSAFE\nfrom parameters import Flags\nimport sys\nimport os\nimport numpy as np\nfrom utils import utils\nimport traceback\n\n\ndef load_embedding_matrix(embedder_folder):\n    matrix_file='embedding_matrix.npy'\n    matrix_path=os.path.join(embedder_folder,matrix_file)\n    if os.path.isfile(matrix_path):\n        try:\n            print('Loading embedding matrix....')\n            with open(matrix_path,'rb') as f:\n                return np.float32(np.load(f))\n        except Exception as e:\n            print(\"Exception handling file: \"+str(matrix_path))\n            print(\"Embedding matrix cannot be loaded\")\n            print(str(e))\n            sys.exit(-1)\n\n    else:\n        print('Embedding matrix not found at path: '+str(matrix_path))\n        sys.exit(-1)\n\n\ndef run_test():\n    flags = Flags()\n    flags.logger.info(\"\\n{}\\n\".format(flags))\n\n    print(str(flags))\n\n    embedding_matrix = load_embedding_matrix(flags.embedder_folder)\n    if flags.random_embedding:\n        embedding_matrix = np.random.rand(*np.shape(embedding_matrix)).astype(np.float32)\n        embedding_matrix[0, :] = np.zeros(np.shape(embedding_matrix)[1]).astype(np.float32)\n\n    if flags.cross_val:\n        print(\"STARTING CROSS VALIDATION\")\n        res = []\n        mean = 0\n        for i in range(0, flags.cross_val_fold):\n            print(\"CROSS VALIDATION STARTING FOLD: \" + str(i))\n            if i > 0:\n                flags.close_log()\n                flags.reset_logdir()\n                del flags\n                flags = Flags()\n                flags.logger.info(\"\\n{}\\n\".format(flags))\n\n            flags.logger.info(\"Starting cross validation fold: {}\".format(i))\n\n            flags.db_name = flags.db_name + \"_val_\" + str(i+1) + \".db\"\n            flags.logger.info(\"Cross validation db name: {}\".format(flags.db_name))\n\n            trainer = modelSAFE(flags, embedding_matrix)\n            best_val_auc = trainer.train()\n\n            mean += best_val_auc\n            res.append(best_val_auc)\n\n            flags.logger.info(\"Cross validation fold {} finished best auc: {}\".format(i, best_val_auc))\n            print(\"FINISH FOLD: \" + str(i) + \" BEST VAL AUC: \" + str(best_val_auc))\n\n        print(\"CROSS VALIDATION ENDED\")\n        print(\"Result: \" + str(res))\n        print(\"\")\n\n        flags.logger.info(\"Cross validation finished results: {}\".format(res))\n        flags.logger.info(\" mean: {}\".format(mean / flags.cross_val_fold))\n        flags.close_log()\n\n    else:\n        trainer = modelSAFE(flags, embedding_matrix)\n        trainer.train()\n        flags.close_log()\n\n\nif __name__ == '__main__':\n    utils.print_safe()\n    print('-Trainer for SAFE-')\n    run_test()\n"
  },
  {
    "path": "neural_network/train.sh",
    "content": "#!/bin/sh\n\n\nBASE_PATH=\"/home/luca/work/binary_similarity_data/\"\n\nDATA_PATH=$BASE_PATH/experiments/arith_mean_openSSL_no_dropout_no_shuffle_no_regeneration_emb_random_trainable\nOUT_PATH=$DATA_PATH/out\n\nDB_PATH=$BASE_PATH/databases/openSSL_data.db\n\nEMBEDDER=$BASE_PATH/word2vec/filtered_100_embeddings/\n\n# Set to \"-r\" / \"-te\" to enable random / trainable embeddings.\n# (Named RANDOM_EMBEDDING because RANDOM is a special variable in bash.)\nRANDOM_EMBEDDING=\"\"\nTRAINABLE_EMBEDD=\"\"\n\npython3 train.py $RANDOM_EMBEDDING $TRAINABLE_EMBEDD -o $OUT_PATH -n $DB_PATH -e $EMBEDDER\n"
  },
  {
    "path": "requirements.txt",
    "content": "tensorflow<2.0  # the model uses TF 1.x APIs (tf.layers, tf.nn.rnn_cell)\nscikit-learn\nnumpy\nscipy\nmatplotlib\ntqdm\nr2pipe\npyfiglet\n"
  },
  {
    "path": "safe.py",
    "content": "# SAFE TEAM\n# Copyright (C) 2019  Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni\n\n\nfrom asm_embedding.FunctionAnalyzerRadare import RadareFunctionAnalyzer\nfrom argparse import ArgumentParser\nfrom asm_embedding.FunctionNormalizer import FunctionNormalizer\nfrom asm_embedding.InstructionsConverter import InstructionsConverter\nfrom neural_network.SAFEEmbedder import SAFEEmbedder\nfrom utils import utils\n\nclass SAFE:\n\n    def __init__(self, model):\n        self.converter = InstructionsConverter(\"data/i2v/word2id.json\")\n        self.normalizer = FunctionNormalizer(max_instruction=150)\n        self.embedder = SAFEEmbedder(model)\n        self.embedder.loadmodel()\n        self.embedder.get_tensor()\n\n    def embedd_function(self, filename, address):\n        analyzer = RadareFunctionAnalyzer(filename, use_symbol=False, depth=0)\n        functions = analyzer.analyze()\n        instructions_list = None\n        for function in functions:\n            if functions[function]['address'] == address:\n                instructions_list = functions[function]['filtered_instructions']\n                break\n        if instructions_list is None:\n            print(\"Function not found\")\n            return None\n        converted_instructions = self.converter.convert_to_ids(instructions_list)\n        instructions, length = self.normalizer.normalize_functions([converted_instructions])\n        embedding = self.embedder.embedd(instructions, length)\n        return embedding\n\n\nif __name__ == '__main__':\n\n    utils.print_safe()\n\n    parser = ArgumentParser(description=\"Safe Embedder\")\n\n    parser.add_argument(\"-m\", \"--model\",   help=\"Safe trained model to generate function embeddings\", required=True)\n    parser.add_argument(\"-i\", \"--input\",   help=\"Input executable that contains the function to embedd\", required=True)\n    parser.add_argument(\"-a\", \"--address\", help=\"Hexadecimal address of the function to embedd\", required=True)\n\n    args = parser.parse_args()\n\n    address = int(args.address, 16)\n    safe = SAFE(args.model)\n    embedding = safe.embedd_function(args.input, address)\n    if embedding is not None:\n        print(embedding[0])\n"
  },
  {
    "path": "utils/__init__.py",
    "content": ""
  },
  {
    "path": "utils/utils.py",
    "content": "from pyfiglet import figlet_format\n\n\ndef print_safe():\n    a = figlet_format('SAFE', font='starwars')\n    print(a)\n    print(\"By Massarelli L., Di Luna G. A., Petroni F., Querzoni L., Baldoni R.\")\n    print(\"Please cite: http://arxiv.org/abs/1811.05296 \\n\")"
  }
]