Repository: nlitsme/pyidbutil
Branch: master
Commit: e77d0e79e5c1
Files: 9
Total size: 114.7 KB

Directory structure:
gitextract_dtt79ccf/
├── LICENSE
├── README.md
├── idaunpack.py
├── idblib.py
├── idbtool.py
├── setup.cfg
├── test_idblib.py
├── tree-walking.py
└── tstbs.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Willem Hengeveld

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================
FILE: README.md
================================================
IDBTOOL
=======

A tool for extracting information from IDA databases.
`idbtool` knows how to handle databases from all IDA versions since v2.0, both `i64` and `idb` files.
You can also use `idbtool` to recover information from unclosed databases.

`idbtool` works without change with IDA v7.0.
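
As a rough illustration of how `idb` and `i64` files can be told apart: `idblib.py` in this repository accepts the file magics `IDA0`, `IDA1` and `IDA2`, and treats `IDA2` as a 64 bit database. The sketch below mirrors only that check. The name `wordsize_from_magic` is hypothetical and is not part of `idblib`, and the in-memory buffers are stand-ins for real database files.

```python
import io

# Magic values accepted by idblib.IDBFile, mapped to the word size that
# idblib.ID0File derives from them ('IDA2' marks an .i64 database).
MAGIC_WORDSIZE = {b"IDA0": 4, b"IDA1": 4, b"IDA2": 8}


def wordsize_from_magic(fh):
    """Return 4 or 8 based on the first four bytes, or None for other files.

    Hypothetical helper for illustration only, not part of idblib.
    """
    fh.seek(0)
    return MAGIC_WORDSIZE.get(fh.read(4))


if __name__ == "__main__":
    # In-memory stand-ins for the start of a packed .i64 and .idb file.
    print(wordsize_from_magic(io.BytesIO(b"IDA2" + b"\x00" * 12)))
    print(wordsize_from_magic(io.BytesIO(b"IDA1" + b"\x00" * 12)))
```

This only applies to packed database files. For an unpacked `.id0` file there is no such magic, and `idblib` instead looks at the size of the `$ MAX NODE` value.
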
Much faster than loading a file in IDA
--------------------------------------

With idbtool you can search thousands of .idb files in seconds.

More precisely: on my laptop it takes:

* 1.5 seconds to extract 143 idc scripts from 119 idb and i64 files.
* 3.8 seconds to print idb info for 441 files.
* 5.6 seconds to extract 281 enums containing 4726 members from 35 files.
* 67.8 seconds to extract 5942 structs containing 33672 members from 265 files.

Loading an approximately 5 Gbyte idb file in IDA takes about 45 minutes,
while idb3.h takes basically no time at all, no more than a few milliseconds.

Download
========

Two versions of this tool exist:

One written in python
* https://github.com/nlitsme/pyidbutil

One written in C++
* https://github.com/nlitsme/idbutil

Both repositories contain a library which can be used for reading `.idb` or `.i64` files.

Usage
=====

Usage:

    idbtool [options] [database file(s)]

* `-n` or `--names` will list all named values in the database.
* `-s` or `--scripts` will list all scripts stored in the database.
* `-u` or `--structs` will list all structs stored in the database.
* `-e` or `--enums` will list all enums stored in the database.
* `--imports` will list all imported symbols from the database.
* `--funcdirs` will list function folders stored in the database.
* `-i` or `--info` will print some general info about the database.
* `-d` or `--pagedump` dump btree page tree contents.
* `--inc`, `--dec` list all records in ascending / descending order.
* `-q` or `--query` search specific records in the database.
* `-m` or `--limit` limit the number of results returned by `-q`.
* `-id0`, `-id1` dump only one specific section.
* `--i64`, `--i32` tell idbtool that the specified file is from a 64 or 32 bit database.
* `--recover` group files from an unpacked database.
* `--classify` summarizes node usage in the database.
* `--dump` hexdump the original binary data.

query
-----

Queries need to be specified last on the commandline.

example:

    idbtool [database file(s)]  --query  "Root Node;V"

This will list the source binary for all the databases specified on the commandline.

A query is a string with the following format:

* [==,<=,>=,<,>] - optional relation, default: ==
* a base node key:
    * a DOT followed by the numeric value of the nodeid.
    * a HASH followed by the numeric value of the system-nodeid.
    * a QUESTION followed by the name of the node. -> a 'N'ame node
    * the name of the node. -> the name is resolved, results in a '.'Dot node
* an optional tag ( A for Alt, S for Supval, etc )
* an optional index value

example queries:

* `Root Node;V` -> prints record containing the source binary name
* `?Root Node` -> prints the Name record pointing to the root
* `>Root Node` -> prints the first 10 records starting with the root node id.
* `<Root Node` -> prints the 10 records starting with the records before the rootnode.
* `.0xff000001;N` -> prints the rootnode name entry.
* `#1;N` -> prints the rootnode name entry.

List the highest node and following record in the database in two different ways.
The first: starting at the first record below `ffc00000`, and listing the next.
The second: starting at the first record after `ffc00000`, and listing the previous:

* `--query "<#0xc00000" --limit 2 --inc -v`
* `--query ">#0xc00000" --limit 2 --dec -v`

Note that this should be the nodeid in the `$ MAX NODE` record.

List the last two records:

* `--limit 2 --dec -v`

List the first two records, the `$ MAX LINK` and `$ MAX NODE` records:

* `--limit 2 --inc -v`

A full database dump
--------------------

Several methods exist for printing all records in the database. This may be useful if you want
to investigate more of IDA's internals. It can also be useful in recovering data from corrupted databases.

* `--inc`, `--dec` can be used to enumerate all b-tree records in either forward, or backward direction.
* add `-v` to get a prettier key/value output
* `--id0` walks the page tree, instead of the record tree, printing the contents of each page
* `--pagedump` linearly skips through the file; this will also reveal information in deleted pages.

naked files
===========

When IDA or your computer crashed while working on a disassembly, and you did not yet save the database,
you are left with a couple of files with extensions like `.id0`, `.id1`, `.nam`, etc.

These files are the unpacked database, I call them `naked` files.

Using the `--filetype` and `--i64` or `--i32` options you can inspect these `naked` files individually.
Or use the `--recover` option to view them as a complete database together.
`idbtool` will figure out automatically which files would belong together.

`idbtool` can figure out the bitsize of the database from an `.id0` file, but not (yet) from the others.

LIBRARY
=======

The file `idblib.py` contains a library.

TODO
====

* add option to list all comments stored in the database
* add option to list flags for a list of addresses.

Author
======

Willem Hengeveld

================================================
FILE: idaunpack.py
================================================
"""
`idaunpack` is a tool to aid in decoding packed data structures from an
IDA idb or i64 database.
"""
from __future__ import print_function, division
import struct
import re
import sys
from binascii import a2b_hex, b2a_hex
from idblib import IdaUnpacker


def dump_packed(data, wordsize, pattern):
    p = IdaUnpacker(wordsize, data)
    if pattern:
        for c in pattern:
            if p.eof():
                print("EOF")
                break
            if c == 'H':
                val = p.next16()
                fmt = "%04x"
            elif c == 'L':
                val = p.next32()
                fmt = "%08x"
            elif c == 'Q':
                val = p.next64()
                fmt = "%016x"
            elif c == 'W':
                val = p.nextword()
                if wordsize==4:
                    fmt = "[%08x]"
                else:
                    fmt = "[%016x]"
            else:
                raise Exception("unknown pattern: %s" % c)
            print(fmt % val, end=" ")

    while not p.eof():
        val = p.next32()
        print("%08x" % val, end=" ")
    print()


def unhex(hextxt):
    return a2b_hex(re.sub(r'\W+', '', hextxt, flags=re.DOTALL))


def main():
    import argparse
    parser = argparse.ArgumentParser(description='idaunpack')
    parser.add_argument('--verbose', '-v', action='store_true')
    parser.add_argument('--debug', action='store_true', help='abort on exceptions.')
    parser.add_argument('--pattern', '-p', type=str, help='unpack pattern: sequence of H, L, Q, W')
    parser.add_argument('-4', '-3', '-32', const=4, dest='wordsize', action='store_const', help='use 32 bit words')
    parser.add_argument('-8', '-6', '-64', const=8, dest='wordsize', action='store_const', help='use 64 bit words')
    parser.add_argument('--wordsize', '-w', type=int, help='specify wordsize')
    parser.add_argument('hexconsts', nargs='*', type=str)

    args = parser.parse_args()
    if args.wordsize is None:
        args.wordsize = 4

    for x in args.hexconsts:
        dump_packed(unhex(x), args.wordsize, args.pattern)


if __name__ == '__main__':
    main()

================================================
FILE: idblib.py
================================================
"""
idblib - a module for reading hex-rays Interactive DisAssembler databases

Supports database versions starting with IDA v2.0

IDA v1.x is not supported, that was an entirely different file format.
IDA v2.x databases are organised as several files, in a directory IDA v3.x databases are bundled into .idb files IDA v4 .. v6 various improvements, like databases larger than 4Gig, and 64 bit support. Copyright (c) 2016 Willem Hengeveld An IDB file can contain up to 6 sections: id0 the main database id1 contains flags for each byte - what is returned by idc.GetFlags(ea) nam contains a list of addresses of named items seg .. only in older databases til type info id2 ? The id0 database is a simple key/value database, much like leveldb types of records: Some bookkeeping: "$ MAX NODE" -> the highest numbered node value in use. A list of names: "N" + name -> the node id for that name. names are both user/disassembler symbols assigned to addresses in the disassembled code, and IDA internals, like lists of items, For example: '$ structs', or 'Root Node'. The main part: "." + nodeid + tag + index This maps directly onto the idasdk netnode interface. The size of the nodeid and index is 32bits for .idb files and 64 bits for .i64 files. The nodeid and index are encoded as bigendian numbers in the key, and as little endian numbers in (most of) the values. """ from __future__ import division, print_function, absolute_import, unicode_literals import struct import binascii import re import os ############################################################################# # some code to make this library run with both python2 and python3 ############################################################################# import sys if sys.version_info[0] == 3: long = int else: bytes = bytearray try: cmp(1, 2) except: # python3 does not have cmp def cmp(a, b): return (a > b) - (a < b) class cachedproperty(object): ## .. only works with python3 somehow. 
-- todo: figure out why not with python2 def __init__(self, method): self.method = method self.name = '_' + method.__name__ def __get__(self, obj, cls): if not hasattr(obj, self.name): value = self.method(obj) setattr(obj, self.name, value) else: value = getattr(obj, self.name) return value def strz(b, o): return b[o:b.find(b'\x00', o)].decode('utf-8', 'ignore') def makeStringIO(data): if sys.version_info[0] == 2: from StringIO import StringIO return StringIO(data) else: from io import BytesIO return BytesIO(data) ############################################################################# # some utility functions ############################################################################# def nonefmt(fmt, item): # helper for outputting None without raising an error if item is None: return "-" return fmt % item def hexdump(data): if data is None: return return binascii.b2a_hex(data).decode('utf-8') ############################################################################# class FileSection(object): """ Presents a file like object which is a section of a larger file. `fh` is expected to have a seek and read method. This class is used to access a section (e.g. the .id0 file) of a larger file (e.g. the .idb file) and make read/seek behave as if it were a separate file. """ def __init__(self, fh, start, end): self.fh = fh self.start = start self.end = end self.curpos = 0 self.fh.seek(self.start) def read(self, size=None): want = self.end - self.start - self.curpos if size is not None and want > size: want = size if want <= 0: return b"" # make sure filepointer is at correct position since we are sharing the fh object with others. 
self.fh.seek(self.curpos + self.start) data = self.fh.read(want) self.curpos += len(data) return data def seek(self, offset, *args): def isvalidpos(offset): return 0 <= offset <= self.end - self.start if len(args) == 0: whence = 0 else: whence = args[0] if whence == 0: if not isvalidpos(offset): print("invalid seek: from %x to SET:%x" % (self.curpos, offset)) raise Exception("illegal offset") self.curpos = offset elif whence == 1: if not isvalidpos(self.curpos + offset): raise Exception("illegal offset") self.curpos += offset elif whence == 2: if not isvalidpos(self.end - self.start + offset): raise Exception("illegal offset") self.curpos = self.end - self.start + offset self.fh.seek(self.curpos + self.start) def tell(self): return self.curpos class IdaUnpacker: """ Decodes packed ida structures. This is used o.a. in struct definitions, and .id2 files Related sdk functions: pack_dd, unpack_dd, etc. """ def __init__(self, wordsize, data): self.wordsize = wordsize self.data = data self.o = 0 def eof(self): return self.o >= len(self.data) def have(self, n): return self.o+n <= len(self.data) def nextword(self): """ Return an unsigned word-sized integer from the buffer """ if self.wordsize == 4: return self.next32() elif self.wordsize == 8: return self.next64() else: raise Exception("unsupported wordsize") def nextwordsigned(self): """ Return a signed word-sized integer from the buffer """ if self.wordsize == 4: val = self.next32() if val < 0x80000000: return val return val - 0x100000000 elif self.wordsize == 8: val = self.next64() if val < 0x8000000000000000: return val return val - 0x10000000000000000 else: raise Exception("unsupported wordsize") def next64(self): if self.eof(): return None lo = self.next32() hi = self.next32() return (hi<<32) | lo def next16(self): """ Return a packed 16 bit integer from the buffer """ if self.eof(): return None byte = self.data[self.o:self.o+1] if byte == b'\xff': # a 16 bit value: # 1111 1111 xxxx xxxx xxxx xxxx if self.o+3 > 
len(self.data): return None val, = struct.unpack_from(">H", self.data, self.o+1) self.o += 3 return val elif byte < b'\x80': # a 7 bit value: # 0xxx xxxx self.o += 1 val, = struct.unpack("B", byte) return val elif byte < b'\xc0': # a 14 bit value: # 10xx xxxx xxxx xxxx if self.o+2 > len(self.data): return None val, = struct.unpack_from(">H", self.data, self.o) self.o += 2 return val&0x3FFF else: return None def next8(self): if self.eof(): return None byte = self.data[self.o:self.o+1] self.o += 1 val, = struct.unpack("B", byte) return val def next32(self): """ Return a packed integer from the buffer """ if self.eof(): return None byte = self.data[self.o:self.o+1] if byte == b'\xff': # a 32 bit value: # 1111 1111 xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx if self.o+5 > len(self.data): return None val, = struct.unpack_from(">L", self.data, self.o+1) self.o += 5 return val elif byte < b'\x80': # a 7 bit value: # 0xxx xxxx self.o += 1 val, = struct.unpack("B", byte) return val elif byte < b'\xc0': # a 14 bit value: # 10xx xxxx xxxx xxxx if self.o+2 > len(self.data): return None val, = struct.unpack_from(">H", self.data, self.o) self.o += 2 return val&0x3FFF elif byte < b'\xe0': # a 29 bit value: # 110x xxxx xxxx xxxx xxxx xxxx xxxx xxxx if self.o+4 > len(self.data): return None val, = struct.unpack_from(">L", self.data, self.o) self.o += 4 return val&0x1FFFFFFF else: return None def bytes(self, n): """ Return fixed length string from buffer """ if not self.have(n): return None data = self.data[self.o : self.o+n] self.o += n return data class IDBFile(object): """ Provide access to the various sections in an .idb file. Usage: idb = IDBFile(fhandle) id0 = idb.getsection(ID0File) ID0File is expected to have a class property 'INDEX' # v1..v5 id1 and nam files start with 'Va0' .. 
'Va4' # v6 id1 and nam files start with 'VA*' # til files start with 'IDATIL' # id2 files start with 'IDAS\x1d\xa5\x55\x55' """ def __init__(self, fh): """ constructor takes a filehandle """ self.fh = fh self.fh.seek(0) hdrdata = self.fh.read(0x100) self.magic = hdrdata[0:4].decode('utf-8', 'ignore') if self.magic not in ('IDA0', 'IDA1', 'IDA2'): raise Exception("invalid file magic") values = struct.unpack_from("<6LH6L", hdrdata, 6) if values[5] != 0xaabbccdd: fileversion = 0 offsets = list(values[0:5]) offsets.append(0) checksums = [0 for _ in range(6)] else: fileversion = values[6] if fileversion < 5: offsets = list(values[0:5]) checksums = list(values[8:13]) idsofs, idscheck = struct.unpack_from("> 1 if k < a[mid].key: last = mid else: first = mid + 1 return first - 1 """ ################################################################################ I would have liked to make these classes a nested class of BTree, but the problem is than there is no way for a nested-nested class of BTree to refer back to a toplevel nested class of BTree. So moving these outside of BTree so i can use them as baseclasses in the various page implementations class BTree: class BaseEntry(object): pass class BasePage(object): pass class Page15(BasePage): class Entry(BTree.BaseEntry): pass >>> NameError: name 'BTree' is not defined """ class BaseIndexEntry(object): """ Baseclass for Index Entries. Index entries have a key + value, and a page containing keys larger than that key in this index entry. """ def __init__(self, data): ofs = self.recofs if self.recofs < 6: # reading an invalid page... self.val = self.key = None return keylen, = struct.unpack_from(", <, ==, >=, <= ) """ class BasePage(object): """ Baseclass for Pages. for the various btree versions ( 1.5, 1.6 and 2.0 ) there are subclasses which specify the exact layout of the page header, and index / leaf entries. Leaf pages don't have a 'preceeding' page pointer. 
""" def __init__(self, data, entsize, entfmt): self.preceeding, self.count = struct.unpack_from(entfmt, data) if self.preceeding: entrytype = self.IndexEntry else: entrytype = self.LeafEntry self.index = [] key = b"" for i in range(self.count): ent = entrytype(key, data, entsize * (1 + i)) self.index.append(ent) key = ent.key self.unknown, self.freeptr = struct.unpack_from(entfmt, data, entsize * (1 + self.count)) def find(self, key): """ Searches pages for key, returns relation to key: recurse -> found a next level index page to search for key. also returns the next level page nr gt -> found a value with a key greater than the one searched for. lt -> found a value with a key less than the one searched for. eq -> found a value with a key equal to the one searched for. gt, lt and eq return the index for the key found. # for an index entry: the key is 'less' than anything in the page pointed to. """ i = binary_search(self.index, key) if i < 0: if self.isindex(): return ('recurse', -1) return ('gt', 0) if self.index[i].key == key: return ('eq', i) if self.isindex(): return ('recurse', i) return ('lt', i) def getpage(self, ix): """ For Indexpages, returns the page ptr for the specified entry """ return self.preceeding if ix < 0 else self.index[ix].page def getkey(self, ix): """ For all page types, returns the key for the specified entry """ return self.index[ix].key def getval(self, ix): """ For all page types, returns the value for the specified entry """ return self.index[ix].val def isleaf(self): """ True when this is a Leaf Page """ return self.preceeding == 0 def isindex(self): """ True when this is an Index Page """ return self.preceeding != 0 def __repr__(self): return ("leaf" if self.isleaf() else ("index<%d>" % self.preceeding)) + repr(self.index) ###################################################### # Page objects for the various versions of the database ###################################################### class Page15(BasePage): """ v1.5 b-tree page """ 
class IndexEntry(BaseIndexEntry): def __init__(self, key, data, ofs): self.page, self.recofs = struct.unpack_from("= 0: self.stack.append((page, ix)) else: # move towards leaf self.stack.append((page, ix)) while page.isindex(): page = self.db.readpage(page.getpage(ix)) ix = len(page.index) - 1 self.stack.append((page, ix)) def eof(self): return len(self.stack) == 0 def getkey(self): """ return the key value pointed to by the cursor """ page, ix = self.stack[-1] return page.getkey(ix) def getval(self): """ return the data value pointed to by the cursor """ page, ix = self.stack[-1] return page.getval(ix) def __repr__(self): return "cursor:" + repr(self.stack) def __init__(self, fh): """ BTree constructor - takes a filehandle """ self.fh = fh self.fh.seek(0) data = self.fh.read(64) if data[13:].startswith(b"B-tree v 1.5 (C) Pol 1990"): self.parseheader15(data) self.page = self.Page15 self.version = 15 elif data[19:].startswith(b"B-tree v 1.6 (C) Pol 1990"): self.parseheader16(data) self.page = self.Page16 self.version = 16 elif data[19:].startswith(b"B-tree v2"): self.parseheader16(data) self.page = self.Page20 self.version = 20 else: print("unknown btree: %s" % hexdump(data)) raise Exception("unknown b-tree") def parseheader15(self, data): self.firstfree, self.pagesize, self.firstindex, self.reccount, self.pagecount = struct.unpack_from(" record equal to the key, None when not found 'le' -> last record with key <= to key 'ge' -> first record with key >= to key 'lt' -> last record with key < to key 'gt' -> first record with key > to key """ # descend tree to leaf nearest to the `key` page = self.readpage(self.firstindex) stack = [] while len(stack) < 256: act, ix = page.find(key) stack.append((page, ix)) if act != 'recurse': break page = self.readpage(page.getpage(ix)) if len(stack) == 256: raise Exception("b-tree corrupted") cursor = BTree.Cursor(self, stack) # now correct for what was actually asked. 
if act == rel: pass elif rel == 'eq' and act != 'eq': return None elif rel in ('ge', 'le') and act == 'eq': pass elif rel in ('gt', 'ge') and act == 'lt': cursor.next() elif rel == 'gt' and act == 'eq': cursor.next() elif rel in ('lt', 'le') and act == 'gt': cursor.prev() elif rel == 'lt' and act == 'eq': cursor.prev() return cursor def dump(self): """ raw dump of all records in the b-tree """ print("pagesize=%08x, reccount=%08x, pagecount=%08x" % (self.pagesize, self.reccount, self.pagecount)) self.dumpfree() self.dumptree(self.firstindex) def dumpfree(self): """ list all free pages """ fmt = "L" if self.version > 15 else "H" hdrsize = 8 if self.version > 15 else 4 pn = self.firstfree if pn == 0: print("no free pages") return while pn: self.fh.seek(pn * self.pagesize) data = self.fh.read(self.pagesize) if len(data) == 0: print("could not read FREE data at page %06x" % pn) break count, nextfree = struct.unpack_from("<" + (fmt * 2), data) freepages = list(struct.unpack_from("<" + (fmt * count), data, hdrsize)) freepages.insert(0, pn) for pn in freepages: self.fh.seek(pn * self.pagesize) data = self.fh.read(self.pagesize) print("%06x: free: %s" % (pn, hexdump(data[:64]))) pn = nextfree def dumpindented(self, pn, indent=0): """ Dump all nodes of the current page with keys indented, showing how the `indent` feature works """ page = self.readpage(pn) print(" " * indent, page) if page.isindex(): print(" " * indent, end="") self.dumpindented(page.preceeding, indent + 1) for p in range(len(page.index)): print(" " * indent, end="") self.dumpindented(page.getpage(p), indent + 1) def dumptree(self, pn): """ Walks entire tree, dumping all records on each page in sequential order """ page = self.readpage(pn) print("%06x: preceeding = %06x, reccount = %04x" % (pn, page.preceeding, page.count)) for ent in page.index: print(" %s" % ent) if page.preceeding: self.dumptree(page.preceeding) for ent in page.index: self.dumptree(ent.page) def pagedump(self): """ dump the contents of all 
pages, ignoring links between pages, this will enable you to view contents of pages which have become lost due to datacorruption. """ self.fh.seek(self.pagesize) pn = 1 while True: try: pagedata = self.fh.read(self.pagesize) if len(pagedata) == 0: break elif len(pagedata) != self.pagesize: print("%06x: incomplete - %d bytes ( pagesize = %d )" % (pn, len(pagedata), self.pagesize)) break elif pagedata == b'\x00' * self.pagesize: print("%06x: empty" % (pn)) else: page = self.page(pagedata) print("%06x: preceeding = %06x, reccount = %04x" % (pn, page.preceeding, page.count)) for ent in page.index: print(" %s" % ent) except Exception as e: print("%06x: ERROR decoding as B-tree page: %s" % (pn, e)) pn += 1 class ID0File(object): """ Reads .id0 or 0.ida files, containing a v1.5, v1.6 or v2.0 b-tree database. This is basically the low level netnode interface from the idasdk. There are two major groups of nodes in the database: key = "N"+name -> value = littleendian(nodeid) key = "."+bigendian(nodeid)+char(tag)+bigendian(value) key = "."+bigendian(nodeid)+char(tag)+string key = "."+bigendian(nodeid)+char(tag) and some special nodes for bookkeeping: "$ MAX LINK" "$ MAX NODE" "$ NET DESC" Very old databases also have name entries with a lowercase 'n', and corresponding '-'+value nodes. I am not sure what those are for. several items have specially named nodes, like "$ structs", "$ enums", "Root Node" nodeByName(name) returns the nodeid for a name bytes(nodeid, tag, val) returns the value for a specific node. """ INDEX = 0 def __init__(self, idb, fh): self.btree = BTree(fh) self.wordsize = None self.maxnode = None if idb.magic == 'IDA2': # .i64 files use 64 bit values for some things. 
self.wordsize = 8 elif idb.magic in ('IDA0', 'IDA1'): self.wordsize = 4 else: # determine wordsize from value of '$ MAX NODE' c = self.btree.find('eq', b'$ MAX NODE') if c and not c.eof(): self.maxnode = c.getval() self.wordsize = len(c.getval()) if self.wordsize not in (4, 8): print("Can not determine wordsize for database - assuming 32 bit") self.wordsize = 4 if self.wordsize == 4: self.nodebase = 0xFF000000 if not self.maxnode: self.maxnode = self.nodebase + 0x0FFFFF self.fmt = "L" else: self.nodebase = 0xFF00000000000000 if not self.maxnode: self.maxnode = self.nodebase + 0x0FFFFFFF self.fmt = "Q" # set the keyformat for this database self.keyfmt = ">s" + self.fmt + "s" + self.fmt @cachedproperty def root(self): return self.nodeByName("Root Node") # note: versions before 4.7 used a short instead of a long # and stored the versions with one minor digit ( 43 ) , instead of two ( 480 ) @cachedproperty def idaver(self): return self.int(self.root, 'A', -1) @cachedproperty def idbparams(self): return self.bytes(self.root, 'S', 0x41b994) @cachedproperty def idaverstr(self): return self.string(self.root, 'S', 1303) @cachedproperty def nropens(self): return self.int(self.root, 'A', -4) @cachedproperty def creationtime(self): return self.int(self.root, 'A', -2) @cachedproperty def originmd5(self): return self.bytes(self.root, 'S', 1302) @cachedproperty def somecrc(self): return self.int(self.root, 'A', -5) def prettykey(self, key): """ returns the key in a readable format. 
""" f = list(self.decodekey(key)) f[0] = f[0].decode('utf-8') if len(f) > 2 and type(f[2]) == bytes: f[2] = f[2].decode('utf-8') if f[0] == '.': if len(f) == 2: return "%s%16x" % tuple(f) elif len(f) == 3: return "%s%16x %s" % tuple(f) elif len(f) == 4: if f[2] == 'H' and type(f[3]) in (str, bytes): f[3] = f[3].decode('utf-8') return "%s%16x %s '%s'" % tuple(f) elif type(f[3]) in (int, long): return "%s%16x %s %x" % tuple(f) else: f[3] = hexdump(f[3]) return "%s%16x %s %s" % tuple(f) elif f[0] in ('N', 'n', '$'): if type(f[1]) in (int, long): return "%s %x %16x" % tuple(f) else: return "%s'%s'" % tuple(f) elif f[0] == '-': return "%s %x" % tuple(f) return hexdump(key) def prettyval(self, val): """ returns the value in a readable format. """ if len(val) == self.wordsize and val[-1:] in (b'\x00', b'\xff'): return "%x" % struct.unpack("<" + self.fmt, val) if len(val) == self.wordsize and re.search(b'[\x00-\x08\x0b\x0c\x0e-\x1f]', val, re.DOTALL): return "%x" % struct.unpack("<" + self.fmt, val) if len(val) < 2 or not re.match(b'^[\x09\x0a\x0d\x20-\xff]+.$', val, re.DOTALL): return hexdump(val) val = val.replace(b"\n", b"\\n") return "'%s'" % val.decode('utf-8', 'ignore') def nodeByName(self, name): """ Return a nodeid by name """ # note: really long names are encoded differently: # 'N'+'\x00'+pack('Q', nameid) => ofs # and (ofs, 'N') -> nameid # at nodebase ( 0xFF000000, 'S', 0x100*nameid ) there is a series of blobs for max 0x80000 sized names. 
cur = self.btree.find('eq', self.namekey(name)) if cur: return struct.unpack('<' + self.fmt, cur.getval())[0] def namekey(self, name): if type(name) in (int, long): return struct.pack(" 1: # utf-8 encode the tag args = args[:1] + (args[1].encode('utf-8'),) + args[2:] if len(args) == 3 and type(args[-1]) == str: # node.tag.string type keys return struct.pack(self.keyfmt[:1 + len(args)], b'.', *args[:-1]) + args[-1].encode('utf-8') elif len(args) == 3 and type(args[-1]) == type(-1) and args[-1] < 0: # negative values -> need lowercase fmt char return struct.pack(self.keyfmt[:1 + len(args)] + self.fmt.lower(), b'.', *args) else: # node.tag.value type keys return struct.pack(self.keyfmt[:2 + len(args)], b'.', *args) def decodekey(self, key): """ splits a key in a tuple, one of: ( [ 'N', 'n', '$' ], 0, bignameid ) ( [ 'N', 'n', '$' ], name ) ( '-', id ) ( '.', id ) ( '.', id, tag ) ( '.', id, tag, value ) ( '.', id, 'H', name ) """ if key[:1] in (b'n', b'N', b'$'): if key[1:2] == b"\x00" and len(key) == 2 + self.wordsize: return struct.unpack(">sB" + self.fmt, key) else: return key[:1], key[1:].decode('utf-8', 'ignore') if key[:1] == b'-': return struct.unpack(">s" + self.fmt, key) if len(key) == 1 + self.wordsize: return struct.unpack(self.keyfmt[:3], key) if len(key) == 1 + self.wordsize + 1: return struct.unpack(self.keyfmt[:4], key) if len(key) == 1 + 2 * self.wordsize + 1: return struct.unpack(self.keyfmt[:5], key) if len(key) > 1 + self.wordsize + 1: f = struct.unpack_from(self.keyfmt[:4], key) return f + (key[2 + self.wordsize:], ) raise Exception("unknown key format") def bytes(self, *args): """ return a raw value for the given arguments """ if len(args) == 1 and isinstance(args[0], BTree.Cursor): cur = args[0] else: cur = self.btree.find('eq', self.makekey(*args)) if cur: return cur.getval() def int(self, *args): """ Return the integer stored in the specified node. 
Any type of integer will be decoded: byte, short, long, long long """ data = self.bytes(*args) if data is not None: if len(data) == 1: return struct.unpack("" + self.fmt, data, 1) nameblob = self.blob(self.nodebase, 'S', nameid * 256, nameid * 256 + 32) return nameblob.rstrip(b"\x00").decode('utf-8') return data.rstrip(b"\x00").decode('utf-8') def blob(self, nodeid, tag, start=0, end=0xFFFFFFFF): """ Blobs are stored in sequential nodes with increasing index values. most blobs, like scripts start at index 0, long names start at a specified offset. """ startkey = self.makekey(nodeid, tag, start) endkey = self.makekey(nodeid, tag, end) cur = self.btree.find('ge', startkey) data = b'' while cur.getkey() <= endkey: data += cur.getval() cur.next() return data class ID1File(object): """ Reads .id1 or 1.IDA files, containing byte flags This is basically the information for the .idc GetFlags(ea), FirstSeg(), NextSeg(ea), SegStart(ea), SegEnd(ea) functions """ INDEX = 1 class SegInfo: def __init__(self, startea, endea, offset): self.startea = startea self.endea = endea self.offset = offset def __init__(self, idb, fh): if idb.magic == 'IDA2': wordsize, fmt = 8, "Q" else: wordsize, fmt = 4, "L" # todo: verify wordsize using the following heuristic: # L -> starting at: seglistofs + nsegs*seginfosize are all zero # L -> starting at seglistofs .. nsegs*seginfosize every even word must be unique self.fh = fh fh.seek(0) hdrdata = fh.read(32) magic = hdrdata[:4] if magic in (b'Va4\x00', b'Va3\x00', b'Va2\x00', b'Va1\x00', b'Va0\x00'): nsegments, npages = struct.unpack_from(" starting at: seglistofs + nsegs*seginfosize are all zero # L -> starting at seglistofs .. 
nsegs*seginfosize every even word must be unique def dump(self): """ print first and last bits for each segment """ for seg in self.seglist: print("==== %08x-%08x" % (seg.startea, seg.endea)) if seg.endea - seg.startea < 30: for ea in range(seg.startea, seg.endea): print(" %08x: %08x" % (ea, self.getFlags(ea))) else: for ea in range(seg.startea, seg.startea + 10): print(" %08x: %08x" % (ea, self.getFlags(ea))) print("...") for ea in range(seg.endea - 10, seg.endea): print(" %08x: %08x" % (ea, self.getFlags(ea))) def find_segment(self, ea): """ do a linear search for the given address in the segment list """ for seg in self.seglist: if seg.startea <= ea < seg.endea: return seg def getFlags(self, ea): seg = self.find_segment(ea) if not seg: return 0 self.fh.seek(seg.offset + 4 * (ea - seg.startea)) return struct.unpack(">= 1 self.wordsize = wordsize self.wordfmt = fmt self.nnames = nnames self.pagesize = pagesize def dump(self): print("nam: nnames=%d, npages=%d, pagesize=%08x" % (self.nnames, self.npages, self.pagesize)) def allnames(self): self.fh.seek(self.pagesize) n = 0 while n < self.nnames: data = self.fh.read(self.pagesize) want = min(self.nnames - n, int(self.pagesize / self.wordsize)) ofslist = struct.unpack_from("<%d%s" % (want, self.wordfmt), data, 0) for ea in ofslist: yield ea n += want class SEGFile(object): """ reads .seg or $SEGS.IDA files. """ INDEX = 3 def __init__(self, idb, fh): pass class TILFile(object): """ reads .til files """ INDEX = 4 def __init__(self, idb, fh): pass # note: v3 databases had a .reg instead of .til class ID2File(object): """ Reads .id2 files ID2 sections contain packed data, resulting in tripples of unknown use. 
""" INDEX = 5 def __init__(self, idb, fh): pass class Struct: """ Decodes info for structures (structnode, N) = structname (structnode, D, address) = xref-type (structnode, M, 0) = packed struct info (structnode, S, 27) = packed value(addr, byte) """ class Member: """ (membernode, N) = struct.member-name (membernode, A, 3) = structid+1 (membernode, A, 8) = (membernode, A, 11) = enumid+1 (membernode, A, 16) = flag? -- 4:variable length flag? (membernode, S, 0x3000) = type (set with 'Y') (membernode, S, 0x3001) = names used in 'type' (membernode, S, 5) = array type? (membernode, S, 9) = offset-type (membernode, D, address) = xref-type (membernode, d, structid) = xref-type -- for sub-structs """ def __init__(self, id0, spec): self._id0 = id0 self._nodeid = spec.nextword() + self._id0.nodebase self.skip = spec.nextword() self.size = spec.nextword() self.flags = spec.next32() self.props = spec.next32() self.ofs = None @cachedproperty def name(self): return self._id0.name(self._nodeid) @cachedproperty def enumid(self): return self._id0.int(self._nodeid, 'A', 11) @cachedproperty def stringtype(self): return self._id0.int(self._nodeid, 'A', 16) @cachedproperty def structid(self): return self._id0.int(self._nodeid, 'A', 3) @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def ptrinfo(self): return self._id0.bytes(self._nodeid, 'S', 9) @cachedproperty def typeinfo(self): return self._id0.bytes(self._nodeid, 'S', 0x3000) def __init__(self, id0, nodeid): self._id0 = id0 self._nodeid = nodeid spec = self._id0.blob(self._nodeid, 'M') p = IdaUnpacker(self._id0.wordsize, spec) if self._id0.idaver >= 40: # 1 = SF_VAR, 2 = SF_UNION, 4 = SF_HASHUNI, 8 = SF_NOLIST, 0x10 = SF_TYPLIB, 0x20 = SF_HIDDEN, 0x40 = SF_FRAME, 0xF80 = SF_ALIGN, 0x1000 = SF_GHOST self.flags = p.next32() else: self.flags = 0 nmembers = p.next32() self.members = [] o = 0 for i in range(nmembers): m = Struct.Member(self._id0, p) m.ofs = 
o o += m.size self.members.append(m) self.extra = [] while not p.eof(): self.extra.append(p.next32()) @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def name(self): return self._id0.name(self._nodeid) def __iter__(self): for m in self.members: yield m class Enum: """ (enumnode, N) = enum-name (enumnode, A, -1) = nr of values (enumnode, A, -3) = representation (enumnode, A, -5) = flags: bitfield, hidden, ... (enumnode, A, -8) = (enumnode, E, value) = valuenode + 1 """ class Member: """ (membernode, N) = membername (membernode, A, -2) = enumnode + 1 (membernode, A, -3) = member value """ def __init__(self, id0, nodeid): self._id0 = id0 self._nodeid = nodeid @cachedproperty def value(self): return self._id0.int(self._nodeid, 'A', -3) @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def name(self): return self._id0.name(self._nodeid) def __init__(self, id0, nodeid): self._id0 = id0 self._nodeid = nodeid @cachedproperty def count(self): return self._id0.int(self._nodeid, 'A', -1) @cachedproperty def representation(self): return self._id0.int(self._nodeid, 'A', -3) # flags>>3 -> width # flags&1 -> bitfield @cachedproperty def flags(self): return self._id0.int(self._nodeid, 'A', -5) @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def name(self): return self._id0.name(self._nodeid) def __iter__(self): startkey = self._id0.makekey(self._nodeid, 'E') endkey = self._id0.makekey(self._nodeid, 'F') cur = self._id0.btree.find('ge', startkey) while cur.getkey() < endkey: yield Enum.Member(self._id0, self._id0.int(cur) - 1) cur.next() class Bitfield: class Member: def __init__(self, id0, nodeid): self._id0 = id0 self._nodeid = nodeid @cachedproperty def value(self): return self._id0.int(self._nodeid, 'A', -3) @cachedproperty def 
mask(self): return self._id0.int(self._nodeid, 'A', -6) - 1 @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def name(self): return self._id0.name(self._nodeid) class Mask: def __init__(self, id0, nodeid, mask): self._id0 = id0 self._nodeid = nodeid self.mask = mask @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def name(self): return self._id0.name(self._nodeid) def __iter__(self): """ Enumerates all Masks """ startkey = self._id0.makekey(self._nodeid, 'E') endkey = self._id0.makekey(self._nodeid, 'F') cur = self._id0.btree.find('ge', startkey) while cur.getkey() < endkey: yield Bitfield.Member(self._id0, self._id0.int(cur) - 1) cur.next() def __init__(self, id0, nodeid): self._id0 = id0 self._nodeid = nodeid @cachedproperty def count(self): return self._id0.int(self._nodeid, 'A', -1) @cachedproperty def representation(self): return self._id0.int(self._nodeid, 'A', -3) @cachedproperty def flags(self): return self._id0.int(self._nodeid, 'A', -5) @cachedproperty def comment(self, repeatable): return self._id0.string(self._nodeid, 'S', 1 if repeatable else 0) @cachedproperty def name(self): return self._id0.name(self._nodeid) def __iter__(self): """ Enumerates all Masks """ startkey = self._id0.makekey(self._nodeid, 'm') endkey = self._id0.makekey(self._nodeid, 'n') cur = self._id0.btree.find('ge', startkey) while cur.getkey() < endkey: key = self._id0.decodekey(cur.getkey()) yield Bitfield.Mask(self._id0, self._id0.int(cur) - 1, key[-1]) cur.next() class IDBParams: def __init__(self, id0, data): self._id0 = id0 magic, self.version, = struct.unpack_from("<3sH", data, 0) if self.version<700: cpu, self.idpflags, self.demnames, self.filetype, self.coresize, self.corestart, self.ostype, self.apptype = struct.unpack_from("<8sBBH" + (id0.fmt * 2) + "HH", data, 5) self.cpu = strz(cpu, 0) else: p = 
IdaUnpacker(id0.wordsize, data[5:]) cpulen = p.next32() self.cpu = p.bytes(cpulen) genflags = p.next32() self.idpflags = p.next32() self.demnames = 0 changecount = p.next32() self.filetype = p.next32() self.ostype = p.next32() self.apptype = p.next32() asmtype = p.next32() specsegs = p.next32() specsegs = p.next32() aflags = p.next32() aflags2 = p.next32() base = p.nextword() startss = p.nextword() startcs = p.nextword() startip = p.nextword() startea = p.nextword() startsp = p.nextword() main = p.nextword() minea = p.nextword() maxea = p.nextword() self.coresize = 0 self.corestart = 0 class Script: def __init__(self, id0, nodeid): self._id0 = id0 self._nodeid = nodeid @cachedproperty def name(self): return self._id0.string(self._nodeid, 'S', 0) @cachedproperty def language(self): return self._id0.string(self._nodeid, 'S', 1) @cachedproperty def body(self): return strz(self._id0.blob(self._nodeid, 'X'), 0) class Segment: """ Decodes a value from "$ segs", see segment_t in segment.hpp for details. """ def __init__(self, id0, spec): self._id0 = id0 p = IdaUnpacker(id0.wordsize, spec) self.startea = p.nextword() self.size = p.nextword() self.name_id = p.nextword() self.class_id = p.nextword() self.orgbase = p.nextword() self.unknown = p.next16() self.align = p.next8() self.comb = p.next8() self.perm = p.next8() self.bitness = p.next8() self.flags = p.next8() self.selector = p.nextword() self.defsr = [p.nextword() for _ in range(16)] self.color = p.next32() ================================================ FILE: idbtool.py ================================================ #!/usr/bin/python3 """ Tool for querying information from Hexrays .idb and .i64 files without launching IDA. Copyright (c) 2016 Willem Hengeveld """ # todo: # '$ segs' # S = packed(startea, size, ....) 
#  '$ srareas'
#      a = packed(startea, size, flag, flag) -- includes functions
#      b = packed(startea, size, flag, flag) -- segment
#      c = packed(startea, size, flag, flag) -- same as 'b'
#
from __future__ import division, print_function, absolute_import, unicode_literals
import sys
import os
if sys.version_info[0] == 2:
    import scandir
    os.scandir = scandir.scandir
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf-8')
if sys.version_info[0] == 2:
    stdout = sys.stdout
else:
    stdout = sys.stdout.buffer

import struct
import binascii
import argparse
import itertools
from collections import defaultdict
import re
from datetime import datetime

import idblib
from idblib import hexdump


def timestring(t):
    if t == 0:
        return "....-..-.. ..:..:.."
    return datetime.strftime(datetime.fromtimestamp(t), "%Y-%m-%d %H:%M:%S")


def strz(b, o):
    return b[o:b.find(b'\x00', o)].decode('utf-8', 'ignore')


def nonefmt(fmt, num):
    if num is None:
        return "-"
    return fmt % num


######### license encoding ################

def decryptuser(data):
    """
    The '$ original user' node is encrypted with Hex-Rays' private key.
    Hence we can easily decrypt it, but not change it to something else.
    We can however copy the entry from another database, or just replace it with garbage.

    The node contains 128 bytes encrypted license, followed by 32 bytes zero.

    Note: I found several ida55 databases online where this does not work.
    Possibly these were created using a cracked version of IDA.
""" data = int(binascii.b2a_hex(data[127::-1]), 16) user = pow(data, 0x13, 0x93AF7A8E3A6EB93D1B4D1FB7EC29299D2BC8F3CE5F84BFE88E47DDBDD5550C3CE3D2B16A2E2FBD0FBD919E8038BB05752EC92DD1498CB283AA087A93184F1DD9DD5D5DF7857322DFCD70890F814B58448071BBABB0FC8A7868B62EB29CC2664C8FE61DFBC5DB0EE8BF6ECF0B65250514576C4384582211896E5478F95C42FDED) user = binascii.a2b_hex("%0256x" % user) return user[1:] def licensestring(lic): """ decode a license blob """ if not lic: return if len(lic) < 127: print("too short license format: %s" % binascii.b2a_hex(lic)) return elif len(lic) > 127 and sum(lic[127:]) != 0: print("too long license format: %s" % binascii.b2a_hex(lic)) return if struct.unpack_from("= 128: user0 = decryptuser(user0) else: user0 = user0[:127] # user0 has 128 bytes rsa encrypted license, followed by 32 bytes zero print("orig: %s" % licensestring(user0)) # ida9 has S10+S11 == license json user10 = id0.blob(orignode, 'S', 16) if user10: import json user10 = json.loads(user10) print("orig: %s" % user10) curnode = id0.nodeByName('$ user1') if curnode: user1 = id0.bytes(curnode, 'S', 0) print("user: %s" % licensestring(user1)) ######### idb summary ######### filetypelist = [ "MS DOS EXE File", "MS DOS COM File", "Binary File", "MS DOS Driver", "New Executable (NE)", "Intel Hex Object File", "MOS Technology Hex Object File", "Linear Executable (LX)", "Linear Executable (LE)", "Netware Loadable Module (NLM)", "Common Object File Format (COFF)", "Portable Executable (PE)", "Object Module Format", "R-records", "ZIP file (this file is never loaded to IDA database)", "Library of OMF Modules", "ar library", "file is loaded using LOADER DLL", "Executable and Linkable Format (ELF)", "Watcom DOS32 Extender (W32RUN)", "Linux a.out (AOUT)", "PalmPilot program file", "MS DOS EXE File", "MS DOS COM File", "AIX ar library", "Mac OS X Mach-O file", ] def dumpinfo(id0): """ print various infos on the idb file """ def ftstring(ft): if 0 < ft < len(filetypelist): return "%02x:%s" % (ft, 
                filetypelist[ft])
        return "%02x:unknown" % ft

    def decodebitmask(fl, bitnames):
        l = []
        knownbits = 0
        for bit, name in enumerate(bitnames):
            if fl & (1 << bit) and name is not None:
                l.append(name)
                knownbits |= 1 << bit
        if fl & ~knownbits:
            l.append("unknown_%x" % (fl & ~knownbits))
        return ",".join(l)

    def osstring(fl):
        return decodebitmask(fl, ['msdos', 'win', 'os2', 'netw', 'unix', 'other'])

    def appstring(fl):
        return decodebitmask(fl, ['console', 'graphics', 'exe', 'dll', 'driver', '1thread', 'mthread', '16bit', '32bit', '64bit'])

    ldr = id0.nodeByName("$ loader name")
    if ldr:
        print("loader: %s %s" % (id0.string(ldr, 'S', 0), id0.string(ldr, 'S', 1)))

    if not id0.root:
        print("database has no RootNode")
        return

    if id0.idbparams:
        params = idblib.IDBParams(id0, id0.idbparams)

        print("cpu: %s, version=%d, filetype=%s, ostype=%s, apptype=%s, core:%x, size:%x" % (
            params.cpu, params.version, ftstring(params.filetype), osstring(params.ostype),
            appstring(params.apptype), params.corestart, params.coresize))

    print("idaver=%s: %s" % (nonefmt("%04d", id0.idaver), id0.idaverstr))

    srcmd5 = id0.originmd5
    print("nopens=%s, ctime=%s, crc=%s, md5=%s" % (
        nonefmt("%d", id0.nropens), nonefmt("%08x", id0.creationtime),
        nonefmt("%08x", id0.somecrc), hexdump(srcmd5) if srcmd5 else "-"))

    dumpuser(id0)


def dumpnames(args, id0, nam):
    for ea in nam.allnames():
        print("%08x: %s" % (ea, id0.name(ea)))


def dumpscript(id0, node):
    """ dump all stored scripts """
    s = idblib.Script(id0, node)

    print("======= %s %s =======" % (s.language, s.name))
    print(s.body)


def dumpstructmember(m):
    """
    Dump info for a struct member.
    """
    print(" %02x %02x %08x %02x: %-40s" % (m.skip, m.size, m.flags, m.props, m.name), end="")
    if m.enumid:
        print(" enum %08x" % m.enumid, end="")
    if m.structid:
        print(" struct %08x" % m.structid, end="")
    if m.ptrinfo:
        # packed
        # note: 64bit nrs are stored low32, high32
        # flags1, target, base, delta, flags2

        # flags1:
        # 0=off8 1=off16 2=off32 3=low8 4=low16 5=high8 6=high16 9=off64
        # 0x10 = targetaddr, 0x20 = baseaddr, 0x40 = delta, 0x80 = base is plainnum
        # flags2:
        # 1=image is off, 0x10 = subtract, 0x20 = signed operand
        print(" ptr %s" % m.ptrinfo, end="")
    if m.typeinfo:
        print(" type %s" % m.typeinfo, end="")
    print()


def dumpstruct(id0, node):
    """ dump all info for the struct defined by `node` """
    s = idblib.Struct(id0, node)

    print("struct %s, 0x%x" % (s.name, s.flags))
    for m in s:
        dumpstructmember(m)


def dumpbitmember(m):
    print(" %08x %s" % (m.value or 0, m.name))


def dumpmask(m):
    print(" mask %08x %s" % (m.mask, m.name))
    for m in m:
        dumpbitmember(m)


def dumpbitfield(id0, node):
    b = idblib.Bitfield(id0, node)
    print("bitfield %s, %s, %s, %s" % (b.name, nonefmt("0x%x", b.count), nonefmt("0x%x", b.representation), nonefmt("0x%x", b.flags)))
    for m in b:
        dumpmask(m)


def dumpenummember(m):
    """
    Print information on a single enum member
    """
    print(" %08x %s" % (m.value or 0, m.name))


def dumpenum(id0, node):
    """
    Dump all info for the enum defined by `node`
    """
    e = idblib.Enum(id0, node)
    if e.flags and e.flags&1:
        dumpbitfield(id0, node)
        return
    print("enum %s, %s, %s, %s" % (e.name, nonefmt("0x%x", e.count), nonefmt("0x%x", e.representation), nonefmt("0x%x", e.flags)))

    for m in e:
        dumpenummember(m)


def dumpimport(id0, node):
    # Note that '$ imports' is a list where the actual nodes
    # are stored in the list, therefore we add '1' to the node here.
    # first the named imports
    startkey = id0.makekey(node+1, 'S')
    endkey = id0.makekey(node+1, 'T')
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        txt = id0.string(cur)
        key = cur.getkey()
        ea = id0.decodekey(key)[3]
        print("%08x: %s" % (ea, txt))
        cur.next()

    # then list the imports by ordinal
    startkey = id0.makekey(node+1, 'A')
    endkey = id0.makekey(node+1, 'B')
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        ordinal = id0.decodekey(cur.getkey())[3]
        ea = id0.int(cur)
        print("%08x: (ord%04d) %s" % (ea, ordinal, id0.name(ea)))
        cur.next()


def enumlist(id0, listname, callback):
    """
    Lists are all stored in a similar way.

    (listnode, 'N') = listname
    (listnode, 'A', -1) = list size <-- not for '$ scriptsnippets'
    (listnode, 'A', seqnr) = itemnode+1
    (listnode, 'Y', itemnode) = seqnr <-- only with '$ enums'
    (listnode, 'Y', 0) = list size <-- only '$ scriptsnippets'
    (listnode, 'Y', 1) = ? <-- only '$ scriptsnippets'
    (listnode, 'S', seqnr) = dllname <-- only '$ imports'
    """
    listnode = id0.nodeByName(listname)
    if not listnode:
        return

    startkey = id0.makekey(listnode, 'A')
    endkey = id0.makekey(listnode, 'A', 0xFFFFFFFF)
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        item = id0.int(cur)
        callback(id0, item - 1)
        cur.next()


def listfuncdirs(id0):
    listnode = id0.nodeByName('$ dirtree/funcs')
    if not listnode:
        return

    dir_id = 0
    while True:
        start = dir_id * 0x10000
        end = start + 0xFFFF
        data = id0.blob(listnode, 'S', start, end)
        if data == b'':
            break
        dumpfuncdir(id0, dir_id, data)
        dir_id += 1


def dumpfuncdir(id0, dir_index, data):
    terminate = data.find(b'\0', 1)
    name = data[1:terminate].decode('utf-8')
    p = idblib.IdaUnpacker(id0.wordsize, data[terminate+1:])
    parent = p.nextword()
    unk = p.next32()
    if data[0] == 0:  # IDA 7.5
        subdir_count = p.next32()
        subdirs = []
        while subdir_count:
            subdir_id = p.nextwordsigned()
            if subdirs:
                subdir_id = subdirs[-1] + subdir_id
            subdirs.append(subdir_id)
            subdir_count -= 1

        func_count = p.next32()
        funcs = []
        while func_count:
            func_id = p.nextwordsigned()
            if funcs:
                func_id = funcs[-1] + func_id
            funcs.append(func_id)
            func_count -= 1
    elif data[0] == 1:  # IDA 7.6
        children_count = p.next32()
        children = []
        for i in range(children_count):
            next_child = p.nextwordsigned()
            if children:
                next_child += children[-1]
            children.append(next_child)

        subdir_count = p.next32()
        children_count -= subdir_count
        childtype_counts = [subdir_count]
        while children_count:
            childtype_count = p.next32()
            children_count -= childtype_count
            childtype_counts.append(childtype_count)

        subdirs = []
        funcs = []
        i = 0
        parsing_subdirs = True  # switch back and forth
        for childtype_count in childtype_counts:
            for _ in range(childtype_count):
                if parsing_subdirs:
                    subdirs.append(children[i])
                else:
                    funcs.append(children[i])
                i += 1
            parsing_subdirs = not parsing_subdirs
    else:
        raise NotImplementedError('unsupported funcdir schema')

    if not p.eof():
        raise Exception('not EOF after dir parsed')

    print("dir %d = %s" % (dir_index, name))
    print(" parent = %d" % parent)
    print(" subdirs:")
    for subdir in subdirs:
        print(" %d" % subdir)
    print(" functions:")
    for func in funcs:
        print(" 0x%x" % func)


def printent(args, id0, c):
    if args.verbose:
        print("%s = %s" % (id0.prettykey(c.getkey()), id0.prettyval(c.getval())))
    else:
        print("%s = %s" % (hexdump(c.getkey()), hexdump(c.getval())))


def createkey(args, id0, base, tag, ix):
    """
    parse base node specification:

    '?' + name   -> explicit 'N' + name key
    '#' + number -> nodeid relative to nodebase
    '.' + number -> absolute nodeid
    name         -> lookup by name.
    """
    if base[:1] == '?':
        return id0.namekey(base[1:])

    if re.match(r'^#(?:0[xX][0-9a-fA-F]+|\d+)$', base):
        nodeid = int(base[1:], 0) + id0.nodebase
    elif re.match(r'^\.(?:0[xX][0-9a-fA-F]+|\d+)$', base):
        nodeid = int(base[1:], 0)
    else:
        nodeid = id0.nodeByName(base)
        if nodeid and args.verbose > 1:
            print("found node %x for %s" % (nodeid, base))
    if nodeid is None:
        print("Could not find '%s'" % base)
        return

    s = [nodeid]
    if tag is not None:
        s.append(tag)
        if ix is not None:
            try:
                ix = int(ix, 0)
            except:
                pass
            s.append(ix)

    return id0.makekey(*s)


def enumeratecursor(args, c, onerec, callback):
    """
    Enumerate cursor in direction specified by `--dec` or `--inc`,
    taking into account the optional limit set by `--limit`

    Output according to verbosity level set by `--verbose`.
    """
    limit = args.limit
    while c and not c.eof() and (limit is None or limit > 0):
        callback(c)
        if args.dec:
            c.prev()
        else:
            c.next()
        if limit is not None:
            limit -= 1
        elif onerec:
            break


def id0query(args, id0, query):
    """
    queries start with an optional operator: <,<=,>,>=,==

    followed by either a name or address or nodeid

    Addresses are specified as a sequence of hexadecimal characters.
    Nodeid's may be specified either as the full node id, starting with ff00,
    or starting with a '_'
    Names are anything which can be found under the name tree in the database.

    after the name/addr/node there is optionally a semicolon, followed by a node tag,
    and another semicolon, followed by an index or hash string.
    """

    xlatop = {'=': 'eq', '==': 'eq', '>': 'gt', '<': 'lt', '>=': 'ge', '<=': 'le'}

    SEP = r";"
    m = re.match(r'^([=<>]=?)?(.+?)(?:' + SEP + r'(\w+)(?:' + SEP + r'(.+))?)?$', query)
    op = m.group(1) or "=="
    base = m.group(2)
    tag = m.group(3)  # optional ;tag
    ix = m.group(4)  # optional ;ix

    op = xlatop[op]

    c = id0.btree.find(op, createkey(args, id0, base, tag, ix))

    enumeratecursor(args, c, op=='eq', lambda c:printent(args, id0, c))


def getsegs(id0):
    """
    Returns a list of all segments.
    """
    seglist = []
    node = id0.nodeByName('$ segs')
    if not node:
        # return an empty list so callers can always iterate the result
        return seglist
    startkey = id0.makekey(node, 'S')
    endkey = id0.makekey(node, 'T')
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        s = idblib.Segment(id0, cur.getval())
        seglist.append(s)
        cur.next()

    return seglist


def listsegments(id0):
    """
    Print a summary of all segments found in the IDB.
    """
    ssnode = id0.nodeByName('$ segstrings')
    if not ssnode:
        print("can't find '$ segstrings' node")
        return
    segstrings = id0.blob(ssnode, 'S')
    p = idblib.IdaUnpacker(id0.wordsize, segstrings)
    unk = p.next32()
    nextid = p.next32()
    slist = []
    while not p.eof():
        slen = p.next32()
        if slen is None:
            break
        name = p.bytes(slen)
        if name is None:
            break
        slist.append(name.decode('utf-8', 'ignore'))

    segs = getsegs(id0)
    for s in segs:
        print("%08x - %08x %s" % (s.startea, s.startea+s.size, slist[s.name_id-1]))


def classifynodes(args, id0):
    """
    Attempt to classify all nodes in the IDA database.

    Note: this does not work for very old dbs
    """
    nodetype = {}
    tagstats = defaultdict(lambda : defaultdict(int))

    segs = getsegs(id0)

    print("node: %x .. %x" % (id0.nodebase, id0.maxnode))

    def addstat(nodetype, k):
        if len(k)<3:
            print("??? strange, expected longer key - %s" % k)
            return
        tag = k[2].decode('utf-8')
        if len(k)==3:
            tagstats[nodetype][(tag, )] += 1
        elif len(k)==4:
            value = k[3]
            if type(value)==int:
                if isaddress(value):
                    tagstats[nodetype][(tag, 'addr')] += 1
                elif isnode(value):
                    tagstats[nodetype][(tag, 'node')] += 1
                else:
                    if value >= id0.maxnode:
                        value -= pow(0x100, id0.wordsize)
                    tagstats[nodetype][(tag, value)] += 1
            else:
                tagstats[nodetype][(tag, 'string')] += 1
        else:
            print("??? strange, expected shorter key - %s" % k)
            return

    def isaddress(addr):
        for s in segs:
            if s.startea <= addr < s.startea+s.size:
                return True

    def isnode(addr):
        return id0.nodebase <= addr <= id0.maxnode

    def processbitfieldvalue(v):
        nodetype[v._nodeid] = 'bitfieldvalue'

    def processbitfieldmask(m):
        nodetype[m._nodeid] = 'bitfieldmask'
        for m in m:
            processbitfieldvalue(m)

    def processbitfield(id0, node):
        nodetype[node] = 'bitfield'
        b = idblib.Bitfield(id0, node)
        for m in b:
            processbitfieldmask(m)

    def processenummember(m):
        nodetype[m._nodeid] = 'enummember'

    def processenums(id0, node):
        nodetype[node] = 'enum'
        e = idblib.Enum(id0, node)
        if e.flags&1:
            processbitfield(id0, node)
            return
        for m in e:
            processenummember(m)

    def processstructmember(m, typename):
        nodetype[m._nodeid] = typename

    def processstructs(id0, node, typename):
        nodetype[node] = typename
        s = idblib.Struct(id0, node)
        for m in s:
            processstructmember(m, typename+"member")

    def processscripts(id0, node):
        nodetype[node] = 'script'

    def processaddr(id0, cur):
        k = id0.decodekey(cur.getkey())
        if len(k)==4 and k[2:4] == (b'A', 2):
            nodetype[id0.int(cur)-1] = 'hexrays'
        addstat('addr', k)

    def processfunc(id0, funcspec):
        p = idblib.IdaUnpacker(id0.wordsize, funcspec)

        funcstart = p.nextword()
        funcsize = p.nextword()
        flags = p.next16()
        if flags is None:
            return
        if flags&0x8000:  # is tail
            return
        node = p.nextword()
        if node<0xFFFFFF and node!=0:
            processstructs(id0, node + id0.nodebase, "frame")

    def processimport(id0, node):
        print("imp %08x" % node)
        startkey = id0.makekey(node+1, 'A')
        endkey = id0.makekey(node+1, 'B')
        cur = id0.btree.find('ge', startkey)
        while cur.getkey() < endkey:
            dllnode = id0.int(cur)
            nodetype[dllnode] = 'import'
            cur.next()

    # mark enums, structs, scripts.
    enumlist(id0, '$ enums', processenums)
    enumlist(id0, '$ structs', lambda id0, node : processstructs(id0, node, "struct"))
    enumlist(id0, '$ scriptsnippets', processscripts)
    enumlist(id0, '$ imports', processimport)

    # enum functions, scan for stackframes
    funcsnode = id0.nodeByName('$ funcs')
    startkey = id0.makekey(funcsnode, 'S')
    endkey = id0.makekey(funcsnode, 'T')
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        processfunc(id0, cur.getval())
        cur.next()

    clinode = id0.nodeByName('$ cli')
    if clinode:
        for letter in "ABCDEFGHIJKMcio":
            startkey = id0.makekey(clinode, letter)
            endkey = id0.makekey(clinode, chr(ord(letter)+1))
            cur = id0.btree.find('ge', startkey)
            while cur.getkey() < endkey:
                nodetype[id0.int(cur)] = 'cli.'+letter
                cur.next()

    # enum addresses, scan for hex-rays nodes
    startkey = b'.'
    endkey = id0.makekey(id0.nodebase)
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        processaddr(id0, cur)
        cur.next()

    # addresses above node list
    startkey = id0.makekey(id0.maxnode+1)
    endkey = b'/'
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        processaddr(id0, cur)
        cur.next()

    # scan for unmarked nodes
    # $ fr[0-9a-f]+\.\w+
    # $ fr[0-9a-f]+\. [rs]
    # $ F[0-9A-F]+\.\w+
    # $ Stack of \w+
    # Stack[0000007C]
    # xrefs to \w+
    startkey = id0.makekey(id0.nodebase)
    endkey = id0.makekey(id0.maxnode+1)
    cur = id0.btree.find('ge', startkey)
    while cur.getkey() < endkey:
        k = id0.decodekey(cur.getkey())
        node = k[1]
        if node not in nodetype:
            nodetype[node] = "unknown"
        if nodetype[node] == "unknown" and k[2] == b'N':
            name = cur.getval().rstrip(b'\x00')
            if re.match(br'\$ fr[0-9a-f]+\.\w+$', name):
                name = 'fr-type-functionframe'
            elif re.match(br'\$ fr[0-9a-f]+\. 
[rs]$', name): name = 'fr-type-functionframe' elif re.match(br'\$ F[0-9A-F]+\.\w+$', name): name = 'F-type-functionframe' elif name.startswith(b'Stack of '): name = 'stack-type-functionframe' elif name.startswith(b'Stack['): name = 'old-stack-type-functionframe' elif name.startswith(b'xrefs to '): name = 'old-xrefs' else: name = name.decode('utf-8', 'ignore') nodetype[node] = name cur.next() # output node classification if args.verbose: for k, v in sorted(nodetype.items(), key=lambda kv:kv[0]): print("%08x: %s" % (k, v)) # summarize tags per nodetype startkey = id0.makekey(id0.nodebase) endkey = id0.makekey(id0.maxnode+1) cur = id0.btree.find('ge', startkey) while cur.getkey() < endkey: k = id0.decodekey(cur.getkey()) node = k[1] nt = nodetype[node] addstat(nt, k) cur.next() # output tag statistics for nt, ntstats in sorted(tagstats.items(), key=lambda kv:kv[0]): print("====== %s =====" % nt) for k, v in ntstats.items(): if len(k)==1: print("%5d - %s" % (v, k[0])) elif len(k)==2 and type(k[1])==type(1): print("%5d - %s %8x" % (v, k[0], k[1])) elif type(k[1])==type(1): print("%5d - %s %8x %s" % (v, k[0], k[1], k[2:])) else: print("%5d - %s %s %s" % (v, k[0], k[1], k[2:])) def processid0(args, id0): if args.info: dumpinfo(id0) if args.pagedump: id0.btree.pagedump() if args.query: for query in args.query: id0query(args, id0, query) elif args.id0: id0.btree.dump() elif args.inc: c = id0.btree.find('ge', b'') enumeratecursor(args, c, False, lambda c:printent(args, id0, c)) elif args.dec: c = id0.btree.find('le', b'\x80') enumeratecursor(args, c, False, lambda c:printent(args, id0, c)) def hexascdumprange(id1, a, b): line = asc = "" for ea in range(a, b): if len(line)==0: line = "%08x:" % ea byte = id1.getFlags(ea)&0xFF line += " %02x" % byte asc += chr(byte) if 32 1: print("magic=%s, filever=%d" % (idb.magic, idb.fileversion)) for i in range(6): comp, ofs, size, checksum = idb.getsectioninfo(i) if ofs: part = idb.getpart(i) print("%2d: %02x, %08x %8x [%08x]: %s" % (i, 
                comp, ofs, size, checksum, hexdump(part.read(256))))

    nam = idb.getsection(idblib.NAMFile)
    id0 = idb.getsection(idblib.ID0File)
    id1 = idb.getsection(idblib.ID1File)
    processid0(args, id0)
    processid1(args, id1)
    processid2(args, idb.getsection(idblib.ID2File))
    processnam(args, nam)
    processtil(args, idb.getsection(idblib.TILFile))
    processseg(args, idb.getsection(idblib.SEGFile))

    if args.names:
        dumpnames(args, id0, nam)

    if args.classify:
        classifynodes(args, id0)

    if args.scripts:
        enumlist(id0, '$ scriptsnippets', dumpscript)

    if args.structs:
        enumlist(id0, '$ structs', dumpstruct)

    if args.enums:
        enumlist(id0, '$ enums', dumpenum)

    if args.funcdirs:
        listfuncdirs(id0)

    if args.imports:
        enumlist(id0, '$ imports', dumpimport)

    if args.segs:
        listsegments(id0)


def processfile(args, filetypehint, fh):
    class DummyIDB:
        def __init__(idb, args):
            if args.i64:
                idb.magic = 'IDA2'
            elif args.i32:
                idb.magic = 'IDA1'
            else:
                idb.magic = None

    try:
        magic = fh.read(64)
        fh.seek(-64, 1)
        if magic.startswith(b"Va") or magic.startswith(b"VA"):
            idb = DummyIDB(args)
            if filetypehint == 'id1':
                processid1(args, idblib.ID1File(idb, fh))
            elif filetypehint == 'nam':
                processnam(args, idblib.NAMFile(idb, fh))
            elif filetypehint == 'seg':
                processseg(args, idblib.SEGFile(idb, fh))
            else:
                print("unknown VA type file: %s" % hexdump(magic))
        elif magic.startswith(b"IDAS"):
            processid2(args, idblib.ID2File(DummyIDB(args), fh))
        elif magic.startswith(b"IDATIL"):
            processtil(args, idblib.TILFile(DummyIDB(args), fh))
        elif magic.startswith(b"IDA"):
            processidb(args, idblib.IDBFile(fh))
        elif magic.find(b'B-tree v') > 0:
            processid0(args, idblib.ID0File(DummyIDB(args), fh))

    except Exception as e:
        print("ERROR %s" % e)
        if args.debug:
            raise


def recover_database(args, basepath, dbfiles):
    processidb(args, idblib.RecoverIDBFile(args, basepath, dbfiles))


def DirEnumerator(args, path):
    """
    Enumerate all files / links in a directory,
    optionally recursing into subdirectories,
    or ignoring links.
    """
    for d in os.scandir(path):
        try:
            if d.name == '.' or d.name == '..':
                pass
            elif d.is_symlink() and args.skiplinks:
                pass
            elif d.is_file():
                yield d.path
            elif d.is_dir() and args.recurse:
                for f in DirEnumerator(args, d.path):
                    yield f
        except Exception as e:
            print("EXCEPTION %s accessing %s/%s" % (e, path, d.name))


def EnumeratePaths(args, paths):
    """
    Enumerate all paths, files from the commandline
    optionally recursing into subdirectories.
    """
    for fn in paths:
        try:
            # 3 - for ftp://, 4 for http://, 5 for https://
            if fn.find("://") in (3, 4, 5):
                yield fn
            if os.path.islink(fn) and args.skiplinks:
                pass
            elif os.path.isdir(fn) and args.recurse:
                for f in DirEnumerator(args, fn):
                    yield f
            elif os.path.isfile(fn):
                yield fn
        except Exception as e:
            print("EXCEPTION %s accessing %s" % (e, fn))


def filetype_from_name(fn):
    i = max(fn.rfind('.'), fn.rfind('/'))
    return fn[i + 1:].lower()


def isv2name(name):
    return name.lower() in ('$segregs.ida', '$segs.ida', '0.ida', '1.ida', 'ida.idl', 'names.ida')


def isv3ext(ext):
    return ext.lower() in ('.id0', '.id1', '.id2', '.nam', '.til')


def xlatv2name(name):
    oldnames = {
        '$segregs.ida': 'reg',
        '$segs.ida': 'seg',
        '0.ida': 'id0',
        '1.ida': 'id1',
        'ida.idl': 'idl',
        'names.ida': 'nam',
    }

    return oldnames.get(name.lower())


def main():
    parser = argparse.ArgumentParser(description='idbtool - print info from hex-rays IDA .idb and .i64 files',
                                     formatter_class=argparse.RawDescriptionHelpFormatter,
                                     epilog="""
idbtool can process complete .idb and .i64 files, but also naked .id0, .id1, .nam, .til files.
All versions since IDA v2.0 are supported.

Queries start with an optional operator: <,<=,>,>=,==.
Followed by either a name or address or nodeid.
Addresses are specified as a sequence of hexadecimal characters.
Nodeid's may be specified either as the full node id, starting with ff00,
or starting with a '_'.
Names are anything which can be found under the name tree in the database.
After the name/addr/node there is optionally a slash, followed by a node tag,
and another slash, followed by an index or hash string.

Multiple queries can be specified, terminated by another option, or `--`.
Add `-v` for pretty printed keys and values.

Examples:

idbtool -v --query "$ user1;S;0" -- x.idb
idbtool -v --limit 4 --query ">#0xa" -- x.idb
idbtool -v --limit 5 --query ">Root Node;S;0" -- x.idb
idbtool -v --limit 10 --query ">Root Node;S" -- x.idb
idbtool -v --query ".0xff000001;N" -- x.idb
""")
    parser.add_argument('--verbose', '-v', action='count', default=0)
    parser.add_argument('--recurse', '-r', action='store_true', help='recurse into directories')
    parser.add_argument('--skiplinks', '-L', action='store_true', help='skip symbolic links')
    parser.add_argument('--filetype', '-t', type=str, help='specify filetype when loading `naked` id1,nam or seg files')
    parser.add_argument('--i64', '-i64', action='store_true', help='specify that `naked` file is from a 64 bit database')
    parser.add_argument('--i32', '-i32', action='store_true', help='specify that `naked` file is from a 32 bit database')

    parser.add_argument('--names', '-n', action='store_true', help='print names')
    parser.add_argument('--scripts', '-s', action='store_true', help='print scripts')
    parser.add_argument('--structs', '-u', action='store_true', help='print structs')
    # parser.add_argument('--comments', '-c', action='store_true', help='print comments')
    parser.add_argument('--enums', '-e', action='store_true', help='print enums and bitfields')
    parser.add_argument('--imports', action='store_true', help='print imports')
    parser.add_argument('--segs', action='store_true', help='print segments')
    parser.add_argument('--funcdirs', action='store_true', help='print function dirs (folders)')
    parser.add_argument('--info', '-i', action='store_true', help='database info')
    parser.add_argument('--inc', action='store_true', help='dump id0 records by cursor increment')
    parser.add_argument('--dec', action='store_true',
                        help='dump id0 records by cursor decrement')
    parser.add_argument('--id0', "-id0", action='store_true', help='dump id0 records, by walking the page tree')
    parser.add_argument('--id1', "-id1", action='store_true', help='dump id1 records')
    parser.add_argument('--dump', type=str, help='hexdump id1 bytes', metavar='FROM-UNTIL')
    parser.add_argument('--dumpraw', type=str, help='output id1 bytes', metavar='FROM-UNTIL')
    parser.add_argument('--pagedump', "-d", action='store_true', help='dump all btree pages, including any that might have become inaccessible due to data corruption.')
    parser.add_argument('--classify', action='store_true', help='Classify nodes found in the database.')

    parser.add_argument('--query', "-q", type=str, nargs='*', help='search the id0 file for a specific record.')
    parser.add_argument('--limit', '-m', type=int, help='Max nr of records to return for a query.')

    parser.add_argument('--recover', action='store_true', help='recover idb from unpacked files, of v2 database')
    parser.add_argument('--debug', action='store_true')

    parser.add_argument('FILES', type=str, nargs='*', help='Files')

    args = parser.parse_args()

    if args.FILES:
        dbs = dict()

        for fn in EnumeratePaths(args, args.FILES):
            basepath, filename = os.path.split(fn)
            if isv2name(filename):
                d = dbs.setdefault(basepath, dict())
                d[xlatv2name(filename)] = fn
                print("%s -> %s : %s" % (xlatv2name(filename), basepath, filename))
            else:
                basepath, ext = os.path.splitext(fn)
                if isv3ext(ext):
                    d = dbs.setdefault(basepath, dict())
                    d[ext.lower()] = fn

            if not args.dumpraw:
                print("\n==> " + fn + " <==\n")

            try:
                filetype = args.filetype or filetype_from_name(fn)
                with open(fn, "rb") as fh:
                    processfile(args, filetype, fh)
            except Exception as e:
                print("ERROR: %s" % e)
                if args.debug:
                    raise

        if args.recover:
            for basepath, dbfiles in dbs.items():
                if len(dbfiles) > 1:
                    try:
                        print("\n==> " + basepath + " <==\n")
                        recover_database(args, basepath, dbfiles)
                    except Exception as e:
                        print("ERROR: %s" % e)
    else:
        print("==> STDIN <==")
        processfile(args, args.filetype, sys.stdin.buffer)


if __name__ == '__main__':
    main()


================================================
FILE: setup.cfg
================================================
[flake8]
ignore = E402,E501,E731


================================================
FILE: test_idblib.py
================================================
import unittest
from idblib import FileSection, binary_search, makeStringIO


class TestFileSection(unittest.TestCase):
    """ unittest for FileSection object """
    def test_file(self):
        s = makeStringIO(b"0123456789abcdef")
        fh = FileSection(s, 3, 11)
        self.assertEqual(fh.read(3), b"345")
        self.assertEqual(fh.read(8), b"6789a")
        self.assertEqual(fh.read(8), b"")

        fh.seek(-1, 2)
        self.assertEqual(fh.read(8), b"a")

        fh.seek(3)
        self.assertEqual(fh.read(2), b"67")
        fh.seek(-2, 1)
        self.assertEqual(fh.read(2), b"67")
        fh.seek(2, 1)
        self.assertEqual(fh.read(2), b"a")

        fh.seek(8)
        self.assertEqual(fh.read(1), b"")
        with self.assertRaises(Exception):
            fh.seek(9)


class TestBinarySearch(unittest.TestCase):
    """ unittests for binary_search """
    class Object:
        def __init__(self, num):
            self.key = num

        def __repr__(self):
            # the value is stored in self.key; there is no self.num attribute
            return "o(%d)" % self.key

    def test_bs(self):
        obj = self.Object
        lst = [obj(_) for _ in (2, 3, 5, 6)]

        self.assertEqual(binary_search(lst, 1), -1)
        self.assertEqual(binary_search(lst, 2), 0)
        self.assertEqual(binary_search(lst, 3), 1)
        self.assertEqual(binary_search(lst, 4), 1)
        self.assertEqual(binary_search(lst, 5), 2)
        self.assertEqual(binary_search(lst, 6), 3)
        self.assertEqual(binary_search(lst, 7), 3)

    def test_emptylist(self):
        obj = self.Object
        lst = []

        self.assertEqual(binary_search(lst, 1), -1)

    def test_oneelem(self):
        obj = self.Object
        lst = [obj(1)]

        self.assertEqual(binary_search(lst, 0), -1)
        self.assertEqual(binary_search(lst, 1), 0)
        self.assertEqual(binary_search(lst, 2), 0)

    def test_twoelem(self):
        obj = self.Object
        lst = [obj(1), obj(3)]

        self.assertEqual(binary_search(lst, 0), -1)
        self.assertEqual(binary_search(lst, 1), 0)
        self.assertEqual(binary_search(lst, 2), 0)
        self.assertEqual(binary_search(lst, 3), 1)
        self.assertEqual(binary_search(lst, 4), 1)

    def test_listsize(self):
        obj = self.Object
        for l in range(3, 32):
            lst = [obj(_ + 1) for _ in range(l)]
            lst = lst[:1] + lst[2:]
            self.assertEqual(binary_search(lst, 0), -1)
            self.assertEqual(binary_search(lst, 1), 0)
            self.assertEqual(binary_search(lst, 2), 0)
            self.assertEqual(binary_search(lst, 3), 1)
            self.assertEqual(binary_search(lst, l - 1), l - 3)
            self.assertEqual(binary_search(lst, l), l - 2)
            self.assertEqual(binary_search(lst, l + 1), l - 2)
            self.assertEqual(binary_search(lst, l + 2), l - 2)


================================================
FILE: tree-walking.py
================================================
"""
Copyright (c) 2016 Willem Hengeveld

Experiment in btree walking


                   *-------->[00]
         *------>[02]---+    [01]
root ->[08]---+  [05]-+ |
       [17]-+ |       | +--->[03]
            | |       |      [04]
            | |       |
            | |       +----->[06]
            | |              [07]
            | |
            | |    *-------->[09]
            | +->[11]---+    [10]
            |    [14]-+ |
            |         | +--->[12]
            |         |      [13]
            |         |
            |         +----->[15]
            |                [16]
            |
            |      *-------->[18]
            +--->[20]---+    [19]
                 [23]-+ |
                      | +--->[21]
                      |      [22]
                      |
                      +----->[24]
                             [25]


decrement from 08 : ix-- -> getpage, ix=len-1 -> getpage -> ix=len-1
decrement from 17 : ix-- -> getpage, ix=len-1 -> getpage -> ix=len-1
decrement from 02 : ix-- -> getpage, ix=len-1
decrement from 05 : ix-- -> getpage, ix=len-1
decrement from 01 : ix-- -> ix>=0 -> use key at ix
decrement from 03 : ix-- -> <0 -> pop -> ix>=0 -> use key at ix
decrement from 09 : ix-- -> <0 -> pop -> ix<0 -> pop -> ix>=0 -> use key at ix

increment from 09 : ix++
increment from 10 : ix++ -> ix==len(index) -> pop: ix==-1 -> ix++ -> ix==0 -> use
increment from 11 : recurse, ix=0 -> use
increment from 08 : recurse, ix=-1 -> recurse, ix=0 -> use
increment from 07 : ix++ -> ix==len(index) -> pop, ix++ -> ix==len -> pop -> ix++ -> ix==0 -> use

"""
from __future__ import division, print_function, absolute_import, unicode_literals

# shape of the tree
# a <2,2> tree is basically like
# the tree pictured in the ascii art above.
TREEDEPTH = 2
NODEWIDTH = 2


def binary_search(a, k):
    # c++: a.upperbound(k)--
    first, last = 0, len(a)
    while first < last:
        mid = (first + last) >> 1
        if k < a[mid].key:
            last = mid
        else:
            first = mid + 1
    return first - 1


class Entry(object):
    """ a key/value entry from a b-tree page """
    def __init__(self, key, val):
        self.key = key
        self.val = val

    def __repr__(self):
        return "%s=%d" % (self.key, self.val)


class BasePage(object):
    """ BasePage has methods common to both leaf and index pages """
    def __init__(self, kv):
        self.index = []
        for k, v in kv:
            self.index.append(Entry(k, v))

    def find(self, key):
        i = binary_search(self.index, key)
        if i < 0:
            if self.isindex():
                return ('recurse', -1)
            return ('gt', 0)
        if self.index[i].key == key:
            return ('eq', i)
        if self.isindex():
            return ('recurse', i)
        return ('lt', i)

    def getkey(self, ix):
        return self.index[ix].key

    def getval(self, ix):
        return self.index[ix].val

    def isleaf(self):
        return self.preceeding is None

    def isindex(self):
        return self.preceeding is not None

    def __repr__(self):
        return ("leaf" if self.isleaf() else ("index<%d>" % self.preceeding)) + repr(self.index)


class LeafPage(BasePage):
    """ a leaf page in the b-tree """
    def __init__(self, kv):
        # name the class explicitly: super(self.__class__, ...) recurses forever when subclassed
        super(LeafPage, self).__init__(kv)
        self.preceeding = None


class IndexPage(BasePage):
    """
    An index page in the b-tree.
    This page has a preceeding page plus several key+subpage pairs.
    For each key+subpage: all keys in the subpage are greater than the key
    """
    def __init__(self, preceeding, kv):
        super(IndexPage, self).__init__(kv)
        self.preceeding = preceeding

    def getpage(self, ix):
        return self.preceeding if ix < 0 else self.index[ix].val


class Cursor:
    """
    A Cursor object represents a position in the b-tree.

    It has methods for moving to the next or previous item.
    And methods for retrieving the key and value of the current position
    """
    def __init__(self, db, stack):
        self.db = db
        self.stack = stack

    def next(self):
        page, ix = self.stack.pop()
        if page.isleaf():
            # from leaf move towards root
            ix += 1
            while self.stack and ix == len(page.index):
                page, ix = self.stack.pop()
                ix += 1
            if ix < len(page.index):
                self.stack.append((page, ix))
        else:
            # from node move towards leaf
            self.stack.append((page, ix))
            page = self.db.readpage(page.getpage(ix))
            while page.isindex():
                ix = -1
                self.stack.append((page, ix))
                page = self.db.readpage(page.getpage(ix))
            ix = 0
            self.stack.append((page, ix))

        self.verify()

    def prev(self):
        page, ix = self.stack.pop()
        ix -= 1
        if page.isleaf():
            # move towards root, until non 'prec' item found
            while self.stack and ix < 0:
                page, ix = self.stack.pop()
            if ix >= 0:
                self.stack.append((page, ix))
        else:
            # move towards leaf
            self.stack.append((page, ix))
            while page.isindex():
                page = self.db.readpage(page.getpage(ix))
                ix = len(page.index) - 1
                self.stack.append((page, ix))

        self.verify()

    def verify(self):
        """ verify cursor state consistency """
        if len(self.stack) == 3:
            if not self.stack[-1][0].isleaf():
                print("WARN no leaf")
        elif len(self.stack) > 3:
            print("WARN: stack too large")
        if len(self.stack) >= 2:
            if self.stack[0][0] == self.stack[1][0]:
                print("WARN: identical index pages on stack")
            if not self.stack[0][0].isindex():
                print("WARN: expected root=index")
            if not self.stack[1][0].isindex():
                print("WARN: expected 2nd=index")

    def eof(self):
        return len(self.stack) == 0

    def getkey(self):
        page, ix = self.stack[-1]
        return page.getkey(ix)

    def getval(self):
        page, ix = self.stack[-1]
        return page.getval(ix)

    def __repr__(self):
        return "cursor:" + repr(self.stack)


class Btree:
    """
    A B-tree implementation
    """
    def __init__(self):
        self.pages = []
        self.generate(TREEDEPTH, NODEWIDTH)

    def manual(self):
        """ manually construct the ascii art tree """
        for i in range(9):
            self.pages.append(LeafPage((("%02d" % (3 * i), 0), ("%02d" % (3 * i + 1), 0))))
        for i in range(3):
            self.pages.append(IndexPage(3 * i, (("%02d" % (9 * i + 2), 3 * i + 1), ("%02d" % (9 * i + 5), 3 * i + 2))))
        self.pages.append(IndexPage(9, (("08", 10), ("17", 11))))
        self.rootindex = len(self.pages) - 1

    def generate(self, depth, nodesize):
        """ automatically generate the tree in the ascii art above """
        def namegen():
            i = 0
            while True:
                yield "%03d" % i
                i += 1

        self.rootindex = self.construct(namegen(), depth, nodesize)
        print("%d pages" % (len(self.pages)))

    def construct(self, namegen, depth, nodesize):
        if depth:
            return self.createindex(namegen, depth, nodesize)
        else:
            return self.createleaf(namegen, nodesize)

    def createindex(self, namegen, depth, nodesize):
        page = IndexPage(self.construct(namegen, depth - 1, nodesize),
                         [(next(namegen), self.construct(namegen, depth - 1, nodesize)) for _ in range(nodesize)])
        self.pages.append(page)
        return len(self.pages) - 1

    def createleaf(self, namegen, nodesize):
        page = LeafPage([(next(namegen), 0) for _ in range(nodesize)])
        self.pages.append(page)
        return len(self.pages) - 1

    def readpage(self, pn):
        return self.pages[pn]

    def find(self, key):
        """
        Find a node in the tree, returns the cursor plus the relation to the wanted key:
        'eq' for equal, 'lt' when the found key is less than the wanted key,
        or 'gt' when the found key is greater than the wanted key.
        """
        page = self.readpage(self.rootindex)
        stack = []
        while True:
            act, ix = page.find(key)
            stack.append((page, ix))
            if act != 'recurse':
                break
            page = self.readpage(page.getpage(ix))
        return act, Cursor(self, stack)

    def dumptree(self, pn, indent=0):
        """ dump all nodes of the current b-tree """
        page = self.readpage(pn)
        print(" " * indent, page)
        if page.isindex():
            print(" " * indent, end="")
            self.dumptree(page.preceeding, indent + 1)
            for p in range(len(page.index)):
                print(" " * indent, end="")
                self.dumptree(page.getpage(p), indent + 1)


db = Btree()

print("<<")
db.dumptree(db.rootindex)
print(">>")

for i in range(NODEWIDTH * len(db.pages)):
    print("--------- %03d" % i)
    act, cursor = db.find("%03d" % i)
    print("found", act, cursor.getkey(), cursor)
    cursor.prev()
    if not cursor.eof():
        print("prev:", "..", cursor.getkey(), cursor)
    else:
        print("prev: EOF", cursor)

for i in range(NODEWIDTH * len(db.pages)):
    print("--------- %03d" % i)
    act, cursor = db.find("%03d" % i)
    print("found", act, cursor.getkey(), cursor)
    cursor.next()
    if not cursor.eof():
        print("next:", "..", cursor.getkey(), cursor)
    else:
        print("next: EOF", cursor)

for k in ('', '0', '1', '2', '3', '000', '010', '020', '100'):
    print("--------- %s" % k)
    act, cursor = db.find(k)
    print(cursor)
    print(act, cursor.getkey(), end=" next=")
    cursor.next()
    if cursor.eof():
        print("EOF")
    else:
        print(cursor.getkey())

act, cursor = db.find("000")
print("get000", end=" ")
for i in range(NODEWIDTH * len(db.pages)):
    cursor.next()
    if cursor.eof():
        print("EOF")
    else:
        print("-> %s" % cursor.getkey(), end=" ")
print()

act, cursor = db.find("025")
print("get025", end=" ")
for i in range(NODEWIDTH * len(db.pages)):
    cursor.prev()
    if cursor.eof():
        print("EOF")
    else:
        print("-> %s" % cursor.getkey(), end=" ")
print()


================================================
FILE: tstbs.py
================================================
def binary_search(a, k):
    # c++: a.upperbound(k)--
    first, last = 0, len(a)
    while first < last:
        mid = (first + last) >> 1
        if k < a[mid]:
            last = mid
        else:
            first = mid+1
    return first-1

for x in range(8):
    print(x, binary_search([2,3,5,6], x))
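
# Aside, not part of the original script: the "c++: a.upperbound(k)--" comment
# suggests the loop returns the index of the last element <= k, or -1 if there
# is none. A minimal sketch of the same lookup using the standard-library
# `bisect` module follows. The name `binary_search_bisect` is hypothetical and
# only used for illustration. It assumes plain comparable list elements, as in
# tstbs.py; the `.key`-based variants used elsewhere in this repository would
# need a key function (the `key=` parameter of `bisect` only exists from
# Python 3.10 on).

```python
import bisect


def binary_search_bisect(a, k):
    # bisect_right gives the insertion point after any entries equal to k,
    # so subtracting one yields the last element <= k, or -1 if none exists.
    return bisect.bisect_right(a, k) - 1


def binary_search(a, k):
    # Hand-written loop as in tstbs.py, kept here only for comparison.
    first, last = 0, len(a)
    while first < last:
        mid = (first + last) >> 1
        if k < a[mid]:
            last = mid
        else:
            first = mid + 1
    return first - 1


results = [binary_search_bisect([2, 3, 5, 6], x) for x in range(8)]
print(results)  # [-1, -1, 0, 1, 1, 2, 3, 3]
```

# Both functions agree on the test list [2,3,5,6] for every x in range(8).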