[
  {
    "path": "LICENSE",
    "content": "MIT License\r\n\r\nCopyright (c) 2022 0xD0GF00D\r\n\r\nPermission is hereby granted, free of charge, to any person obtaining a copy\r\nof this software and associated documentation files (the \"Software\"), to deal\r\nin the Software without restriction, including without limitation the rights\r\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\r\ncopies of the Software, and to permit persons to whom the Software is\r\nfurnished to do so, subject to the following conditions:\r\n\r\nThe above copyright notice and this permission notice shall be included in all\r\ncopies or substantial portions of the Software.\r\n\r\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\r\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\r\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\r\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\r\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\r\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\r\nSOFTWARE.\r\n"
  },
  {
    "path": "Makefile",
    "content": "# Set up compiler, just compile a .ptx for the lowest arch possible. It does not matter as we are interested in SASS.\r\nCC=cc\r\nNVCC=/usr/local/cuda/bin/nvcc\r\nNVDISASM=/usr/local/cuda/bin/nvdisasm\r\nPYTHON=python3\r\n\r\narchitectures=sm_50 sm_52 sm_53 sm_60 sm_61 sm_62 sm_70 sm_72 sm_75 sm_80 sm_86 sm_90\r\n\r\ntargets = $(architectures:=_instructions.txt) $(architectures:=_latencies.txt)\r\n\r\nall: $(targets)\r\n\r\nclean:\r\n\t-rm -f $(targets)\r\n\r\n# Generate the SASS versions.\r\n%.cubin: example.cu\r\n\t$(NVCC) -o $@ -arch=$(basename $@) -cubin $<\r\n\r\n%.so: %.c\r\n\t$(CC) -fPIC -shared -o $@ $< -ldl\r\n\r\n# Not sure if the OMP things are needed, same with the flushing of stdout. We pipe it through strings to get only readable parts.\r\n%_intercept.txt: %.cubin intercept.so\r\n\tOMP_NUM_THREADS=1 OMP_THREAD_LIMIT=1 LD_PRELOAD=./intercept.so $(NVDISASM) $< | strings -n 1 > $@\r\n\r\n%_instructions.txt %_latencies.txt: %_intercept.txt\r\n\t$(PYTHON) funnel.py $<\r\n"
  },
  {
    "path": "NOTES.md",
    "content": "\r\n# Interpretations\r\nThe resulting data is clearly meant for some program to parse, so it is not as immediately understandable as I would like.\r\nDepending on how much time I have, I'll work on breaking down the data into a more understandable format.\r\n\r\n\r\nSince SASS is undocumented, we can only really do interpretations of their function, unless we validate for every case (a daunting task).<br>\r\nThe grave accent and <code>@</code> symbols appear to be operators to do with table lookups.\r\n\r\n\r\n## Instructions file\r\nWhenever, for example `Predicate(PT):Pg` appears, it means initialization of type Predicate with default value PT (true), stored in variable Pg.\r\nThen `/` means it is optional, maybe also that it must be displayed with a dot, e.g. `FFMA.SAT`. This is often coupled with quotation marks, meaning that it must be displayed as a string, even when the table value is numeric, e.g. `SAT \"nosat\"=0 , \"SAT\"=1;` for `/SAT(\"nosat\"):sat`. The reason the table is numeric is probably because that gives a direct mapping from the bits to the string in the table.\r\n\r\nAnother example is `BITSET(6/0x0000):`, meaning a field of 6 bits (probably displayed as a bitset).\r\nSimilarly `UImm(3/0x7):src_rel_sb` probably means unsigned integer immediate value of 3 bits, with default value 7 (meaning no barrier).\r\n\r\nVariable latency operations appear to use the `VarLatOperandEnc(src_rel_sb)` table value as well as a `VIRTUAL QUEUE`. This is very important information, as the queues tells us which operations can be done without affecting each other (different queues).\r\nFor fixed-latency operations there is no variable `src_rel_sb`, and the encoding for these bits is specified to have the value `*7`, probably literal 7. I noticed there is an exception of EXIT, where it just says `7`.\r\n\r\n\r\nThe binary code for the instruction is given in the OPCODES field. 
The leading bit is not part of the code; it is always 1, probably just to encode the length of the opcode. The encoding is given in the ENCODING field. These contain references to the bit field tables. An instruction is 128 bits, but in the bit field table, the first 64 bits are written last -- which makes sense since then the instruction field is contiguous, and the first bits give the control code.\r\n\r\nThe encoded opex field appears to give the stall cycles, as well as reuse flags, and something called `batch` (between 0 and 6, probably to do with variable latency instructions). A stall of up to 15 cycles is straightforward. Stalls of up to 27 cycles appear to be supported as well, but using them disables the batch functionality (since only a limited number of bits is available). Valid combinations are given in the opex tables.\r\n\r\n## Latencies file\r\nThe latencies between operations are given in the different tables. Each table is organized with respect to input and output resources, called connectors.\r\nThe latency listed for variable-latency operations does not necessarily reflect the actual latency; it is possibly only a lower bound.\r\n\r\n\r\nI am not sure how to interpret latencies above 15, since 15 is the highest possible value in the latencies field of the instructions.\r\nThe values that are given seem to be correct, but they should be interpreted carefully. <br>\r\n\r\nFor example for HMMA.16816.F16 (half-precision tensor-core accelerated matrix multiply): If these are chained with the connection of the input register being the same as the output register, the stated latency is 28. By micro-benchmarking, I found the latency to be 24 cycles. Looking at the SASS, it appears that NOP operations are inserted between them. NOP is classified as a MIO_OP, essentially because it appears NOP is used specifically to resolve some data hazards for variable-latency instructions. The correct table value is then (HMMA_OP, MIO_FAST_OPS) = 22! 
And our observed value is 24 because one cycle is spent executing NOP, and one cycle is lost somewhere else. I wonder where?\r\n\r\nNote that the measured time of HMMA.16816.F32 was 33-34 cycles, which is not in the latencies table. Maybe because it is above 2*15, or maybe because it is greater than 27, which is the largest value of the usched table in the instructions file -- 28 is the greatest latency value in the latencies file. The latency for DMMA.884 was measured to be 38-39 cycles (and as high as 177 with multiple threads), which is quite far from (DMMA_OP, DMMA_OP) = 25 or (DMMA_OP, MIO_FAST_OPS) = 23.\r\n\r\nAnother example is the variable-latency MUFU operation. It is also a MIO_FAST_OP, but unfortunately the latency for chaining these is not given. In terms of scheduling this is not an issue, as there are no hazards (as long as appropriate variable-latency barriers are set). I measured MUFU.SQRT to be 17 cycles (sm86), which is also the time of MUFU.RSQ (reciprocal square root), MUFU.LG2 (base-2 logarithm) and MUFU.EX2 (base-2 exponential). The first case is most interesting, since one might expect MUFU.SQRT to have the latency of MUFU.RSQ + MUFU.RCP (reciprocal).\r\n\r\n\r\nFindings:\r\n1. Truly fixed latencies with times over 15 cycles appear to be implemented by inserting NOP instructions. These NOPs probably either use barriers, or simply use fixed stall cycles (enough to cover the remaining cycles over 15).\r\n2. Truly variable latencies are not specified in the latencies file. They must be resolved with barriers and the like. This is probably because they involve shared resources.\r\n3. 
The time before a predicate can be used appears to be 13 cycles (sm86) in many cases, but it seems it can be as low as 5 cycles in the common case of 32-bit float operations (see table for specific connectors).\r\n\r\n## Redundancies\r\nThe unique files are found by `md5sum * | grep \".txt\" | sort | uniq -w33 | awk '{print $2}' | sort`\r\n<pre>\r\nsm_35_instructions.txt\r\nsm_35_latencies.txt\r\nsm_37_instructions.txt\r\nsm_50_instructions.txt\r\nsm_50_latencies.txt\r\nsm_52_instructions.txt\r\nsm_52_latencies.txt\r\nsm_60_instructions.txt\r\nsm_60_latencies.txt\r\nsm_61_instructions.txt\r\nsm_61_latencies.txt\r\nsm_70_instructions.txt\r\nsm_70_latencies.txt\r\nsm_72_instructions.txt\r\nsm_72_latencies.txt\r\nsm_75_instructions.txt\r\nsm_75_latencies.txt\r\nsm_80_instructions.txt\r\nsm_80_latencies.txt\r\nsm_86_instructions.txt\r\nsm_86_latencies.txt\r\n</pre>\r\nThis shows that some parts were unchanged between different compute capabilities. These files total 32.6 MB.<br>\r\nThe size of the .data and .rodata segments in nvdisasm is 30.9 + 0.17 = 31.07 MB. So some decompression is definitely done when running the compiler, which explains why the text files are not easily discoverable.\r\n"
  },
  {
    "path": "OTHER.md",
    "content": "## Essential resources\r\n1. [MaxAs](https://github.com/NervanaSystems/maxas) is a reverse-engineered assembler for the Maxwell architecture (Compute Capability 5.2), by [Scott Gray](https://forums.developer.nvidia.com/u/scottgray/summary). The story is he needed a fast GEMM implementation, but the public CUDA tools do not allow for precise editing of SASS assembly. 80% of the theoretical throughput could be attained in PTX, and only with hand-written assembly could ~98% be reached.\r\n2. [Dissecting the NVidia Turing T4 GPU via Microbenchmarking](https://arxiv.org/abs/1903.07486). This paper uses micro-benchmarks to estimate the undocumented latencies of instructions for Compute Capabilities 3.0-7.5. It also provides a reverse-engineered description of the instruction encoding for these generations.\r\n3. [RTX ON: The NVIDIA Turing GPU Architecture](https://old.hotchips.org/hc31/HC31_2.12_NVIDIA_final.pdf) [[video](https://www.youtube.com/watch?v=IjxpMZUqu6c), [paper](https://ieeexplore.ieee.org/document/8981896)] slides for a talk describing the improved throughput on Turing. Shows how dispatch happens. Appears to confirm the uniform datapath consists of a separate unit and register file. But it is still unclear to me what the interaction between MUFU and MIO is (why MUFU use can cause [MIO stall](https://docs.nvidia.com/nsight-compute/ProfilingGuide/#statistical-sampler)).\r\n4. [L1TEX usage and L2 below](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#memory-tables-l1) from the profiling guide. 
Shows more in-depth how the L1Tex unit works on GA100 (Compute Capability 8.0).\r\n\r\nSee the diagram in (2) for memory hierarchy and the diagram in (3) for throughput, of Compute Capability 7.5.\r\n\r\n\r\n## Profiling\r\nMetrics with \"sass_\" in the name work by [patching the kernel](https://forums.developer.nvidia.com/t/difference-between-thread-inst-executed-metrics/217587), and therefore come with a greater overhead compared to hardware counters.\r\n\r\n## Compiler arguments\r\nSome apparently undocumented arguments for the compiler ptxas.\r\n<pre>\r\n--compiler-stats (-compilerStats) <t/m/p>\r\n    Prints out compiler statistics.\r\n    time/t       : Prints compilation time.\r\n    memory/m     : Prints peak memory usage.\r\n    phase-wise/p : Prints the above data for various compiler phases.\r\n\r\n--no-fastreg\r\n    Disable fast register allocation.\r\nNo idea if this is good or bad in terms of performance. Very curious to try it out.\r\n\r\n--opt-pointers (-Op)\r\n    Optimize 64-bit pointers by truncating them to 32-bit\r\nAnother option that could impact performance.\r\n\r\n--tool-name <string>\r\n    Change tool name to specified string.\r\nAppears to just change the string \"ptxas\" to something else in the help page.\r\n\r\n--fastimul\r\n    Enable 24 bit integer multiplication\r\nI think this is only relevant for older GPU generations. For some reason 24-bit integer mult. was faster then.\r\n\r\n--fast-compile (-fc)\r\n    EXPERIMENTAL FEATURE: Enable optimization strategies that improve compilation time while reducing runtime performance\r\n\r\n--cuda-api-version <major>.<minor>\r\n    CUDA API version to use to for compilation\r\n\r\n--sw2614554\r\n--sw1729687\r\n--sw200428197\r\n--sw200387803\r\nThese just say \"Enable sw_______\". 
No idea what they do.\r\n\r\n--key (-k)\r\n    Hash value representing the device code from which the binaries were compiled\r\n--okey (-ok)\r\n    Deobfuscation key for specified ptx input\r\n--ptx-length (-ptxlen)\r\n    Length in bytes of obfuscated ptx string\r\n\r\nSome arguments that appear in the data, but do not seem to work, are \"-dump-perf-stats\", \"-forcetext\"\r\n</pre>\r\n\r\n## Half-precision MOV\r\nSome interesting things happen when you compile PTX for SM 86, and then use it on an SM 86 device. Since the PTX is a newer version, newer features are used, which are not supported on the default, SM 52.\r\n<pre>\r\nThe line of code (variable names simplified):\r\nfloat sign_value = x > y ? 1 : -1;\r\n\r\n\r\nBecomes partly the SASS:\r\n97\t      HFMA2.MMA R16, -RZ, RZ, 1.875, 0\r\n125\t      FSEL R6, R16, -1, P1 \r\n</pre>\r\nThe utility of the RZ register is to encode literal float zero using only 8 bits instead of e.g. 32 for a float, or 24 for an immediate float value.\r\n\r\nNot sure what the .MMA part is, but HFMA2 is half2 vector fused multiply-add.<br>\r\nIn half precision, 1.875 = 0011111110000000, and 0 is just 16 zeros. The result is that adding the half2 value {1.875, 0} to {0, 0} * {0, 0} (-RZ * RZ) gives [the float representation of 1](https://evanw.github.io/float-toy/)! This float \"1\" then gets stored in R16.\r\n\r\nNow there is the MOV instruction, which could do the same. I have no idea which unit does MOV, but it is clearly not the half-precision unit; therefore using this otherwise unused unit gives a higher throughput.\r\nThis explains why one might see half-precision operations even if half precision is not used anywhere in the code.\r\n"
  },
  {
    "path": "README.md",
    "content": "# DocumentSASS\r\nThe instruction sets for NVIDIA GPUs have a very sparse [official documentation](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html).\r\n\r\nOther projects have worked on examining the instructions mainly through reverse-engineering, such as \r\n[MaxAs](https://github.com/NervanaSystems/maxas/), [AsFermi](https://github.com/hyqneuron/asfermi), [CuAssembler](https://github.com/cloudcores/CuAssembler),\r\n[TuringAs](https://github.com/daadaada/turingas), [KeplerAs](https://github.com/PAA-NCIC/PPoPP2017_artifact), [Decuda](https://github.com/laanwj/decuda), and the papers [Dissecting the NVidia Turing T4 GPU](https://arxiv.org/abs/1903.07486), [Optimizing Batched Winograd Convolution on GPUs](https://www.cse.ust.hk/~weiwa/papers/yan-ppopp20.pdf).\r\n\r\n\r\nSince the instructions and architecture changes from generation to generation, it is an uphill battle.<br>\r\n**What if a description of the instruction encoding could be found within the tools provided by NVIDIA?**<br>\r\n**What if the instruction latencies could be found inside these as well?**<br>\r\n\r\n\r\nThe answer is **of course they can.** Otherwise the compiler would do a poor job scheduling instructions. Furthermore, for SASS, it appears that fixed-latency instructions have the number of stall cycles hard-coded into them [[src](https://arxiv.org/pdf/1903.07486.pdf)]. It is just a question of finding where this data is hidden.\r\n\r\nIt turns out that an extensive description of SASS instructions as well as latencies was contained in two specific strings in `nvdisasm`. Instead of having to write micro-benchmarks to find latencies, or use reverse engineering to make an assembler, one could in theory just consult these files. [Instruction scheduling](https://en.wikipedia.org/wiki/Instruction_scheduling) info is given in the latencies file, with the minimum time for fixed-latency ops. essentially being the latency. 
See [NOTES](NOTES.md).\r\n\r\nFor some additional, unrelated observations, see [OTHER](OTHER.md).\r\n\r\n\r\n## How to run\r\nThe easy way is to run [this notebook](https://colab.research.google.com/drive/1qjdpjCgozg-yKfW_u9lJfHuxOu0NrnGG) in Google Colab. No requirements.\r\n\r\nRequirements to run locally: Linux, Python 3, CUDA Toolkit. Run `make` to generate the raw files describing instructions and latencies. Be sure to change the paths at the beginning of the Makefile if they are different on your system. Tested with CUDA 11.6.\r\n\r\n## How it works\r\n1. `nvcc` is used to compile example.cu to .cubin binaries for a list of architectures.\r\n2. `cc` is used to compile intercept.c to a .so library that serves as a [man-in-the-middle](https://www.thegeekstuff.com/2012/03/reverse-engineering-tools/) for data from memcpy calls.\r\n3. `nvdisasm` is run on each binary file with `intercept.so` preloaded via `LD_PRELOAD`, so that the data of every memcpy call is written to stdout.\r\n4. The result is filtered with `strings` to only get text, and then the script `funnel.py` gathers the relevant portions and writes them to files.\r\n\r\nAn initial approach was to simply run `strings nvdisasm` to get text embedded in the executable, but it turned out the relevant strings were dynamically generated (and only for the input architecture), which is why this solution is needed.\r\n\r\n## TODO\r\n- It appears the instruction string may be slightly corrupted for compute capability 3.5 currently.\r\n"
  },
  {
    "path": "example.cu",
    "content": "__global__ void kernel(float *out) {\r\n    out[0] = 0;\r\n}\r\n"
  },
  {
    "path": "funnel.py",
    "content": "import collections\r\nimport re\r\nimport sys\r\nfrom collections import *\r\n\r\n# We used delimiters like the one below to separate different outputs.\r\nexample = \"<0x0 0x0 0>\"\r\n# The first value is the destination pointer, the second the src, and the third the size in bytes.\r\ndelimiter = \"<0x[\\da-f]+ 0x[\\da-f]+ [\\d]+>\"\r\n\r\npattern = re.compile(delimiter)\r\nassert pattern.match(example)\r\n\r\ndef getstring(data, src, firstline=False, unique=False):\r\n    # Put unique = true to remove duplicate lines from string. Mor or may not work.\r\n    collect = False\r\n    partial = []\r\n    full = collections.OrderedDict() if unique else []\r\n    for line in data.splitlines() + [example]: # <- we add an empty token to push out last line.\r\n        if pattern.match(line):  # Start of a string.\r\n            collect = line.split()[1] == src  # The code matches. Start collecting pieces of the string.\r\n            if unique:\r\n                full['\\n'.join(partial)] = None\r\n            else:\r\n                full.append('\\n'.join(partial))\r\n            partial.clear()\r\n        elif collect:\r\n            if firstline:\r\n                return line\r\n            partial.append(line)\r\n\r\n\r\n    # Collect the partial results. 
Join them without newlines.\r\n    return ''.join(full)\r\n\r\n\r\ndef getkey(entry):\r\n    # Unpack \"<a b c>\" into a,b,c\r\n    return entry[1:-1].split()\r\n\r\ndef getfile(data, name):\r\n    # Find the occurrence of a string starting with name, and extract its key.\r\n    match = re.search(delimiter + '\\n' + name, data).group()\r\n    _, src, _ = getkey(match.splitlines()[0])\r\n    # t = True\r\n    #for match in re.findall(delimiter + '\\n' + name, data):\r\n    #    _, src, _ = getkey(match.splitlines()[0])\r\n        # print(src)\r\n        # 0x449e908\r\n        # 0x449a840\r\n    # Then use the key to get the rest of the string.\r\n    return getstring(data, src)\r\n\r\ndef getcounts(data):\r\n    # The basic idea is that a string-buffer will end up having written the whole file.\r\n    # Therefore we can look through the amount of traffic to find the strings of interest.\r\n    keys = map(getkey, pattern.findall(data))\r\n    counts = defaultdict(int)\r\n    for dst, src, size in keys:\r\n        size = int(size)\r\n        # We only really care for the src, as this is the buffer containing the string.\r\n        counts[src] = counts[src] + size\r\n    return counts\r\n\r\ndef getreferences(data):\r\n    # Materialize the keys as a list: they are iterated twice below, and a bare map() would be exhausted after the first pass.\r\n    keys = list(map(getkey, pattern.findall(data)))\r\n    # Something is written a..........b, where a, b is the start and end address.\r\n    # Then something is written: b.......c\r\n    # We would now like to trace this connection a->c starting in c. So the value should be the parent.\r\n    points = {int(entry[0][2:], 16)+int(entry[2]): int(entry[0][2:], 16) for entry in keys}\r\n    points_rev = {int(entry[0][2:], 16):int(entry[0][2:], 16)+int(entry[2]) for entry in keys}\r\n\r\n    # Now, toplevels are those with no parents, e.g. 
a.\r\n    toplevels = [value for value in points.values() if value not in points]\r\n    bottomlevels = [value for value in points_rev.values() if value not in points_rev]\r\n\r\n\r\n    # Now we can remove all of those\r\n    print(len(bottomlevels))\r\n    #print(len(points))\r\n\r\n    return points\r\n\r\ndef remsuffix(inpt, suffix):\r\n    # https://stackoverflow.com/a/1038845\r\n    if inpt.endswith(suffix):\r\n        inpt = inpt[:-len(suffix)]\r\n    return inpt\r\n\r\nif __name__ == \"__main__\":\r\n    if len(sys.argv) == 1:\r\n        raise Exception('No input file name given.')\r\n\r\n    for fname in sys.argv[1:]:\r\n        with open(fname, \"r\", encoding=\"ascii\") as f:\r\n            data = f.read()\r\n\r\n        fname = remsuffix(remsuffix(fname, '.txt'), '_intercept')\r\n        instructions = getfile(data, 'ARCHITECTURE')[:-2]\r\n        with open(fname + '_instructions.txt', \"w\") as f:\r\n            f.write(instructions)\r\n\r\n        latencies = getfile(data, 'OPERATION SETS')\r\n        with open(fname + '_latencies.txt', \"w\") as f:\r\n            f.write(latencies)\r\n\r\n    # Disabled exploration code: lists all intercepted buffers by total bytes copied (uses the data of the last input file).\r\n    if False:\r\n        threshold = 0  # Threshold at 5 Kb e.g.\r\n        goodones = sorted((x[1], x[0]) for x in getcounts(data).items() if x[1] >= threshold)\r\n        for size, src in reversed(goodones):\r\n            try:\r\n                firstline = getstring(data, src, firstline=True)\r\n                txt = getfile(data, firstline)\r\n                print(size, len(txt), src, firstline)\r\n            except Exception:\r\n                pass\r\n\r\n\r\n        # print(map(lambda x: getstring(x[1]), goodones))\r\n"
  },
  {
    "path": "intercept.c",
    "content": "#define _GNU_SOURCE\r\n#include <stdio.h>\r\n#include <dlfcn.h>\r\n#include <unistd.h>\r\n#include <string.h>\r\n\r\n// https://stackoverflow.com/a/18351147\r\n__attribute__ ((__noinline__))\r\nvoid * get_pc1 () { return __builtin_extract_return_addr(__builtin_return_address(1)); }\r\n\r\n__attribute__ ((__noinline__))\r\nvoid * get_fa1 () { return __builtin_frame_address (1); }\r\n\r\n\r\nvoid * memcpy(void * __restrict dest, const void * __restrict src, size_t num) {\r\n    // https://www.thegeekstuff.com/2012/03/reverse-engineering-tools/\r\n    \r\n    // https://osterlund.xyz/posts/2018-03-12-interceptiong-functions-c.html\r\n    void * (*lmemcpy)(void * __restrict, const void * __restrict, size_t) = dlsym(RTLD_NEXT, \"memcpy\");\r\n    \r\n    // https://stackoverflow.com/a/18351147\r\n    //printf(\"\\n<%p>\\n\", get_pc1());\r\n    //printf(\"\\n<%p>\\n\", get_fa1());\r\n    printf(\"\\n<%p %p %zu>\\n\", dest, src, num);\r\n    // https://stackoverflow.com/a/1716621\r\n    fflush(stdout); // Will now print everything in the stdout buffer\r\n    \r\n    // https://stackoverflow.com/a/15660266\r\n    fwrite(src, sizeof(char), num, stdout);\r\n    fflush(stdout); // Will now print everything in the stdout buffer\r\n    \r\n    return lmemcpy(dest, src, num);\r\n}"
  }
]