Repository: 0xD0GF00D/DocumentSASS
Branch: main
Commit: 5d9acfaef090
Files: 8
Total size: 21.9 KB

Directory structure:
gitextract_0pnoc0m2/

├── LICENSE
├── Makefile
├── NOTES.md
├── OTHER.md
├── README.md
├── example.cu
├── funnel.py
└── intercept.c

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2022 0xD0GF00D

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: Makefile
================================================
# Set up compiler, just compile a .ptx for the lowest arch possible. It does not matter as we are interested in SASS.
CC=cc
NVCC=/usr/local/cuda/bin/nvcc
NVDISASM=/usr/local/cuda/bin/nvdisasm
PYTHON=python3

architectures=sm_50 sm_52 sm_53 sm_60 sm_61 sm_62 sm_70 sm_72 sm_75 sm_80 sm_86 sm_90

targets = $(architectures:=_instructions.txt) $(architectures:=_latencies.txt)

all: $(targets)

clean:
	-rm -f $(targets)

# Generate the SASS versions.
%.cubin: example.cu
	$(NVCC) -o $@ -arch=$(basename $@) -cubin $<

%.so: %.c
	$(CC) -fPIC -shared -o $@ $< -ldl

# Not sure if the OMP things are needed, same with the flushing of stdout. We pipe it through strings to get only readable parts.
%_intercept.txt: %.cubin intercept.so
	OMP_NUM_THREADS=1 OMP_THREAD_LIMIT=1 LD_PRELOAD=./intercept.so $(NVDISASM) $< | strings -n 1 > $@

%_instructions.txt %_latencies.txt: %_intercept.txt
	$(PYTHON) funnel.py $<


================================================
FILE: NOTES.md
================================================

# Interpretations
The resulting data is clearly meant for some program to parse, so it is not as immediately understandable as I would like.
Depending on how much time I have, I'll work on breaking down the data into a more understandable format.


Since SASS is undocumented, we can only really do interpretations of their function, unless we validate for every case (a daunting task).<br>
The grave accent and <code>@</code> symbols appear to be operators to do with table lookups.


## Instructions file
When, for example, `Predicate(PT):Pg` appears, it means initialization of type Predicate with default value PT (true), stored in variable Pg.
A leading `/` means the field is optional, and maybe also that it must be displayed with a dot, e.g. `FFMA.SAT`. This is often coupled with quotation marks, meaning that it must be displayed as a string, even when the table value is numeric, e.g. `SAT "nosat"=0 , "SAT"=1;` for `/SAT("nosat"):sat`. The table is probably numeric because that gives a direct mapping from the bits to the string in the table.
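
To illustrate the table form quoted above, here is a minimal Python sketch that parses one table line into a name and a string-to-value mapping. The regular expression and the helper name `parse_table` are my own assumptions; the real file may contain variants (hex values, unquoted names) that this does not handle.

```python
import re

# Hypothetical helper: parse a table line such as  SAT "nosat"=0 , "SAT"=1;
# Assumes quoted names and decimal values, as in the single example quoted above.
ENTRY = re.compile(r'"([^"]*)"\s*=\s*(\d+)')

def parse_table(line):
    # The table name is everything before the first space.
    name, _, rest = line.strip().partition(" ")
    return name, {label: int(value) for label, value in ENTRY.findall(rest)}

name, table = parse_table('SAT "nosat"=0 , "SAT"=1;')
# Invert the mapping to go from the encoded bits to the displayed string.
by_value = {value: label for label, value in table.items()}
print(name, table, by_value[1])
```

Inverting the dictionary mirrors the guess that the numeric values map bits directly to display strings.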

Another example is `BITSET(6/0x0000):`, meaning a field of 6 bits (probably displayed as a bitset).
Similarly `UImm(3/0x7):src_rel_sb` probably means unsigned integer immediate value of 3 bits, with default value 7 (meaning no barrier).
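
The declaration forms discussed so far can be taken apart with a small parser. This is only a sketch of my reading of the notation: the field names (`optional`, `width`, `default`, `var`) and the function `parse_decl` are hypothetical, and the interpretation of `bits/default` follows the guesses above rather than any documentation.

```python
import re

# Hypothetical pattern: optional "/", a type name, "(argument)", ":", optional variable.
DECL = re.compile(r'(?P<optional>/?)(?P<type>\w+)\((?P<arg>[^)]*)\):(?P<var>\w*)')

def parse_decl(text):
    m = DECL.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognized declaration: {text!r}")
    arg = m.group("arg")
    width, default = None, arg
    if "/" in arg:
        # Assumed "bits/default" form, e.g. 3/0x7.
        bits, default = arg.split("/", 1)
        width = int(bits)
    return {
        "optional": m.group("optional") == "/",
        "type": m.group("type"),
        "width": width,
        "default": default.strip('"'),
        "var": m.group("var") or None,
    }

for example in ['Predicate(PT):Pg', '/SAT("nosat"):sat',
                'BITSET(6/0x0000):', 'UImm(3/0x7):src_rel_sb']:
    print(parse_decl(example))
```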

Variable latency operations appear to use the `VarLatOperandEnc(src_rel_sb)` table value as well as a `VIRTUAL QUEUE`. This is very important information, as the queues tell us which operations can be done without affecting each other (different queues).
For fixed-latency operations there is no variable `src_rel_sb`, and the encoding for these bits is specified to have the value `*7`, probably literal 7. I noticed there is an exception of EXIT, where it just says `7`.


The binary code for the instruction is given in the OPCODES field. The leading bit is not part of the code; it is always 1, probably just to encode the length of the opcode. The encoding is given in the ENCODING field. These contain references to the bit field tables. An instruction is 128 bits, but in the bit field table, the first 64 bits are written last -- which makes sense since then the instruction field is contiguous, and the first bits give the control code.
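
If the leading 1 really is just a length marker, recovering the opcode and its width is a small bit operation. The sketch below assumes the OPCODES value has already been read as an integer; the sample value is made up for illustration and is not taken from the real files.

```python
def decode_opcode(raw):
    """Split a raw OPCODES value into (bit length, opcode) by dropping the leading 1."""
    if raw < 1:
        raise ValueError("expected a value with a leading 1 bit")
    length = raw.bit_length() - 1          # bits that follow the marker bit
    opcode = raw & ((1 << length) - 1)     # mask the marker bit away
    return length, opcode

# Hypothetical value: marker bit 1, followed by the 8-bit code 0b00100011.
length, opcode = decode_opcode(0b1_0010_0011)
print(length, format(opcode, f"0{length}b"))
```

Keeping the length matters because leading zeros in the opcode would otherwise be lost.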

The encoded opex field appears to give the stall cycles, as well as reuse flags, and something called `batch` (between 0 and 6, probably to do with variable latency instructions). A stall of up to 15 is quite simple; stalls all the way up to 27 cycles appear to be supported as well, but using them disables batch functionality (since there are limited bits available). Valid combinations are given in the opex tables.

## Latencies file
The latencies between operations are given in the different tables. Each table is organized with respect to input and output resources, called connectors.
Latency for variable-latency operations does not necessarily reflect the actual latency, only a lower bound possibly.


I am not sure how to interpret latencies above 15, since 15 is the highest possible value in the latencies field of the instructions.
It seems like they are correct though, those that are given, but they should be interpreted carefully. <br>

For example, take HMMA.16816.F16 (half-precision tensor-core accelerated matrix multiply): if these are chained so that the input register of one is the output register of the previous one, the stated latency is 28. By micro-benchmarking, I found the latency to be 24 cycles. Looking at the SASS, it appears that NOP operations are inserted between them. NOP is classified as a MIO_OP, essentially because it appears NOP is used specifically to resolve some data hazards for variable-latency instructions. The correct table value is then (HMMA_OP, MIO_FAST_OPS) = 22! Our observed value is 24 because one cycle is spent executing NOP, and one cycle is lost somewhere else. I wonder where?

Note that the measured time of HMMA.16816.F32 was 33-34 cycles, which is not in the latencies table. Maybe this is because it is above 2*15, or maybe because it is greater than 27, which is the largest value of the usched table in the instructions file; 28 is the greatest latency value in the latencies file. The latency for DMMA.884 was measured to be 38-39 cycles (and as high as 177 with multiple threads), which is quite far from (DMMA_OP, DMMA_OP) = 25 or (DMMA_OP, MIO_FAST_OPS) = 23.
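
To keep the comparison between table values and measurements straight, the numbers quoted above can be put side by side. This sketch only restates figures from these notes; the dictionary layout and the helper `gap` are my own.

```python
# Table values quoted in the notes above (latencies file, in cycles).
table = {
    ("HMMA_OP", "MIO_FAST_OPS"): 22,
    ("DMMA_OP", "DMMA_OP"): 25,
    ("DMMA_OP", "MIO_FAST_OPS"): 23,
}

# Micro-benchmark results from the notes, kept as (low, high) ranges.
measured = {
    "HMMA.16816.F16": (24, 24),
    "DMMA.884": (38, 39),
}

def gap(measured_range, table_value):
    """Cycles by which the measurement exceeds the table value."""
    low, high = measured_range
    return low - table_value, high - table_value

hmma_gap = gap(measured["HMMA.16816.F16"], table[("HMMA_OP", "MIO_FAST_OPS")])
dmma_gap = gap(measured["DMMA.884"], table[("DMMA_OP", "DMMA_OP")])
print(hmma_gap, dmma_gap)
```

The HMMA gap of two cycles is the one NOP cycle plus the one unexplained cycle; the DMMA gap is much larger under either table entry.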

Another example is with the variable-latency MUFU operation. It is also a MIO_FAST_OP, but unfortunately the latency for chaining these is not given. In terms of scheduling this is not an issue, as there are no hazards (as long as appropriate variable-latency barriers are set). I measured MUFU.SQRT to be 17 cycles (sm86), which is also the time of MUFU.RSQ (reciprocal square root), MUFU.LG2 (2-logarithm) and MUFU.EX2 (2-exponential). The first case is most interesting, since one might expect MUFU.SQRT to have the latency of MUFU.RSQ + MUFU.RCP (reciprocal).


Findings:
1. Truly fixed latencies with times over 15 cycles appear to be implemented by inserting NOP instructions. These NOPs probably either use barriers, or maybe just use the truly simple method of fixed stall cycles (enough to cover the remaining cycles over 15).
2. Truly variable latencies are not specified in the latencies file. They must be resolved with barriers and the like. This is probably because they involve shared resources.
3. The time before a predicate can be used appears to be 13 cycles (sm86) in many cases, but it seems it can be as low as 5 cycles in the common case of 32-bit float operations (see table for specific connectors).
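
Finding 1 can be pictured as splitting a required latency into stall counts that each fit the 15-cycle limit. The following is a speculative sketch of the "fixed stall cycles" variant only: `split_stall` is a hypothetical helper, and it models neither barriers nor the cycle a NOP itself takes to execute.

```python
def split_stall(total, max_stall=15):
    """Split a required latency into stall counts of at most max_stall each.

    The first entry would sit on the producing instruction; every further
    entry is assumed to be carried by an inserted NOP.
    """
    if total < 0:
        raise ValueError("latency must be non-negative")
    stalls = []
    remaining = total
    while remaining > max_stall:
        stalls.append(max_stall)
        remaining -= max_stall
    stalls.append(remaining)
    return stalls

print(split_stall(22))  # one instruction stall plus one NOP
print(split_stall(5))   # fits without any NOP
```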

## Redundancies
The unique files are found by `md5sum * | grep ".txt" | sort | uniq -w33 | awk '{print $2}' | sort`
<pre>
sm_35_instructions.txt
sm_35_latencies.txt
sm_37_instructions.txt
sm_50_instructions.txt
sm_50_latencies.txt
sm_52_instructions.txt
sm_52_latencies.txt
sm_60_instructions.txt
sm_60_latencies.txt
sm_61_instructions.txt
sm_61_latencies.txt
sm_70_instructions.txt
sm_70_latencies.txt
sm_72_instructions.txt
sm_72_latencies.txt
sm_75_instructions.txt
sm_75_latencies.txt
sm_80_instructions.txt
sm_80_latencies.txt
sm_86_instructions.txt
sm_86_latencies.txt
</pre>
This shows that some parts were unchanged between different compute capabilities. These files are 32.6 MB.<br>
The size of the .data and .rodata segment in nvdisasm is 30.9 + 0.17 = 31.07 MB. So some decompression is definitely done when running the compiler, which explains why the text files are not easily discoverable.
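
The shell pipeline above can also be written in Python, which avoids depending on `uniq -w33`. This is a rough equivalent rather than the command that was actually used: it keeps the alphabetically first file name for each distinct MD5 digest, and the function name `unique_files` is my own.

```python
import hashlib
from pathlib import Path

def unique_files(directory, pattern="*.txt"):
    """Return one file name per distinct content, keeping the first name in sort order."""
    first_by_digest = {}
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        # setdefault keeps the earliest name seen for each digest.
        first_by_digest.setdefault(digest, path.name)
    return sorted(first_by_digest.values())

if __name__ == "__main__":
    for name in unique_files("."):
        print(name)
```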


================================================
FILE: OTHER.md
================================================
## Essential resources
1. [MaxAs](https://github.com/NervanaSystems/maxas) is a reverse-engineered assembler for the Maxwell architecture (Compute Capability 5.2), by [Scott Gray](https://forums.developer.nvidia.com/u/scottgray/summary). The story is he needed a fast GEMM implementation, but the public CUDA tools do not allow for precise editing of SASS assembly. 80% of the theoretical throughput could be attained in PTX, and only with hand-written assembly could ~98% be reached.
2. [Dissecting the NVidia Turing T4 GPU via Microbenchmarking](https://arxiv.org/abs/1903.07486). This paper uses micro-benchmarks to estimate the undocumented latencies of instructions for Compute Capabilities 3.0-7.5. It also provides a reverse-engineered description of the instruction encoding for these generations.
3. [RTX ON: The NVIDIA Turing GPU Architecture](https://old.hotchips.org/hc31/HC31_2.12_NVIDIA_final.pdf) [[video](https://www.youtube.com/watch?v=IjxpMZUqu6c), [paper](https://ieeexplore.ieee.org/document/8981896)] slides for a talk describing the improved throughput on Turing. Shows how dispatch happens. Appears to confirm the uniform datapath consists of a separate unit and register file. But it is still unclear to me what the interaction between MUFU and MIO is (why MUFU use can cause [MIO stall](https://docs.nvidia.com/nsight-compute/ProfilingGuide/#statistical-sampler)).
4. [L1TEX usage and L2 below](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#memory-tables-l1) from the profiling guide. Shows more in-depth how the L1Tex unit works on GA100 (Compute Capability 8.0).

See the diagram in (2) for memory hierarchy and the diagram in (3) for throughput, of Compute Capability 7.5.


## Profiling
Metrics with "sass_" in the name work by [patching the kernel](https://forums.developer.nvidia.com/t/difference-between-thread-inst-executed-metrics/217587), and therefore come with a greater overhead compared to hardware counters.

## Compiler arguments
Some apparently undocumented arguments for the compiler ptxas.
<pre>
--compiler-stats (-compilerStats) <t/m/p>
    Prints out compiler statistics.
    time/t       : Prints compilation time.
    memory/m     : Prints peak memory usage.
    phase-wise/p : Prints the above data for various compiler phases.

--no-fastreg
    Disable fast register allocation.
No idea if this is good or bad in terms of performance. Very curious to try it out.

--opt-pointers (-Op)
    Optimize 64-bit pointers by truncating them to 32-bit
Another option that could impact performance.

--tool-name <string>
    Change tool name to specified string.
Appears to just change the string "ptxas" to something else in the help page.

--fastimul
    Enable 24 bit integer multiplication
I think this is only relevant for older GPU generations. For some reason 24-bit integer mult. was faster then.

--fast-compile (-fc)
    EXPERIMENTAL FEATURE: Enable optimization strategies that improve compilation time while reducing runtime performance

--cuda-api-version <major>.<minor>
    CUDA API version to use to for compilation

--sw2614554
--sw1729687
--sw200428197
--sw200387803
These just say "Enable sw_______". No idea what it does.

--key (-k)
    Hash value representing the device code from which the binaries were compiled
--okey (-ok)
    Deobfuscation key for specified ptx input
--ptx-length (-ptxlen)
    Length in bytes of obfuscated ptx string

Some arguments that appear in the data, but do not seem to work, are "-dump-perf-stats" and "-forcetext".
</pre>

## Half-precision MOV
Some interesting things happen when you compile PTX for SM 86 and then use it on an SM 86 device. Since the PTX is a newer version, newer features are used, which are not supported on the default, SM 52.
<pre>
The line of code (variable names simplified):
float sign_value = x > y ? 1 : -1;


Becomes partly the SASS:
      HFMA2.MMA R16, -RZ, RZ, 1.875, 0
      FSEL R6, R16, -1, P1
</pre>
The utility of the RZ register is to encode literal float zero using only 8 bits instead of e.g. 32 for a float, or 24 for an immediate float value.

Not sure what the .MMA part is, but HFMA2 is half2 vector fused multiply-add.<br>
In half precision we have 1.875 = 0011111110000000, and 0 is just 16 zeros. The result is that the half2 operation adding {1.875, 0} to {0, 0} * {0, 0} (-RZ * RZ) gives [the float representation of 1](https://evanw.github.io/float-toy/)! This float "1" then gets stored in R16.
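
The bit trick is easy to check in Python with the `struct` module, which supports IEEE 754 half precision through the `e` format. The sketch assumes that the first immediate (1.875) lands in the upper 16 bits of the 32-bit register and the second (0) in the lower 16 bits; that ordering is my assumption about how the half2 pair is laid out.

```python
import struct

# Assumption: 1.875 occupies the upper half of the register, 0 the lower half.
high = struct.pack("<e", 1.875)   # IEEE 754 binary16, little-endian bytes
low = struct.pack("<e", 0.0)

# In little-endian memory order the low half comes first.
(as_float,) = struct.unpack("<f", low + high)
(as_bits,) = struct.unpack("<I", low + high)

print(high[::-1].hex(), hex(as_bits), as_float)
```

The half-precision pattern 0x3f80 followed by sixteen zero bits is exactly the 32-bit pattern of the float 1.0.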

Now there is the MOV instruction, which could do the same. I have no idea which unit does MOV, but it is clearly not the half-precision unit; therefore using the otherwise unused half-precision unit gives a higher throughput.
This explains why one might see half-precision operations even if half precision is not used anywhere in the code.


================================================
FILE: README.md
================================================
# DocumentSASS
The instruction sets for NVIDIA GPUs have a very sparse [official documentation](https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html).

Other projects have worked on examining the instructions mainly through reverse-engineering, such as 
[MaxAs](https://github.com/NervanaSystems/maxas/), [AsFermi](https://github.com/hyqneuron/asfermi), [CuAssembler](https://github.com/cloudcores/CuAssembler),
[TuringAs](https://github.com/daadaada/turingas), [KeplerAs](https://github.com/PAA-NCIC/PPoPP2017_artifact), [Decuda](https://github.com/laanwj/decuda), and the papers [Dissecting the NVidia Turing T4 GPU](https://arxiv.org/abs/1903.07486), [Optimizing Batched Winograd Convolution on GPUs](https://www.cse.ust.hk/~weiwa/papers/yan-ppopp20.pdf).


Since the instructions and architecture change from generation to generation, it is an uphill battle.<br>
**What if a description of the instruction encoding could be found within the tools provided by NVIDIA?**<br>
**What if the instruction latencies could be found inside these as well?**<br>


The answer is **of course they can.** Otherwise the compiler would do a poor job scheduling instructions. Furthermore, for SASS, it appears that fixed-latency instructions have the number of stall cycles hard-coded into them [[src](https://arxiv.org/pdf/1903.07486.pdf)]. It is just a question of finding where this data is hidden.

It turns out that an extensive description of SASS instructions as well as latencies was contained in two specific strings in `nvdisasm`. Instead of having to write micro-benchmarks to find latencies, or use reverse engineering to make an assembler, one could in theory just consult these files. [Instruction scheduling](https://en.wikipedia.org/wiki/Instruction_scheduling) info is given in the latencies file, with the minimum time for fixed-latency ops. essentially being the latency. See [NOTES](NOTES.md).

For some additional, unrelated observations, see [OTHER](OTHER.md).


## How to run
The easy way is by simply running [this notebook](https://colab.research.google.com/drive/1qjdpjCgozg-yKfW_u9lJfHuxOu0NrnGG) in Google Colab. No requirements.

Requirements to run locally: Linux, Python 3, CUDA Toolkit. Run `make` to generate the raw files describing instructions and latencies. Be sure to change the paths at the beginning of the Makefile if they are different on your system. Tested with CUDA 11.6.

## How it works
1. `nvcc` is used to compile example.cu to .cubin binaries for a list of architectures.
2. `cc` is used to compile intercept.c to a .so library that serves as a [man-in-the-middle](https://www.thegeekstuff.com/2012/03/reverse-engineering-tools/) for data from memcpy calls.
3. We intercept `nvdisasm` applied on each binary file using `intercept.so`.
4. The result is filtered with `strings` to only get text, and then the script `funnel.py` gathers the relevant portions and writes them to files.
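
The reassembly in step 4 can be sketched on synthetic data. Each intercepted `memcpy` is logged as a `<dest src size>` header followed by the copied bytes, and chunks that share a source pointer are concatenated. The sample text and the helper `chunks_by_source` below are made up for illustration; `funnel.py` is the actual implementation and differs in detail.

```python
import re

# Header printed by intercept.c before each payload: "<dest src size>".
HEADER = re.compile(r"<(0x[0-9a-f]+) (0x[0-9a-f]+) (\d+)>")

def chunks_by_source(text):
    """Concatenate payload chunks that were copied from the same source pointer."""
    groups = {}
    current = None
    for line in text.splitlines():
        m = HEADER.fullmatch(line)
        if m:
            current = m.group(2)                     # source pointer of this chunk
            groups.setdefault(current, []).append([])
        elif current is not None:
            groups[current][-1].append(line)
    # Lines inside a chunk keep their newlines; chunks are joined directly.
    return {src: "".join("\n".join(chunk) for chunk in chunks)
            for src, chunks in groups.items()}

# Synthetic log: two chunks from 0xaaaa with unrelated traffic in between.
sample = "\n".join([
    "<0x1000 0xaaaa 12>", "ARCHITECTURE",
    "<0x2000 0xbbbb 5>", "noise",
    "<0x100c 0xaaaa 5>", " demo",
])
result = chunks_by_source(sample)
print(result["0xaaaa"])
```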

An initial approach was to simply run `strings nvdisasm` to get text embedded in the executable, but it turned out the relevant strings were dynamically generated (and only for the input architecture), which is why this solution is needed.

## TODO
- It appears the instruction string may be slightly corrupted for compute capability 3.5 currently.


================================================
FILE: example.cu
================================================
__global__ void kernel(float *out) {
    out[0] = 0;
}


================================================
FILE: funnel.py
================================================
import collections
import re
import sys
from collections import defaultdict

# We used delimiters like the one below to separate different outputs.
example = "<0x0 0x0 0>"
# The first value is the destination pointer, the second the src, and the third the size in bytes.
delimiter = r"<0x[\da-f]+ 0x[\da-f]+ [\d]+>"  # Raw string, so that \d is not treated as an invalid escape.

pattern = re.compile(delimiter)
assert pattern.match(example)

def getstring(data, src, firstline=False, unique=False):
    # Set unique=True to remove duplicate chunks from the string. May or may not work.
    collect = False
    partial = []
    full = collections.OrderedDict() if unique else []
    for line in data.splitlines() + [example]: # <- we append a dummy delimiter to flush the last chunk.
        if pattern.match(line):  # Start of a string.
            collect = line.split()[1] == src  # The code matches. Start collecting pieces of the string.
            if unique:
                full['\n'.join(partial)] = None
            else:
                full.append('\n'.join(partial))
            partial.clear()
        elif collect:
            if firstline:
                return line
            partial.append(line)


    # Collect the partial results and join them directly, without newlines.
    return ''.join(full)


def getkey(entry):
    # Unpack "<a b c>" into a,b,c
    return entry[1:-1].split()

def getfile(data, name):
    # Find the occurrence of a string starting with name, and extract the key.
    # The name is escaped so that it is matched literally, not as a regex.
    match = re.search(delimiter + '\n' + re.escape(name), data).group()
    _, src, _ = getkey(match.splitlines()[0])
    # t = True
    #for match in re.findall(delimiter + '\n' + name, data):
    #    _, src, _ = getkey(match.splitlines()[0])
        # print(src)
        # 0x449e908
        # 0x449a840
        # Then use the key to get the rest of the string.
    return getstring(data, src)

def getcounts(data):
    # The basic idea is that a string-buffer will end up having written the whole file.
    # Therefore we can look through the amount of traffic to find the strings of interest.
    keys = map(getkey, pattern.findall(data))
    counts = defaultdict(int)
    for dst, src, size in keys:
        size = int(size)
        # We only really care for the src, as this is the buffer containing the string.
        counts[src] = counts[src] + size
    return counts

def getreferences(data):
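    # Materialize as a list: keys is iterated twice below, and a map object would be exhausted after the first pass.
    keys = list(map(getkey, pattern.findall(data)))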
    # Something is written a..........b, where a, b is the start and end address.
    # Then something is written: b.......c
    # We would now like to trace this connection a->c starting in c. So the value should be the parent.
    points = {int(entry[0][2:], 16)+int(entry[2]): int(entry[0][2:], 16) for entry in keys}
    points_rev = {int(entry[0][2:], 16):int(entry[0][2:], 16)+int(entry[2]) for entry in keys}

    # Now, toplevels are those with no parents, e.g. a.
    toplevels = [value for value in points.values() if value not in points]
    bottomlevels = [value for value in points_rev.values() if value not in points_rev]


    # Now we can remove all of those
    print(len(bottomlevels))
    #print(len(points))

    return points

def remsuffix(inpt, suffix):
    # https://stackoverflow.com/a/1038845
    if inpt.endswith(suffix):
        inpt = inpt[:-len(suffix)]
    return inpt

if __name__ == "__main__":
    if len(sys.argv) == 1:
        raise Exception('No input file name given.')

    for fname in sys.argv[1:]:
        with open(fname, "r", encoding="ascii") as f:
            data = f.read()

        fname = remsuffix(remsuffix(fname, '.txt'), '_intercept')
        instructions = getfile(data, 'ARCHITECTURE')[:-2]
        with open(fname + '_instructions.txt', "w") as f:
            f.write(instructions)

        latencies = getfile(data, 'OPERATION SETS')
        with open(fname + '_latencies.txt', "w") as f:
            f.write(latencies)

    if False:
        threshold = 0  # Threshold at 5 Kb e.g.
        goodones = sorted((x[1], x[0]) for x in getcounts(data).items() if x[1] >= threshold)
        for size, src in reversed(goodones):
            try:
                firstline = getstring(data, src, firstline=True)
                txt = getfile(data, firstline)
                print(size, len(txt), src, firstline)
            except Exception:  # "except e:" would raise NameError, since e is undefined.
                pass


        # print(map(lambda x: getstring(x[1]), goodones))


================================================
FILE: intercept.c
================================================
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>
#include <unistd.h>
#include <string.h>

// https://stackoverflow.com/a/18351147
__attribute__ ((__noinline__))
void * get_pc1 () { return __builtin_extract_return_addr(__builtin_return_address(1)); }

__attribute__ ((__noinline__))
void * get_fa1 () { return __builtin_frame_address (1); }


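// Interposed memcpy: print a "<dest src size>" header and the raw source bytes to stdout,
// then forward the call to the real memcpy found with dlsym(RTLD_NEXT, ...).
void * memcpy(void * __restrict dest, const void * __restrict src, size_t num) {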
    // https://www.thegeekstuff.com/2012/03/reverse-engineering-tools/
    
    // https://osterlund.xyz/posts/2018-03-12-interceptiong-functions-c.html
    void * (*lmemcpy)(void * __restrict, const void * __restrict, size_t) = dlsym(RTLD_NEXT, "memcpy");
    
    // https://stackoverflow.com/a/18351147
    //printf("\n<%p>\n", get_pc1());
    //printf("\n<%p>\n", get_fa1());
    printf("\n<%p %p %zu>\n", dest, src, num);
    // https://stackoverflow.com/a/1716621
    fflush(stdout); // Will now print everything in the stdout buffer
    
    // https://stackoverflow.com/a/15660266
    fwrite(src, sizeof(char), num, stdout);
    fflush(stdout); // Will now print everything in the stdout buffer
    
    return lmemcpy(dest, src, num);
}